R-CNN Family: 20 Essential Q&A

From selective search to RPN and feature pyramids—the two-stage detector story.

~12 min read 20 questions Advanced

RPN, RoI, FPN, Cascade

1 Original R-CNN steps? 📊 medium

Answer: Propose ~2k regions (selective search) → warp each to a fixed size → CNN features → per-class SVMs + bbox regressor. No computation is shared across regions, so it is very slow.

2 Main bottleneck? ⚡ easy

Answer: Running the CNN forward pass thousands of times per image on warped crops; early implementations also cached features to disk for SVM and regressor training, which added storage and I/O cost.

3 What did Fast R-CNN fix? 📊 medium

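Answer: Run the CNN once on the full image; project RoIs onto the feature map → RoI pool to a fixed size → classification and box heads. Big speedup, and the backbone and heads train end-to-end with one multi-task loss (proposals still come from selective search).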
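
The projection step can be sketched as a division by the backbone's total stride. This is a minimal illustration, assuming a stride of 16 (typical for VGG-16 conv5 features); the name project_roi is hypothetical.

```python
def project_roi(box, stride=16):
    """Map an image-space box (x1, y1, x2, y2) to feature-map coordinates.

    Assumption: the feature map is downsampled by `stride` relative to the image.
    """
    return tuple(c / stride for c in box)


roi_on_feat = project_roi((64, 32, 320, 256))
print(roi_on_feat)  # (4.0, 2.0, 20.0, 16.0)
```

The result is generally fractional, which is why the pooling step that follows has to decide how to handle non-integer coordinates.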

4 How RoI pooling works? 📊 medium

Answer: Divide each RoI on the feature map into an H×W grid of bins and max-pool each bin, giving a fixed H×W output; rounding RoI and bin boundaries to integer cells loses subpixel alignment.
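
A minimal pure-Python sketch of RoI max pooling on a single-channel feature map follows. The function name roi_max_pool, the inclusive-end coordinate convention, and the 2×2 output size are assumptions chosen for illustration, not any particular library's behavior.

```python
import math


def roi_max_pool(feat, roi, out_h=2, out_w=2):
    """Max-pool one RoI of a 2D feature map (list of lists) to out_h x out_w.

    roi = (x1, y1, x2, y2) in feature-map coordinates. Coordinates are rounded
    to integers first: this is the quantization that loses subpixel alignment.
    Assumption: x2/y2 are inclusive cell indices after rounding.
    """
    x1, y1, x2, y2 = (int(round(v)) for v in roi)
    roi_h = max(y2 - y1 + 1, 1)
    roi_w = max(x2 - x1 + 1, 1)
    out = []
    for i in range(out_h):
        # Bin edges snap to whole cells (a second quantization step).
        ys = y1 + math.floor(i * roi_h / out_h)
        ye = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            xs = x1 + math.floor(j * roi_w / out_w)
            xe = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feat[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        out.append(row)
    return out


feat = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 map, values 0..15
print(roi_max_pool(feat, (0, 0, 3, 3)))          # [[5, 7], [13, 15]]
print(roi_max_pool(feat, (0.4, 0.4, 2.6, 2.6)))  # same output: rounding hides the offset
```

The second call shows the alignment problem: a noticeably different RoI produces an identical output once its coordinates are rounded.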

5 What is Faster R-CNN? 🔥 hard

Answer: Replaces selective search with RPN that shares full-image conv features—learned proposals, joint training with detector.

6 What does the RPN output? 🔥 hard

Answer: For each anchor at every feature-map location: an objectness score and 4 box deltas that refine the anchor; the top-scoring refined boxes are passed to the RoI head as proposals.
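
How the deltas refine an anchor can be sketched with the standard R-CNN box parameterization (center shifts scaled by anchor size, log-space width and height). The function names are hypothetical, and using a sigmoid on a single objectness logit is an assumption; the original paper used a two-class softmax.

```python
import math


def decode_deltas(anchor, deltas):
    """Apply (dx, dy, dw, dh) to an anchor given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = anchor
    dx, dy, dw, dh = deltas
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + 0.5 * aw, ay1 + 0.5 * ah
    cx, cy = acx + dx * aw, acy + dy * ah        # shift center by a fraction of anchor size
    w, h = aw * math.exp(dw), ah * math.exp(dh)  # scale width/height in log space
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def objectness(logit):
    """Sigmoid of the objectness logit (assumed single-logit formulation)."""
    return 1.0 / (1.0 + math.exp(-logit))


print(decode_deltas((0, 0, 10, 10), (0, 0, 0, 0)))  # (0.0, 0.0, 10.0, 10.0)
print(objectness(0.0))                              # 0.5
```

Zero deltas return the anchor unchanged, so the network only has to learn corrections relative to each anchor.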

7 Anchor scales/aspect ratios? 📊 medium

Answer: Multiple templates per location cover different object shapes; k anchors per cell → many candidate boxes before filtering by score + NMS.
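
Anchor generation can be sketched as below. The scales (128, 256, 512) and ratios (1:2, 1:1, 2:1) match the 9-anchor setup described in the Faster R-CNN paper; the stride of 16, the cell-center placement, and the name make_anchors are assumptions, and implementations differ on these details.

```python
import math


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors per feature-map cell.

    Each anchor is (x1, y1, x2, y2) in image coordinates; ratio = height / width,
    and every anchor of a given scale keeps area scale**2.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Assumption: anchors are centered on the cell center.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(2, 3)
print(len(anchors))  # 54 = 2 * 3 cells * 9 anchors
```

Even this tiny 2×3 map yields 54 candidates, which is why score filtering and NMS are needed before the RoI head.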

8 Losses in Faster R-CNN? 🔥 hard

Answer: RPN: binary CE for objectness + smooth L1 for box deltas on assigned anchors; detector head: multi-class CE + bbox regression on positive RoIs.
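
The RPN half of the loss can be sketched in plain Python. This is a simplified illustration: the label convention (1 positive, 0 negative, -1 ignored), the mean normalization of each term, and the names rpn_loss and lam are assumptions rather than the exact formulation of any implementation.

```python
import math


def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear beyond beta."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def binary_ce(logit, label):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def rpn_loss(logits, labels, pred_deltas, target_deltas, lam=1.0):
    """Objectness loss on labeled anchors + box loss on positive anchors only."""
    cls_terms = [binary_ce(z, y) for z, y in zip(logits, labels) if y >= 0]
    reg_terms = [
        sum(smooth_l1(p - t) for p, t in zip(pd, td))
        for pd, td, y in zip(pred_deltas, target_deltas, labels)
        if y == 1
    ]
    cls = sum(cls_terms) / max(len(cls_terms), 1)
    reg = sum(reg_terms) / max(len(reg_terms), 1)
    return cls + lam * reg


print(smooth_l1(0.5), smooth_l1(2.0))  # 0.125 1.5
```

The detector head follows the same pattern, with multi-class cross-entropy in place of the binary term and regression again restricted to positive RoIs.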

9 Why FPN? 🔥 hard

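Answer: A single high-level feature map is semantically strong but too coarse for small objects. FPN builds a top-down pyramid with lateral connections so every scale gets semantically strong features, and each RoI is pooled from the level that matches its size.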
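
Choosing the pyramid level for an RoI can be sketched with the assignment rule from the FPN paper, k = floor(k0 + log2(sqrt(w·h) / 224)) with k0 = 4. Clamping to levels 2 through 5 and the name fpn_level are assumptions for this sketch.

```python
import math


def fpn_level(box, k0=4, canonical=224, k_min=2, k_max=5):
    """Pick the pyramid level for an image-space box (x1, y1, x2, y2).

    Smaller boxes map to finer (lower-numbered) levels, larger boxes to coarser ones.
    """
    x1, y1, x2, y2 = box
    size = math.sqrt(max(x2 - x1, 1e-6) * max(y2 - y1, 1e-6))
    k = math.floor(k0 + math.log2(size / canonical))
    return min(max(k, k_min), k_max)


print(fpn_level((0, 0, 224, 224)))  # 4
print(fpn_level((0, 0, 112, 112)))  # 3
print(fpn_level((0, 0, 32, 32)))    # 2 (clamped)
```

A 32×32 object therefore reads features from the highest-resolution level instead of a coarse map where it would cover only a cell or two.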

10 RoIAlign role? 📊 medium

Answer: Bilinearly samples features at exact (unquantized) RoI locations, avoiding the rounding in RoI pooling; introduced in Mask R-CNN for alignment-sensitive mask prediction.
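
A minimal RoIAlign sketch for a single-channel map is shown below. It assumes that feat[y][x] sits at continuous coordinate (y, x), uses a 2×2 grid of sample points per bin, and averages them; real implementations differ in coordinate offset conventions, and the function names are hypothetical.

```python
def bilinear(feat, y, x):
    """Bilinearly interpolate a 2D list at a fractional (y, x) location."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    ly, lx = y - y0, x - x0
    top = feat[y0][x0] * (1 - lx) + feat[y0][x1] * lx
    bot = feat[y1][x0] * (1 - lx) + feat[y1][x1] * lx
    return top * (1 - ly) + bot * ly


def roi_align(feat, roi, out_h=2, out_w=2, samples=2):
    """Average bilinear samples inside each bin; no coordinate is ever rounded."""
    x1, y1, x2, y2 = roi
    bin_h, bin_w = (y2 - y1) / out_h, (x2 - x1) / out_w
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    yy = y1 + (i + (sy + 0.5) / samples) * bin_h
                    xx = x1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, yy, xx))
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


ramp = [[float(x) for x in range(4)] for _ in range(4)]  # value equals x coordinate
print(roi_align(ramp, (0.5, 0.5, 2.5, 2.5)))  # [[1.0, 2.0], [1.0, 2.0]]
```

On the ramp, each output equals the x coordinate of its bin center, so a half-cell shift of the RoI is reflected in the output instead of being rounded away.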

11 What is Cascade R-CNN? 🔥 hard

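Answer: Sequence of detector stages trained with increasing IoU thresholds for positives (e.g., 0.5, 0.6, 0.7). Each stage refines the boxes and feeds them to the next, so the stricter stages still see enough positives to avoid overfitting; AP improves, most at high IoU.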
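
The effect of rising thresholds on positive/negative labeling can be sketched as follows. The thresholds 0.5, 0.6, 0.7 follow the Cascade R-CNN paper; the name stage_labels is hypothetical, and for simplicity the same box is scored at every stage, whereas a real cascade relabels the box refined by the previous stage.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def stage_labels(box, gt, thresholds=(0.5, 0.6, 0.7)):
    """For each cascade stage, report whether `box` counts as a positive."""
    overlap = iou(box, gt)
    return [overlap >= t for t in thresholds]


gt = (0, 0, 10, 10)
print(stage_labels((0, 0, 10, 6.5), gt))  # [True, True, False] at IoU 0.65
```

A box at IoU 0.65 is a positive for the first two stages only; it becomes useful to the third stage once earlier stages have regressed it closer to the ground truth.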

12 NMS placement? ⚡ easy

Answer: After the RPN (proposal NMS) and again on the final detections, usually per class, to remove duplicates.
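
Greedy NMS can be sketched in a few lines. The 0.5 IoU threshold is an assumed default for illustration (proposal NMS commonly uses a looser value such as 0.7), and for class-specific NMS the function would be called once per class.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with IoU of about 0.68, so it is suppressed; the distant third box survives.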

13 Approximate joint training? 📊 medium

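Answer: The original paper used alternating 4-step training. Modern implementations use approximate joint training: one combined RPN + detector loss on a shared backbone with careful sampling, ignoring gradients with respect to the proposal coordinates.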

14 Two-stage strength? ⚡ easy

Answer: Typically higher mAP than one-stage detectors of a comparable era, especially on challenging datasets; the trade-off is slower inference.

15 Mask R-CNN? 📊 medium

Answer: Adds mask branch to Faster R-CNN with RoIAlign—instance segmentation with modest overhead.

16 Keypoint R-CNN? 📊 medium

Answer: Same framework as Mask R-CNN with a keypoint head that predicts one heatmap per keypoint, trained against a one-hot target mask; used for pose estimation.
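
Reading a keypoint location out of one predicted heatmap can be sketched as a softmax over all cells followed by an argmax. The name decode_keypoint and the 3×3 toy heatmap are hypothetical; real heads use much larger heatmaps and map the result back to image coordinates.

```python
import math


def decode_keypoint(heatmap_logits):
    """Return (row, col, probability) of the most likely cell in one heatmap."""
    flat = [v for row in heatmap_logits for v in row]
    m = max(flat)
    exps = [math.exp(v - m) for v in flat]  # subtract max for numerical stability
    total = sum(exps)
    idx = max(range(len(flat)), key=lambda i: exps[i])
    width = len(heatmap_logits[0])
    return idx // width, idx % width, exps[idx] / total


heatmap = [[0.0, 0.0, 0.0],
           [0.0, 5.0, 0.0],
           [0.0, 0.0, 0.0]]
row, col, prob = decode_keypoint(heatmap)
print(row, col)  # 1 1
```

Training against a one-hot target with a softmax across the whole map pushes the head to commit to a single cell per keypoint.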

17 Deformable conv in detectors? 🔥 hard

Answer: Learns per-location offsets for the conv sampling grid, so the receptive field adapts to object geometry; used in DCN / DCNv2 backbones with Faster R-CNN and R-FCN style detectors.

18 What is HTC? 🔥 hard

Answer: Hybrid Task Cascade—interleaves detection and segmentation stages with feature fusion—strong COCO instance segmentation.

19 DETR vs R-CNN? 📊 medium

Answer: DETR replaces anchors and NMS with a transformer that predicts a set of boxes directly; the pipeline is simpler, but training dynamics and compute differ.

20 When choose two-stage today? ⚡ easy

Answer: When max accuracy matters and latency budget allows, or when building on mature frameworks (Detectron2) with many pretrained configs.

R-CNN Family Cheat Sheet

Evolution

R-CNN → Fast
Faster + RPN

Add-ons

FPN
RoIAlign

Accuracy

Cascade
HTC

💡 Pro tip: Faster R-CNN = shared backbone + RPN proposals + RoI head.
