Object Detection: 20 Interview Questions
Master YOLO, SSD, Faster R-CNN, RetinaNet, DETR, anchor boxes, NMS, mAP, FPN, and one-stage vs two-stage trade-offs. Interview-ready answers with formulas.
1
What is object detection? How is it different from classification and segmentation?
⚡ Easy
Answer: Object detection = localization (bounding box) + classification. Classification assigns a single label to the whole image; segmentation labels every pixel. Detection outputs a variable number of boxes, each with a class label and a confidence score.
Task: "Where are objects and what are they?"
2
What is IoU? How is it calculated?
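⚡ Easy
Answer: IoU = Area of Overlap / Area of Union between the predicted and ground-truth boxes. It lies in [0, 1]. A prediction counts as a true positive when its IoU with a ground-truth box of the same class meets a threshold (typically 0.5); otherwise it is a false positive.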
IoU = |A ∩ B| / |A ∪ B|
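The formula can be sketched in a few lines of plain Python. This is a minimal illustration, assuming axis-aligned boxes in (x1, y1, x2, y2) corner format; the function name `iou` is chosen here for illustration and is not taken from any particular library.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes offset by one unit: overlap area 1, union area 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Subtracting the intersection from the summed areas avoids counting the overlap twice in the union.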
3
Explain mAP (mean Average Precision) for object detection.
🔥 Hard
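Answer: For each class, rank detections by confidence and compute Average Precision (AP), the area under the Precision-Recall curve. mAP is the mean of AP over classes. PASCAL VOC reports AP at IoU 0.5; COCO mAP additionally averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05. It is the standard metric for detection.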
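A rough sketch of the per-class AP computation follows. It assumes each detection has already been marked as a true or false positive at a fixed IoU threshold, and it uses all-point interpolation (the area under the precision envelope); COCO's official evaluation samples the curve at fixed recall points instead. The function names are illustrative.

```python
def average_precision(detections, num_gt):
    """detections: list of (score, is_true_positive); num_gt: ground-truth count."""
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Precision envelope: each point takes the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """ap_per_class: dict mapping class name to its AP."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Ranked as TP, FP, TP with two ground-truth boxes.
print(average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2))
```

For a COCO-style number, the same computation would be repeated at each IoU threshold and the results averaged.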
4
One-stage vs two-stage detectors – trade-offs?
📊 Medium
Answer: Two-stage (Faster R-CNN): region proposals first, then per-region classification and box refinement. Higher accuracy, slower. One-stage (YOLO, SSD): dense prediction in a single pass, so faster, but historically less accurate. RetinaNet narrows the accuracy gap with Focal Loss.
Two-stage: accurate, R-CNN family
One-stage: fast, YOLO/SSD
5
What are anchor boxes? How are they designed?
🔥 Hard
Answer: Predefined bounding boxes of various scales/aspect ratios placed at each spatial location. Model predicts offsets to refine anchors. Designed via k-means on training set (YOLOv2) or hand-picked (Faster R-CNN: 3 scales × 3 ratios).
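The hand-picked 3 scales × 3 ratios layout can be sketched as below. The scales (128, 256, 512) and ratios (1:2, 1:1, 2:1) are assumed defaults for illustration, and `anchors_at` is a hypothetical helper name; each anchor keeps an area of scale² while its aspect ratio varies.

```python
import math


def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return 9 anchors (x1, y1, x2, y2) centred at (cx, cy)."""
    boxes = []
    for s in scales:
        for r in ratios:  # r is height / width
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


for box in anchors_at(0, 0):
    print([round(v, 1) for v in box])
```

A detector would call this for every feature-map location (scaled by the stride) and then predict offsets relative to each anchor.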
6
How does Non-Maximum Suppression (NMS) work?
📊 Medium
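Answer: Removes duplicate detections. Sort boxes by confidence, keep the highest-scoring box, suppress the remaining boxes whose IoU with it exceeds a threshold, then repeat on what is left. Usually applied per class. Soft-NMS decays scores instead of removing boxes. NMS is non-differentiable.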
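The greedy procedure can be sketched as follows. This is a plain-Python illustration for a single class, assuming boxes in (x1, y1, x2, y2) format; `_iou` and `nms` are names chosen here, and production code would normally use a vectorized library routine instead.

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring box still in play
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if _iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first (IoU about 0.68)
```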
7
Explain Faster R-CNN. Role of RPN?
🔥 Hard
Answer: Backbone → Feature maps → Region Proposal Network (RPN) generates candidate boxes (objectness + regression). RoI Pooling crops features → classifier/bbox regressor. End-to-end trainable. RPN replaces selective search.
8
How does YOLO work? Loss function components?
🔥 Hard
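Answer (YOLOv1): Detection as a single regression. Divide the image into an S×S grid; each cell predicts B boxes (x, y, w, h, confidence) and C class probabilities, giving an S×S×(B·5 + C) output. Loss: localization (xywh), confidence (objectness), classification, all as sum-squared error. Later versions replace these terms with cross-entropy and IoU-based losses.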
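To make the grid formulation concrete, here is a small sketch of the output shape and of decoding one predicted box. It assumes the YOLOv1 convention that (x, y) are offsets within the grid cell and (w, h) are fractions of the image size; the function names and the 448-pixel input are illustrative choices.

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Each cell holds B boxes of 5 numbers plus C class probabilities."""
    return (S, S, B * 5 + C)


def decode_box(row, col, x, y, w, h, S=7, image_size=448):
    """Convert one cell-relative prediction to pixel (cx, cy, width, height)."""
    cx = (col + x) / S * image_size  # x is the offset inside the cell
    cy = (row + y) / S * image_size
    return (cx, cy, w * image_size, h * image_size)


print(yolo_v1_output_shape())
print(decode_box(row=3, col=3, x=0.5, y=0.5, w=0.2, h=0.4))
```

With S = 7, B = 2 and C = 20 the output has 30 channels per cell.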
9
What makes SSD fast? Multi-scale feature maps?
📊 Medium
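Answer: SSD predicts class scores and box offsets for default boxes (anchors) on several feature maps of different resolutions, so high-resolution maps catch small objects and low-resolution maps catch large ones. There is no RPN or per-region second stage; the whole detector is one fully convolutional pass. Faster than Faster R-CNN, with competitive accuracy.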
10
What is FPN? Why is it important?
🔥 Hard
Answer: Feature Pyramid Network: top-down pathway + lateral connections. Combines semantic-rich high-level features with high-resolution low-level features. Improves detection of small objects. Used in RetinaNet, Mask R-CNN.
11
What problem does Focal Loss solve? How?
🔥 Hard
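Answer: One-stage detectors face extreme class imbalance: the vast majority of candidate locations are easy background negatives that dominate the loss. Focal Loss multiplies cross-entropy by (1-p_t)^γ, which down-weights well-classified examples and focuses training on hard ones. Here p_t is the predicted probability of the true class, γ is the focusing parameter (γ = 2 in the RetinaNet paper; γ = 0 recovers cross-entropy), and α_t is a class-balancing weight. With it, RetinaNet matches two-stage accuracy.
FL(p_t) = -α_t (1-p_t)^γ log(p_t)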
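A minimal sketch of the formula for a single binary prediction is shown below. The defaults γ = 2 and α = 0.25 follow the values commonly quoted from the RetinaNet paper; the function name and the clamping constant `eps` are assumptions made for this example.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """p: predicted probability of the positive class; y: 1 (object) or 0 (background)."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.1).
print(focal_loss(0.9, 1), focal_loss(0.1, 1))
```

Setting `gamma=0` removes the modulating factor and leaves α-weighted cross-entropy, which is a quick way to sanity-check an implementation.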
12
How does DETR (Detection Transformer) work?
🔥 Hard
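Answer: CNN backbone + Transformer encoder-decoder. Treats detection as set prediction: a fixed number N of learnable object queries each output one box or "no object", and a bipartite (Hungarian) matching loss pairs predictions with ground truth one-to-one. No anchors, no NMS. Main drawback: slow training convergence.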
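The one-to-one matching step can be illustrated with a tiny example. Note that this sketch uses a brute-force search over assignments, not the Hungarian algorithm itself, so it is only practical for a handful of boxes; real implementations typically call a solver such as `scipy.optimize.linear_sum_assignment`. The cost values and the name `match_queries` are made up for illustration.

```python
from itertools import permutations


def match_queries(cost):
    """cost[q][g]: cost of pairing query q with ground-truth g.

    Assumes at least as many queries as ground-truth boxes.
    Returns {query_index: gt_index}; unmatched queries predict "no object".
    """
    num_queries, num_gt = len(cost), len(cost[0])
    best_total, best_assignment = float("inf"), None
    # Try every way of giving each ground-truth box a distinct query.
    for queries in permutations(range(num_queries), num_gt):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best_total, best_assignment = total, queries
    return {q: g for g, q in enumerate(best_assignment)}


# Three queries, two ground-truth boxes.
cost = [[0.9, 0.2],
        [0.1, 0.8],
        [0.5, 0.5]]
print(match_queries(cost))
```

Because every ground-truth box gets exactly one query, duplicate predictions are penalized during training, which is why no NMS is needed afterwards.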
13
RoI Pooling vs RoI Align? Why is Align better?
🔥 Hard
Answer: RoI Pooling quantizes (floor/ceil) causing misalignment. RoI Align uses bilinear interpolation at sample points, no quantization. Essential for pixel-precise tasks (Mask R-CNN). Improves detection accuracy.
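The difference comes down to how a fractional coordinate is read from the feature map. The sketch below shows bilinear sampling at one point, assuming a small 2-D feature map stored as a list of rows and coordinates that lie inside it; a full RoI Align would average several such samples per output bin. `bilinear_sample` is an illustrative name.

```python
import math


def bilinear_sample(feature, x, y):
    """Sample feature (list of rows) at fractional (x, y), with 0 <= x <= W-1, 0 <= y <= H-1."""
    h, w = len(feature), len(feature[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    # Weighted average of the four surrounding cells.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


feature = [[0.0, 10.0],
           [20.0, 30.0]]
print(bilinear_sample(feature, 0.5, 0.5))  # interpolated value
print(feature[int(0.5)][int(0.5)])         # quantized lookup, as in RoI Pooling
```

The quantized lookup snaps to the top-left cell and loses the sub-pixel position, which is the misalignment RoI Align removes.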
14
What is Online Hard Example Mining?
🔥 Hard
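Answer: During training, run the forward pass on all RoIs, then backpropagate only through the top-k highest-loss RoIs (the hard examples). Easy examples contribute no gradient, which removes the need for hand-tuned sampling ratios and improves robustness. OHEM was introduced for Fast R-CNN; SSD uses a related hard negative mining scheme that keeps roughly a 3:1 negative-to-positive ratio.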
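The selection step itself is simple, as this sketch shows. It assumes per-RoI losses have already been computed in a forward pass and that k is at least 1; the function names are hypothetical.

```python
def select_hard_examples(losses, k):
    """Return the indices of the k highest-loss RoIs, in ascending index order."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(order[:k])


def ohem_loss(losses, k):
    """Mean loss over the selected hard examples only (assumes k >= 1)."""
    hard = select_hard_examples(losses, k)
    return sum(losses[i] for i in hard) / len(hard)


roi_losses = [0.05, 2.3, 0.01, 1.1, 0.4]
print(select_hard_examples(roi_losses, k=2))
print(ohem_loss(roi_losses, k=2))
```

In a real training loop only the selected RoIs would be passed through the backward pass.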
15
How are positive/negative anchors assigned?
📊 Medium
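Answer: Faster R-CNN's RPN: positive if IoU > 0.7 with a ground-truth box (or if the anchor has the highest IoU for that box); negative if IoU < 0.3; anchors in between are ignored. YOLO assigns each object to the grid cell that contains its center. RetinaNet uses IoU ≥ 0.5 for positives and IoU < 0.4 for negatives. Adaptive strategies such as ATSS pick the threshold per object from the IoU statistics of nearby anchors and improve results.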
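The IoU-threshold rule (positive above 0.7, negative below 0.3, with a fallback to the highest-IoU anchor) might look like this in code. It is a simplified sketch assuming (x1, y1, x2, y2) boxes and at least one anchor; the label encoding 1 / 0 / -1 and the function names are choices made for this example.

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label each anchor: 1 = positive, 0 = negative, -1 = ignored."""
    ious = [[_iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in ious:
        best = max(row) if row else 0.0
        if best > pos_thr:
            labels.append(1)
        elif best < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)
    # Fallback: every ground-truth box claims its highest-IoU anchor.
    for g in range(len(gt_boxes)):
        best_anchor = max(range(len(anchors)), key=lambda a: ious[a][g])
        if ious[best_anchor][g] > 0:
            labels[best_anchor] = 1
    return labels


gt = [(0, 0, 10, 10)]
print(assign_anchors([(0, 0, 10, 10), (0, 0, 10, 5), (20, 20, 30, 30)], gt))
print(assign_anchors([(0, 0, 10, 6), (50, 50, 60, 60)], gt))
```

The second call shows the fallback: the best anchor only reaches IoU 0.6, yet it is still marked positive so the object is not left without a training target.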
16
Common data augmentation strategies?
📊 Medium
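Answer: Random crop, flip, rotation, color jitter. Mosaic (YOLOv4): combine 4 images into one. MixUp, CutOut, GridMask. Keep boxes consistent: every geometric transform applied to the image must also be applied to its bounding boxes, and boxes that fall mostly outside a crop should be clipped or dropped.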
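As a small example of keeping boxes consistent, here is how a horizontal flip would transform them. The sketch assumes pixel-coordinate boxes in (x1, y1, x2, y2) format; `hflip_boxes` is an illustrative name.

```python
def hflip_boxes(boxes, image_width):
    """Mirror (x1, y1, x2, y2) boxes to match a horizontally flipped image."""
    # The left and right edges swap roles, so the new x1 comes from the old x2.
    return [(image_width - x2, y1, image_width - x1, y2) for x1, y1, x2, y2 in boxes]


boxes = [(10, 20, 30, 40)]
flipped = hflip_boxes(boxes, image_width=100)
print(flipped)
print(hflip_boxes(flipped, image_width=100))  # flipping twice restores the input
```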
17
Instance segmentation vs panoptic segmentation?
📊 Medium
Answer: Instance: detect and segment each object instance (Mask R-CNN); background is left unlabeled. Panoptic: unifies instance segmentation (things) with semantic segmentation (stuff). Every pixel gets a class label, and pixels belonging to things also get an instance ID.
18
Why are small objects hard to detect? Solutions?
🔥 Hard
Answer: Low resolution, few pixels, anchor mismatch, feature map downsampling. Solutions: FPN, high-resolution input, copy-paste augmentation, better anchor design, context modeling, GAN-based super-resolution.
19
Other metrics for object detection?
📊 Medium
Answer: FPS (speed), FLOPs, model size, AP@0.5, AP@0.75, AP_small/medium/large, recall, precision, inference latency. COCO also reports AR (average recall).
20
What problem does Deformable DETR solve?
🔥 Hard
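Answer: DETR converges slowly, and its global attention is quadratic in the number of feature-map pixels, which makes high-resolution features too expensive. Deformable DETR: each query attends only to a small set of learned sampling points around a reference point (deformable attention). This allows multi-scale features, faster training, and better small-object performance.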
Object Detection – Interview Cheat Sheet
Detector Families
- Two-stage: Faster R-CNN, Mask R-CNN, Cascade R-CNN
- One-stage: YOLO, SSD, RetinaNet
- Transformer: DETR, Deformable DETR
Key Components
- Anchor boxes / object queries
- NMS / Soft-NMS
- FPN / BiFPN
- RoI Align / RoI Pooling
Loss Functions
- Focal Loss: class imbalance
- GIoU/DIoU/CIoU: better box regression
- Hungarian: DETR set loss
Speed vs Accuracy
- YOLOv7: real-time
- RetinaNet: balanced
- Cascade R-CNN: high accuracy
Verdict: "Anchor-based dominates, but transformer detectors are rising. Know your IoU, NMS, and FPN."