
Object Detection: Localize & Classify

Beyond image classification: draw bounding boxes around every object. From R-CNN to YOLOv8 and DETR — master the architectures that power autonomous vehicles, medical imaging, and visual search.

  • Bounding Box: x, y, w, h
  • IoU: Intersection over Union
  • NMS: Non-Max Suppression
  • mAP: Mean Average Precision

What is Object Detection?

Object detection = localization (where?) + classification (what?). Output: variable number of bounding boxes with class labels.

Input Image Detector [(x₁,y₁,w₁,h₁, class₁), ..., (xₙ,yₙ,wₙ,hₙ, classₙ)]

Challenge: varying number of objects, scale, occlusion, real-time speed.

✓ Classification: what?
✓ Localization: where?
✓ Multiple objects

Detection Fundamentals

IoU (Intersection over Union)

IoU = Area of Overlap / Area of Union

Threshold: typically 0.5 (PASCAL) or 0.5:0.95 (COCO).

def iou(box1, box2):
    # boxes in [x1, y1, x2, y2] format
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0.0

Non-Max Suppression (NMS)

Remove duplicate detections: pick highest confidence, suppress overlapping boxes.

  1. Sort by confidence
  2. Select top box, remove IoU > threshold
  3. Repeat
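The three steps above can be sketched in plain Python. A minimal greedy implementation, assuming boxes in [x1, y1, x2, y2] format (function names are illustrative):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: boxes as [x1, y1, x2, y2], scores as floats."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    # 1. Sort indices by descending confidence
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # 2. Select the most confident remaining box
        keep.append(best)
        # Suppress boxes that overlap it above the threshold
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep  # 3. Repeat until no boxes remain
```

Real frameworks use vectorized versions (e.g. torchvision.ops.nms), but the logic is the same.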
Anchor Boxes

Predefined boxes of different scales/aspect ratios. Network predicts offsets from anchors.

Faster R-CNN: 9 anchors per position (3 scales × 3 aspect ratios). YOLOv2/v3: 5-9 anchors obtained by k-means clustering on the training boxes.
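A minimal sketch of Faster R-CNN-style anchor generation at one feature-map position, assuming 3 scales × 3 aspect ratios (the helper name is illustrative):

```python
import itertools

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate anchors centered at (cx, cy).

    Each anchor has area scale**2 and aspect ratio w/h = ratio,
    so 3 scales x 3 ratios = 9 anchors per position."""
    anchors = []
    for scale, ratio in itertools.product(scales, ratios):
        w = scale * ratio ** 0.5   # w/h = ratio and w*h = scale**2
        h = scale / ratio ** 0.5
        anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return anchors
```

The network then regresses offsets (dx, dy, dw, dh) relative to each anchor rather than predicting absolute coordinates.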

Two-Stage Detectors: R-CNN → Fast → Faster

R-CNN (2014)

Selective Search → 2000 region proposals → warp → CNN → SVM + bbox regressor.

Slow: 47s/image.

Fast R-CNN (2015)

Single CNN forward pass. RoI pooling layer extracts features for each proposal.

Multi-task loss: classification + bbox regression.

0.3s/image.
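The multi-task loss can be sketched as below. This is a simplification (class-agnostic box regression); the actual Fast R-CNN head predicts per-class box offsets:

```python
import torch
import torch.nn.functional as F

def detection_loss(class_logits, box_preds, labels, box_targets):
    """Fast R-CNN-style multi-task loss: cross-entropy for classes
    plus smooth L1 for boxes, applied only to foreground RoIs."""
    cls_loss = F.cross_entropy(class_logits, labels)
    fg = labels > 0  # background is class 0, no box regression for it
    box_loss = F.smooth_l1_loss(box_preds[fg], box_targets[fg])
    return cls_loss + box_loss
```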

Faster R-CNN (2015)

Region Proposal Network (RPN) replaces selective search. End-to-end trainable.

Anchors at each location. Still two-stage, but fast enough to approach real-time.

Faster R-CNN with PyTorch (Torchvision)
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Load pretrained model (torchvision >= 0.13 uses weights= instead of pretrained=)
model = fasterrcnn_resnet50_fpn(weights='DEFAULT')
model.eval()

# Inference: the model takes a list of 3xHxW tensors scaled to [0, 1]
with torch.no_grad():
    predictions = model([image_tensor])
    # predictions: list[dict] with 'boxes', 'labels', 'scores'

# Fine-tune on a custom dataset: replace the box predictor head
num_classes = 10  # your classes + background
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

One-Stage Detectors: YOLO & SSD

Single forward pass: simultaneous classification + localization. Much faster, ideal for real-time.

YOLO – You Only Look Once

Divide image into S×S grid. Each cell predicts B boxes + confidence + C class probabilities.

Loss = λ_coord * L_xywh + λ_obj * L_obj + λ_noobj * L_noobj + L_class
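Target assignment in YOLOv1 can be sketched as follows: the grid cell containing an object's center is responsible for predicting it, and the cell predicts the center as an offset within itself (illustrative helper, coordinates normalized to [0, 1]):

```python
def assign_to_grid(box_cx, box_cy, S=7):
    """YOLOv1-style assignment: return the responsible (row, col) cell
    and the box center's offsets within that cell."""
    col, row = int(box_cx * S), int(box_cy * S)
    tx, ty = box_cx * S - col, box_cy * S - row  # offsets in [0, 1)
    return (row, col), (tx, ty)
```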

YOLOv1 YOLOv2/v3 YOLOv4 YOLOv5 YOLOv6 YOLOv7 YOLOv8

Anchor boxes, multi-scale, CSPNet, anchor-free variants.

SSD – Single Shot MultiBox Detector

Multi-scale feature maps. Predict offsets from default boxes at each scale.

No RPN, no resampling. Faster than Faster R-CNN, competitive accuracy.

# SSD backbone: VGG or ResNet
# Extra convolutional layers for scale pyramid

YOLOv5 / YOLOv8 Inference (Ultralytics)
from ultralytics import YOLO

# Load pretrained model
model = YOLO('yolov8n.pt')  # nano, small, medium, large, xlarge

# Run inference
results = model('image.jpg', save=True)

# Access detections
for r in results:
    for box in r.boxes:
        x1,y1,x2,y2 = box.xyxy[0].tolist()
        conf = box.conf[0].item()
        cls = int(box.cls[0].item())
        print(f"{cls} @ {conf:.2f}: [{x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f}]")

# Train on custom data
model.train(data='coco128.yaml', epochs=50)

RetinaNet: Focal Loss for Class Imbalance

One-stage detectors used to lag in accuracy due to extreme foreground-background imbalance. Focal loss solves it.

Focal Loss Formula

FL(pₜ) = -αₜ(1-pₜ)ᵞ log(pₜ)

γ=2, α=0.25. Down-weights easy examples, focuses on hard misclassifications.

RetinaNet = ResNet/FPN backbone + two subnetworks (class + box).

# Focal Loss implementation (binary, with logits)
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits, targets):
        ce_loss = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
        p = torch.sigmoid(logits)                    # logits -> probabilities
        p_t = p * targets + (1 - p) * (1 - targets)  # probability of the true class
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        focal_loss = alpha_t * (1 - p_t) ** self.gamma * ce_loss
        return focal_loss.mean()

Anchor-Free & Transformer Detectors

Anchor-Free Detectors

Predict keypoints or center points instead of anchor offsets.

  • CenterNet: Object as center point + width/height
  • FCOS: Per-pixel prediction, multi-level FPN
  • CornerNet: Detect corners, group embeddings

Fewer hyperparameters, simpler.

DETR – Detection Transformer

End-to-end object detection with Transformers. No anchors, no NMS (bipartite matching).

Encoder-decoder: CNN backbone + transformer + fixed set of object queries.

Hungarian loss matches predictions to ground truth.
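The bipartite matching step can be illustrated with scipy's linear_sum_assignment, which DETR itself uses. The cost below (L1 box distance minus class probability) is a simplification of DETR's actual matching cost, which also includes a generalized IoU term; all values are made up for the toy example:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy example: 4 object queries vs 2 ground-truth objects (cx, cy, w, h)
pred_boxes = np.array([[0.10, 0.10, 0.30, 0.30],
                       [0.50, 0.50, 0.20, 0.20],
                       [0.80, 0.80, 0.10, 0.10],
                       [0.20, 0.60, 0.40, 0.30]])
pred_probs = np.array([0.9, 0.2, 0.8, 0.1])  # prob of the GT class per query
gt_boxes = np.array([[0.12, 0.10, 0.30, 0.30],
                     [0.80, 0.78, 0.10, 0.12]])

# Cost = L1 box distance - class probability (DETR adds a GIoU term too)
cost = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1) - pred_probs[:, None]
rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
matches = [(int(r), int(c)) for r, c in zip(rows, cols)]
# Unmatched queries are trained to predict the "no object" class
```

Here queries 0 and 2 get matched to the two ground-truth objects; the remaining queries are supervised as "no object".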

DETR Inference (Hugging Face)
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("image.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert outputs to (x1, y1, x2, y2) boxes in image coordinates
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    if score > 0.7:
        print(f"{model.config.id2label[label.item()]}: {score:.2f} at {box.tolist()}")

Evaluation: mAP (mean Average Precision)

Standard metric for object detection (PASCAL VOC, COCO).

Precision-Recall Curve

For each class, rank detections by confidence. Compute precision/recall at each rank.

AP = area under P-R curve.

mAP = mean AP over all classes.
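A minimal sketch of AP for one class, computed as the non-interpolated area under the precision-recall curve (VOC and COCO use interpolated variants, but the idea is the same):

```python
def average_precision(ranked_tp, num_gt):
    """AP for one class.

    ranked_tp: 1/0 flags (true/false positive at the chosen IoU threshold)
               for detections sorted by descending confidence.
    num_gt: number of ground-truth boxes of this class."""
    tp_cum = 0
    ap, prev_recall = 0.0, 0.0
    for rank, tp in enumerate(ranked_tp, start=1):
        tp_cum += tp
        precision = tp_cum / rank      # precision at this rank
        recall = tp_cum / num_gt       # recall at this rank
        if recall > prev_recall:       # area accrues only when recall rises
            ap += precision * (recall - prev_recall)
            prev_recall = recall
    return ap
```

mAP then averages this value over all classes (and, for COCO, over IoU thresholds).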

COCO mAP
  • mAP@0.5: IoU threshold 0.5 (PASCAL)
  • mAP@0.5:0.95: average over IoU thresholds 0.5 to 0.95 step 0.05 (COCO primary)
  • mAP_small, mAP_medium, mAP_large
# Using torchmetrics
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# predictions: list of dicts with 'boxes', 'scores', 'labels'
# ground_truths: list of dicts with 'boxes', 'labels'
metric = MeanAveragePrecision(iou_thresholds=[0.5, 0.75])
metric.update(predictions, ground_truths)
results = metric.compute()  # dict with 'map', 'map_50', 'map_75', ...

Training Tricks & Augmentation

✅ Multi-scale training

Random resize between 640-800px each iteration.

✅ Mosaic augmentation

Combine 4 images into one (YOLOv4). Boosts small object detection.

✅ MixUp / CutMix

Blend images and labels.

⚠️ Class imbalance: Focal loss, hard negative mining.
⚠️ Overfitting: Use pretrained backbone, heavy augmentation, dropout.
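As an illustration of MixUp for detection: blend the two images pixel-wise, but (unlike classification MixUp) keep both sets of boxes rather than mixing labels. A sketch, assuming images as NumPy arrays and boxes as [x1, y1, x2, y2, class] lists:

```python
import numpy as np

def mixup(img1, boxes1, img2, boxes2, alpha=1.5):
    """MixUp for detection: blend pixels, concatenate boxes.

    Returns the mixed image, all boxes from both images, and
    optional per-box loss weights (lam for img1, 1-lam for img2)."""
    lam = np.random.beta(alpha, alpha)
    mixed = lam * img1.astype(np.float32) + (1 - lam) * img2.astype(np.float32)
    boxes = boxes1 + boxes2
    weights = [lam] * len(boxes1) + [1 - lam] * len(boxes2)
    return mixed, boxes, weights
```

Mosaic works similarly but tiles four images into one canvas and shifts each image's boxes by its tile offset.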

Detector Comparison

Detector      | Type        | Speed (FPS) | COCO mAP | Key Feature
Faster R-CNN  | Two-stage   | 7-15        | 42-45    | RPN, accuracy benchmark
SSD           | One-stage   | 40-60       | 28-33    | Multi-scale defaults
YOLOv5        | One-stage   | 80-140      | 50-55    | Speed-accuracy tradeoff
YOLOv8        | One-stage   | 80-160      | 53-57    | Anchor-free, SOTA
RetinaNet     | One-stage   | 20-40       | 40-44    | Focal loss
DETR          | Transformer | 20-28       | 42-45    | No anchors/NMS
DINO          | Transformer | 20-30       | 63+      | SOTA Transformer

Real-World Object Detection

Autonomous Vehicles

Cars, pedestrians, traffic signs

Medical Imaging

Tumors, lesions, cells

Robotics

Object grasping, navigation

Retail

Shelf monitoring, checkout-free

Deploying Object Detectors

TensorRT / ONNX

Optimize YOLO/Faster R-CNN for GPU inference. 2-5x speedup.

OpenVINO

Intel CPU/VPU acceleration. Popular for edge.

# Export YOLOv8 to ONNX
model.export(format='onnx', imgsz=640)

# TensorRT engine (requires a GPU)
model.export(format='engine', device=0)

# TorchScript
model.export(format='torchscript')

Object Detection Cheatsheet

  • IoU: overlap measure
  • NMS: remove duplicates
  • Anchor: predefined boxes
  • RPN: region proposals
  • YOLO: grid + regression
  • SSD: multi-scale
  • Focal Loss: class imbalance
  • DETR: transformer
  • mAP: evaluation
  • FPN: feature pyramid