Bounding boxes and scores
A box is often stored as (x_min, y_min, x_max, y_max) in pixel coordinates, or center (cx, cy) with width/height. Each prediction includes class probabilities (or logits) and an objectness score in some architectures. Post-processing merges overlapping predictions.
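Converting between the two storage conventions is a one-liner each way; a minimal sketch (function names are illustrative):

```python
def xyxy_to_cxcywh(box):
    # (x_min, y_min, x_max, y_max) -> (cx, cy, w, h)
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

def cxcywh_to_xyxy(box):
    # (cx, cy, w, h) -> (x_min, y_min, x_max, y_max)
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

Models often predict in one convention while losses or drawing utilities expect the other, so explicit converters avoid silent coordinate bugs.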
IoU and non-maximum suppression
Intersection over Union (IoU) measures overlap between two boxes on [0, 1]. It gates “is this detection a match to ground truth?” during evaluation and training (e.g. assign anchors to targets).
def box_iou(a, b):
    # a, b: (x1, y1, x2, y2) with x1 < x2 and y1 < y2
    xi1, yi1 = max(a[0], b[0]), max(a[1], b[1])
    xi2, yi2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    aa = (a[2] - a[0]) * (a[3] - a[1])
    bb = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (aa + bb - inter + 1e-6)  # epsilon avoids division by zero
NMS keeps the highest-scoring box and discards others of the same class with IoU above a threshold (e.g. 0.5), repeating until the list is exhausted—this removes duplicate boxes on one object.
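The greedy loop just described can be sketched in a few lines of pure Python; this is illustrative rather than optimized (production code would use a vectorized version such as torchvision's NMS op). The IoU helper inside repeats the box_iou computation above so the sketch is self-contained:

```python
def nms(boxes, scores, iou_thr=0.5):
    # boxes: list of (x1, y1, x2, y2); scores: parallel list of floats
    def iou(a, b):  # same computation as box_iou above
        xi1, yi1 = max(a[0], b[0]), max(a[1], b[1])
        xi2, yi2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / (union + 1e-6)

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)       # highest-scoring box left
        keep.append(best)
        # drop remaining boxes that overlap the kept one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thr]
    return keep
```

Running it on two near-duplicate boxes plus one distant box keeps only the top duplicate and the distant box.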
mAP and precision–recall
For each class, sort predictions by score; at each threshold compute precision and recall vs ground truth (matched by IoU ≥ 0.5 for COCO “AP50”). Average Precision (AP) is the area under the precision–recall curve. mAP averages AP over classes (and sometimes over IoU thresholds, e.g. COCO AP@[.5:.95]). Higher mAP = better overall detection quality.
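As a sketch, all-points AP for one class can be computed from score-sorted match flags (1 = the prediction matched a ground truth at IoU ≥ 0.5, 0 = false positive). This mirrors the common "precision envelope" formulation but is not the exact COCO implementation:

```python
def average_precision(matches, num_gt):
    # matches: per-prediction 1/0 flags, already sorted by descending score
    tp = fp = 0
    precisions, recalls = [], []
    for m in matches:
        tp += m
        fp += 1 - m
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # make precision monotonically non-increasing (the PR envelope)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # step-wise area under the precision-recall curve
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

mAP is then the mean of this quantity over classes (and, for COCO, over IoU thresholds from 0.5 to 0.95).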
Two-stage vs one-stage
Two-stage (e.g. R-CNN family)
First propose regions, then classify and refine boxes. Often more accurate, slower per image.
One-stage (e.g. YOLO, SSD, RetinaNet)
Dense predictions over a grid or anchors in one forward pass—favored for real-time and embedded.
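The "dense predictions over a grid or anchors" idea can be sketched by enumerating anchor boxes over feature-map cells; a hypothetical minimal generator (stride, sizes, and ratios are illustrative, not any specific model's defaults):

```python
def make_anchors(img_w, img_h, stride=32, sizes=(32, 64), ratios=(1.0, 2.0)):
    # one anchor per (cell, size, aspect ratio), as (x1, y1, x2, y2)
    anchors = []
    for cy in range(stride // 2, img_h, stride):      # cell centers, y
        for cx in range(stride // 2, img_w, stride):  # cell centers, x
            for s in sizes:
                for r in ratios:
                    # keep area s*s while varying aspect ratio r = w/h
                    w, h = s * (r ** 0.5), s / (r ** 0.5)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

A one-stage head then predicts a score and box offsets for every anchor in a single forward pass; NMS cleans up afterwards.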
Example: Faster R-CNN (torchvision)
import torch
import torchvision.transforms as T
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").to(device).eval()

img = Image.open("street.jpg").convert("RGB")
x = T.functional.to_tensor(img).to(device)

with torch.no_grad():
    r = model([x])[0]

for i in range(len(r["scores"])):
    if r["scores"][i] < 0.5:
        continue  # skip low-confidence detections
    box = r["boxes"][i].tolist()
    label = int(r["labels"][i])
    score = float(r["scores"][i])
    # draw box with PIL/OpenCV using COCO label names
Training requires a Dataset returning image tensor and target dict with boxes, labels, image_id—see torchvision detection reference.
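As a sketch of the target shape the Dataset must return, here is a plain-Python stand-in (real code would return torch tensors; the helper name is hypothetical):

```python
def make_target(boxes_xyxy, class_ids, image_id):
    # shape expected by torchvision detection models (values become tensors)
    assert all(b[2] > b[0] and b[3] > b[1] for b in boxes_xyxy), "degenerate box"
    return {
        "boxes": boxes_xyxy,   # [[x1, y1, x2, y2], ...] in pixel coordinates
        "labels": class_ids,   # one class index per box (0 is background)
        "image_id": image_id,
    }
```

Degenerate boxes (zero width or height) are a common source of NaN losses, hence the assertion.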
Data and deployment
Strong augmentations (mosaic, mixup, random crop) are common for one-stage detectors. For deployment, export to ONNX or TensorRT, quantize to INT8 where accuracy allows, and batch inputs for throughput.
Takeaways
- Detection = where + what for multiple objects.
- IoU and NMS are core to both training assignment and inference cleanup.
- Compare models with mAP on the same benchmark and IoU rules.
Quick FAQ
Can I run inference without PyTorch at runtime? Yes: export the model (e.g. to ONNX) and load it with cv2.dnn from C++ or Python. Pre/post-processing must match what the exporter assumed.