Computer Vision Chapter 18

Object detection (intro)

Object detection outputs a set of bounding boxes and a class label (and confidence) for each object in an image. Unlike classification (“what is in the image?”), detection answers “where are the objects and what are they?” Modern detectors are almost always learned end-to-end from large labeled datasets (COCO, Open Images). This page defines IoU, NMS, mAP, contrasts one-stage vs two-stage designs, and shows a short Faster R-CNN inference script.

Bounding boxes and scores

A box is often stored as (x_min, y_min, x_max, y_max) in pixel coordinates, or center (cx, cy) with width/height. Each prediction includes class probabilities (or logits) and an objectness score in some architectures. Post-processing merges overlapping predictions.
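Converting between the two conventions is a one-liner in each direction. A minimal sketch (function names are mine, not from any library):

```python
def xyxy_to_cxcywh(box):
    """(x_min, y_min, x_max, y_max) -> (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1)

def cxcywh_to_xyxy(box):
    """(cx, cy, w, h) -> (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```

Round-tripping a box through both functions returns the original coordinates, which is a handy sanity check when mixing annotation formats.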

IoU and non-maximum suppression

Intersection over Union (IoU) measures the overlap between two boxes on [0, 1]. During evaluation it decides whether a detection counts as a match to a ground-truth box, and during training it is used to assign anchors or proposals to targets.

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    xi1, yi1 = max(a[0], b[0]), max(a[1], b[1])  # intersection top-left
    xi2, yi2 = min(a[2], b[2]), min(a[3], b[3])  # intersection bottom-right
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)  # epsilon avoids 0/0

NMS keeps the highest-scoring box, discards all other boxes of the same class whose IoU with it exceeds a threshold (e.g. 0.5), then repeats with the next-highest surviving box until the list is exhausted. This removes duplicate detections of the same object.
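The greedy procedure above can be sketched in a few lines of plain Python (box_iou is repeated here so the snippet is self-contained):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    xi1, yi1 = max(a[0], b[0]), max(a[1], b[1])
    xi2, yi2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS over one class; returns indices of the boxes to keep."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # drop every remaining box that overlaps it too much
        order = [i for i in order if box_iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

In practice this O(n²) loop is replaced by vectorized or library implementations (e.g. torchvision.ops.nms), but the logic is the same.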

mAP and precision–recall

For each class, sort predictions by score; at each threshold compute precision and recall vs ground truth (matched by IoU ≥ 0.5 for COCO “AP50”). Average Precision (AP) is the area under the precision–recall curve. mAP averages AP over classes (and sometimes over IoU thresholds, e.g. COCO AP@[.5:.95]). Higher mAP = better overall detection quality.
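To make the AP computation concrete, here is a sketch using all-point interpolation (as in post-2010 Pascal VOC; COCO samples the curve at 101 recall points, which gives nearly identical values). The function name and the toy flags are mine; the input is a list of 1/0 flags for score-sorted predictions, where 1 means the prediction matched a ground-truth box at the chosen IoU threshold:

```python
def average_precision(tp_flags, num_gt):
    """AP from score-sorted TP/FP flags and the number of ground-truth boxes."""
    tp = fp = 0
    precisions, recalls = [], []
    for t in tp_flags:               # walk down the score-sorted list
        tp += t
        fp += 1 - t
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # make precision monotonically non-increasing (right-to-left running max)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # integrate precision over recall
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

mAP then averages this per-class AP over all classes (and, for COCO, over IoU thresholds from 0.5 to 0.95).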

Two-stage vs one-stage

Two-stage (e.g. R-CNN family)

First propose candidate regions, then classify each and refine its box. Often more accurate, but slower per image.

One-stage (e.g. YOLO, SSD, RetinaNet)

Dense predictions over a grid or anchors in one forward pass—favored for real-time and embedded.

Example: Faster R-CNN (torchvision)

import torch
import torchvision.transforms.functional as F
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").to(device).eval()

img = Image.open("street.jpg").convert("RGB")
x = F.to_tensor(img).to(device)  # CHW float tensor in [0, 1]

with torch.no_grad():
    r = model([x])[0]  # dict with "boxes", "labels", "scores"

for box, label, score in zip(r["boxes"], r["labels"], r["scores"]):
    if score < 0.5:  # confidence threshold
        continue
    box = box.tolist()
    label = int(label)
    score = float(score)
    # draw box with PIL/OpenCV using COCO label names

Training requires a Dataset returning image tensor and target dict with boxes, labels, image_id—see torchvision detection reference.
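As a rough sketch of that target format (the image and annotations below are hard-coded placeholders, not a real loading pipeline), a minimal Dataset compatible with torchvision's detection models might look like:

```python
import torch
from torch.utils.data import Dataset

class ToyDetectionDataset(Dataset):
    """Minimal sketch: yields (image, target) pairs in the format the
    torchvision detection models expect. A real dataset would load
    image files and parse an annotation format such as COCO JSON."""

    def __init__(self):
        # one fake 3x64x64 image with a single box and class label
        self.samples = [
            (torch.rand(3, 64, 64), [[10.0, 10.0, 40.0, 50.0]], [1]),
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        img, boxes, labels = self.samples[idx]
        target = {
            "boxes": torch.tensor(boxes, dtype=torch.float32),  # (N, 4) xyxy
            "labels": torch.tensor(labels, dtype=torch.int64),  # (N,)
            "image_id": torch.tensor([idx]),
        }
        return img, target
```

Because images in a batch may have different sizes, detection training loops typically use a collate function that returns lists rather than stacked tensors.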

Data and deployment

Strong augmentations (mosaic, mixup, random crop) are common for one-stage detectors. For deployment, export to ONNX or TensorRT, quantize to INT8 where accuracy allows, and batch inputs for throughput.

Takeaways

  • Detection = where + what for multiple objects.
  • IoU and NMS are core to both training assignment and inference cleanup.
  • Compare models with mAP on the same benchmark and IoU rules.

Quick FAQ

How do I cut down duplicate detections?
Lower the NMS IoU threshold, raise the score threshold, or use Soft-NMS variants. Persistent duplicates on one object usually mean the NMS threshold is too loose, not that the model is miscalibrated.

Can I run inference without PyTorch?
Yes: export to ONNX (or use YOLO weights) and load the model with cv2.dnn from C++ or Python, with no PyTorch at runtime. Pre- and post-processing must match what the exporter assumed.