Computer Vision Chapter 24

SORT & DeepSORT

SORT (Simple Online and Realtime Tracking) is a lightweight multi-object tracking (MOT) pipeline: run a detector each frame, propagate each track with a Kalman filter on the bounding box, then associate detections to tracks using IoU and the Hungarian algorithm. DeepSORT adds an appearance embedding (ReID CNN) so re-identification survives longer occlusions when motion alone is ambiguous.

SORT pipeline

  1. Detect — bounding boxes + scores from YOLO, Faster R-CNN, etc.
  2. Predict — each active track advances its Kalman state with a constant-velocity model (box center and scale carry velocities; aspect ratio is held constant).
  3. Associate — build a cost matrix (often 1 − IoU between predicted boxes and detections); solve assignment with Hungarian / linear sum assignment.
  4. Update — matched tracks get Kalman correction; unmatched detections spawn new tracks; unmatched tracks age out after T frames.
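The Predict and Update steps above can be sketched as a small constant-velocity Kalman filter over the box state. This is a minimal illustration, not a reference implementation: the class name KalmanBoxTrack, the helper functions, and the noise values (q, r, and the initial covariance) are assumptions chosen for readability.

```python
import numpy as np

def box_to_z(box):
    """[x1, y1, x2, y2] -> measurement [cx, cy, s, r] with s = area, r = w / h."""
    w, h = box[2] - box[0], box[3] - box[1]
    return np.array([box[0] + w / 2.0, box[1] + h / 2.0, w * h, w / h])

def z_to_box(z):
    """Measurement [cx, cy, s, r] -> [x1, y1, x2, y2]."""
    w = np.sqrt(z[2] * z[3])
    h = z[2] / w
    return np.array([z[0] - w / 2, z[1] - h / 2, z[0] + w / 2, z[1] + h / 2])

class KalmanBoxTrack:
    """Hypothetical track: state is [cx, cy, s, r, vcx, vcy, vs]."""

    def __init__(self, box, q=1e-2, r=1.0):
        self.x = np.zeros(7)
        self.x[:4] = box_to_z(box)
        self.P = np.eye(7) * 10.0      # initial uncertainty (assumed value)
        self.F = np.eye(7)             # constant-velocity transition
        self.F[0, 4] = self.F[1, 5] = self.F[2, 6] = 1.0
        self.H = np.eye(4, 7)          # only [cx, cy, s, r] is observed
        self.Q = np.eye(7) * q         # process noise (assumed value)
        self.R = np.eye(4) * r         # measurement noise (assumed value)
        self.misses = 0                # frames since the last matched detection

    def predict(self):
        # Keep the predicted area positive so the box stays valid.
        if self.x[2] + self.x[6] <= 0:
            self.x[6] = 0.0
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        self.misses += 1
        return z_to_box(self.x[:4])

    def update(self, box):
        y = box_to_z(box) - self.H @ self.x          # innovation
        S = self.H @ self.P @ self.H.T + self.R      # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(7) - K @ self.H) @ self.P
        self.misses = 0
```

A tracker would call predict() on every track each frame, call update() on matched tracks, and drop any track whose misses counter exceeds T.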

IoU cost and assignment (sketch)

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    x1 = max(a[0], b[0]); y1 = max(a[1], b[1])
    x2 = min(a[2], b[2]); y2 = min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    ua = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / (ua + 1e-6)

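# pred_boxes[i], det_boxes[j] as [x1, y1, x2, y2]; toy values for illustration
pred_boxes = [[0, 0, 10, 10], [20, 20, 30, 30]]
det_boxes = [[21, 21, 31, 31], [1, 1, 11, 11]]

IOU_MIN = 0.1   # pairs at or below this overlap are gated out
GATED = 99.0    # large cost so the solver avoids gated pairs

# cost[i, j] = 1 - iou(pred_i, det_j), or GATED when the overlap is too small
cost = np.full((len(pred_boxes), len(det_boxes)), GATED)
for i, p in enumerate(pred_boxes):
    for j, d in enumerate(det_boxes):
        v = iou(p, d)
        if v > IOU_MIN:
            cost[i, j] = 1.0 - v

row_ind, col_ind = linear_sum_assignment(cost)
# The solver can still pick a gated pair when nothing better exists; drop those.
matches = [(i, j) for i, j in zip(row_ind, col_ind) if cost[i, j] < GATED]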

Production code vectorizes the IoU computation and enforces a minimum-IoU threshold on accepted matches; the loop above only illustrates the idea.
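A vectorized version might look like the sketch below. It computes the full IoU matrix with NumPy broadcasting, solves the assignment, and then splits the result into matches, unmatched tracks, and unmatched detections. The function names iou_matrix and associate and the default iou_min=0.3 are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(a, b):
    """Pairwise IoU between boxes a (N, 4) and b (M, 4), format [x1, y1, x2, y2]."""
    a = np.asarray(a, dtype=float)[:, None, :]   # (N, 1, 4)
    b = np.asarray(b, dtype=float)[None, :, :]   # (1, M, 4)
    x1 = np.maximum(a[..., 0], b[..., 0])
    y1 = np.maximum(a[..., 1], b[..., 1])
    x2 = np.minimum(a[..., 2], b[..., 2])
    y2 = np.minimum(a[..., 3], b[..., 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area_a + area_b - inter + 1e-6)

def associate(pred_boxes, det_boxes, iou_min=0.3):
    """Return (matches, unmatched_track_idx, unmatched_det_idx)."""
    n, m = len(pred_boxes), len(det_boxes)
    if n == 0 or m == 0:
        return [], list(range(n)), list(range(m))
    ious = iou_matrix(pred_boxes, det_boxes)
    rows, cols = linear_sum_assignment(1.0 - ious)
    # Reject assigned pairs whose overlap is too small to be trusted.
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if ious[r, c] >= iou_min]
    matched_tracks = {r for r, _ in matches}
    matched_dets = {c for _, c in matches}
    unmatched_tracks = [i for i in range(n) if i not in matched_tracks]
    unmatched_dets = [j for j in range(m) if j not in matched_dets]
    return matches, unmatched_tracks, unmatched_dets
```

The two unmatched lists feed track management directly: unmatched detections spawn new tracks, and unmatched tracks accumulate misses.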

DeepSORT: motion + appearance

DeepSORT tracks an eight-dimensional Kalman state (box center, aspect ratio, height, and their velocities) and gates candidate matches by the Mahalanobis distance between each detection and the track's predicted distribution in measurement space. Each pedestrian bounding-box crop is passed through a small CNN that outputs an L2-normalized embedding; the cosine distance between a detection and the track's gallery of stored embeddings provides a second association cue. The final cost blends motion and appearance, with tuned weights and gates.
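The blended cost for one track against several detections can be sketched as follows. This is an illustration under stated assumptions, not the reference code: the function names, the weight lam, the appearance gate max_cos, and the sentinel cost big are all hypothetical choices. The motion gate uses the 95% chi-square quantile for four degrees of freedom, matching the four-dimensional measurement.

```python
import numpy as np
from scipy.stats import chi2

# 95% quantile of chi-square with 4 degrees of freedom (about 9.49).
GATE = chi2.ppf(0.95, df=4)

def mahalanobis_sq(mean, cov, measurements):
    """Squared Mahalanobis distance from each measurement (M, 4) to N(mean, cov)."""
    d = np.asarray(measurements, dtype=float) - mean   # (M, 4)
    sol = np.linalg.solve(cov, d.T)                    # (4, M)
    return np.einsum("ij,ji->i", d, sol)               # (M,)

def cosine_distance(gallery, feats):
    """Smallest cosine distance from each detection feature to the track gallery.

    gallery is (K, D), feats is (M, D); both are assumed L2-normalized.
    """
    return (1.0 - np.asarray(gallery) @ np.asarray(feats).T).min(axis=0)

def blended_cost(mean, cov, gallery, det_meas, det_feats,
                 lam=0.5, max_cos=0.2, big=1e5):
    """Cost of assigning each detection to one track; gated pairs get cost `big`."""
    d_motion = mahalanobis_sq(mean, cov, det_meas)
    d_app = cosine_distance(gallery, det_feats)
    cost = lam * d_motion + (1.0 - lam) * d_app
    cost[(d_motion > GATE) | (d_app > max_cos)] = big
    return cost
```

Stacking one such row per track gives the cost matrix passed to the assignment solver. Setting lam to 0 keeps the Mahalanobis term as a gate only, which can be preferable when camera motion makes the motion model unreliable.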

Why embeddings?

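IoU alone becomes ambiguous when people cross paths: their boxes overlap heavily for a split second, and motion cannot tell who is who afterwards. Appearance embeddings help recover the correct identities once the boxes separate.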

Modern variants

StrongSORT, BoT-SORT, and OC-SORT refine association and camera-motion handling; check the MOTChallenge leaderboards for current results.

Libraries

Reference repos implement full SORT/DeepSORT (track lifecycle, cascade matching). Ultralytics and BoxMOT-style packages integrate trackers with YOLO outputs. For learning, read the original papers then trace one clean Python implementation.

MOT metrics (names)

MOTA combines false positives, false negatives, and ID switches. IDF1 emphasizes identity consistency. HOTA unifies detection and association quality. Benchmarks: MOT17, MOT20, DanceTrack.
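For orientation, MOTA and IDF1 reduce to simple ratios once the error counts are known. The sketch below assumes the counts have already been accumulated over a whole sequence by an evaluation tool; the function and argument names are hypothetical.

```python
def mota(num_fn, num_fp, num_idsw, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with GT the total ground-truth boxes."""
    return 1.0 - (num_fn + num_fp + num_idsw) / num_gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN), using identity-level matches."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)

print(mota(num_fn=10, num_fp=5, num_idsw=5, num_gt=100))
print(idf1(idtp=80, idfp=20, idfn=20))
```

Because the error counts can exceed the number of ground-truth boxes, MOTA can be negative. ID switches enter it as just one term among three, which is one reason IDF1 and HOTA are reported alongside it.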

Takeaways

  • SORT = detector + Kalman + IoU + Hungarian + track management.
  • DeepSORT adds ReID embeddings for harder association.
  • Detector quality dominates overall MOT performance.

Quick FAQ

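If tracks fragment or swap identities, improve detector recall, tighten the Kalman noise settings, tune the IoU threshold, or upgrade to appearance-based association. Very fast motion calls for a higher frame rate or a better motion model.

If the detector is too slow to run on every frame, run it sparsely and propagate tracks with a Kalman filter or a correlation tracker between runs. This trades accuracy for speed; re-synchronize the tracks whenever the detector fires.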
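The sparse-detection idea can be illustrated with a single-object toy loop. Note the substitutions: plain constant-velocity extrapolation stands in for the Kalman or correlation tracker, and detect is a hypothetical callable that returns one [x1, y1, x2, y2] box for frame t. A real system would handle many objects and run association on each detector frame.

```python
import numpy as np

def track_with_sparse_detection(num_frames, detect, every=3):
    """Call detect(t) on every `every`-th frame and extrapolate in between."""
    last_det, last_t = None, None
    vel = np.zeros(4)            # per-frame change of [x1, y1, x2, y2]
    boxes = []
    for t in range(num_frames):
        if t % every == 0:
            # Detector frame: re-sync the box and refresh the velocity estimate.
            det = np.asarray(detect(t), dtype=float)
            if last_det is not None:
                vel = (det - last_det) / (t - last_t)
            last_det, last_t = det, t
            boxes.append(det)
        else:
            # Skipped frame: propagate from the last detection.
            boxes.append(last_det + vel * (t - last_t))
    return boxes
```

With every=3 the detector runs on one third of the frames. Until the second detection arrives the velocity estimate is zero, so the first gap simply holds the last box.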