SORT pipeline
- Detect: bounding boxes + scores from YOLO, Faster R-CNN, etc.
- Predict: each active track advances its Kalman state (e.g. box center, scale, and their velocities) to the current frame.
- Associate: build a cost matrix (often 1 − IoU between predicted boxes and detections) and solve the assignment with the Hungarian algorithm (linear sum assignment).
- Update: matched tracks get a Kalman correction; unmatched detections spawn new tracks; unmatched tracks age out after T frames.
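The Predict and Update steps can be sketched with a constant-velocity Kalman filter. This is a minimal sketch, assuming a SORT-style state of box center, area, aspect ratio, and the velocities of the first three. The helper names (`box_to_z`, `predict`, `update`) and the noise values in `Q`, `R`, and the initial `P` are illustrative assumptions, not tuned settings from any reference implementation.

```python
import numpy as np

DIM_X, DIM_Z = 7, 4  # state: [cx, cy, s, r, vcx, vcy, vs]; measurement: [cx, cy, s, r]

# Constant-velocity transition: cx, cy and s advance by their velocity each frame.
F = np.eye(DIM_X)
F[0, 4] = F[1, 5] = F[2, 6] = 1.0

# Measurement matrix: selects the observed components of the state.
H = np.zeros((DIM_Z, DIM_X))
H[:, :DIM_Z] = np.eye(DIM_Z)

# Process and measurement noise (illustrative values, assumed for this sketch).
Q = np.eye(DIM_X) * 0.01
R = np.eye(DIM_Z) * 1.0


def box_to_z(box):
    """Convert [x1, y1, x2, y2] to [cx, cy, area, aspect_ratio]."""
    w, h = box[2] - box[0], box[3] - box[1]
    return np.array([box[0] + w / 2.0, box[1] + h / 2.0, w * h, w / h])


def predict(x, P):
    """Advance the state and its covariance by one frame."""
    return F @ x, F @ P @ F.T + Q


def update(x, P, z):
    """Correct the predicted state with a matched detection z."""
    y = z - H @ x                    # innovation
    S = H @ P @ H.T + R              # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
    x = x + K @ y
    P = (np.eye(DIM_X) - K @ H) @ P
    return x, P


# A track initialised from one detection, then corrected by a detection shifted right.
x = np.concatenate([box_to_z([0, 0, 10, 20]), np.zeros(3)])
P = np.eye(DIM_X) * 10.0
x, P = predict(x, P)
x, P = update(x, P, box_to_z([2, 0, 12, 20]))
```

After the correction, the center estimate lies between the prediction and the new detection, and the filter has inferred a positive horizontal velocity that the next `predict` call will use.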
IoU cost and assignment (sketch)
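```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1 = max(a[0], b[0]); y1 = max(a[1], b[1])
    x2 = min(a[2], b[2]); y2 = min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / (union + 1e-6)

# pred_boxes[i], det_boxes[j] as [x1, y1, x2, y2] (example values)
pred_boxes = [[0, 0, 10, 10], [20, 20, 30, 30]]
det_boxes = [[21, 19, 31, 29], [1, 0, 11, 10]]

GATED = 99.0  # large cost so the solver avoids low-overlap pairs

# cost[i, j] = 1 - iou(pred_i, det_j); gated to a large cost if iou <= 0.1
cost = np.full((len(pred_boxes), len(det_boxes)), GATED)
for i, p in enumerate(pred_boxes):
    for j, d in enumerate(det_boxes):
        v = iou(p, d)
        if v > 0.1:
            cost[i, j] = 1.0 - v

row_ind, col_ind = linear_sum_assignment(cost)
# The solver can still pair gated entries, so keep only the valid matches
matches = [(i, j) for i, j in zip(row_ind, col_ind) if cost[i, j] < GATED]
```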
Production code vectorizes the IoU computation and discards assigned pairs whose IoU falls below a minimum threshold; the loop above only shows the idea.
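A vectorized variant might look like the following. It is a sketch under stated assumptions: the function names `iou_matrix` and `match` are hypothetical, and the default `iou_min=0.3` is an illustrative threshold to tune per dataset. NumPy broadcasting replaces the double loop, and matches are filtered after the assignment rather than by inflating the cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou_matrix(preds, dets):
    """Pairwise IoU between (N, 4) and (M, 4) arrays of [x1, y1, x2, y2] boxes."""
    preds = np.asarray(preds, dtype=float).reshape(-1, 4)
    dets = np.asarray(dets, dtype=float).reshape(-1, 4)
    p = preds[:, None, :]  # (N, 1, 4)
    d = dets[None, :, :]   # (1, M, 4)
    # Width and height of each pairwise intersection, clipped at zero.
    w = np.clip(np.minimum(p[..., 2], d[..., 2]) - np.maximum(p[..., 0], d[..., 0]), 0, None)
    h = np.clip(np.minimum(p[..., 3], d[..., 3]) - np.maximum(p[..., 1], d[..., 1]), 0, None)
    inter = w * h
    area_p = (p[..., 2] - p[..., 0]) * (p[..., 3] - p[..., 1])
    area_d = (d[..., 2] - d[..., 0]) * (d[..., 3] - d[..., 1])
    return inter / (area_p + area_d - inter + 1e-6)


def match(preds, dets, iou_min=0.3):
    """Return (matches, unmatched_pred_indices, unmatched_det_indices)."""
    ious = iou_matrix(preds, dets)
    rows, cols = linear_sum_assignment(1.0 - ious)
    # Reject assigned pairs whose overlap is too small to be the same object.
    matches = [(int(i), int(j)) for i, j in zip(rows, cols) if ious[i, j] >= iou_min]
    matched_p = {i for i, _ in matches}
    matched_d = {j for _, j in matches}
    unmatched_p = [i for i in range(ious.shape[0]) if i not in matched_p]
    unmatched_d = [j for j in range(ious.shape[1]) if j not in matched_d]
    return matches, unmatched_p, unmatched_d


preds = [[0, 0, 10, 10], [20, 20, 30, 30]]
dets = [[21, 19, 31, 29], [1, 0, 11, 10], [50, 50, 60, 60]]
matches, unmatched_p, unmatched_d = match(preds, dets)
```

In this example both predictions find their detection, and the third detection is left unmatched, which is the case where a tracker would spawn a new track.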
DeepSORT: motion + appearance
DeepSORT keeps a Kalman state over box center, aspect ratio, and height (plus their velocities) and gates associations with the Mahalanobis distance between each track's predicted distribution, projected into measurement space, and the detections. Each detection's bounding-box crop (pedestrians, in the original work) is passed through a small CNN that outputs an L2-normalized embedding; the cosine distance between a track's gallery of past embeddings and the detection's embedding provides a second association cue. The final cost blends motion and appearance, with tuned weights and gates.
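The two cues and their blend can be sketched as follows for a single track. This is a minimal illustration, not the reference implementation: the function names, the weight `lam`, the appearance threshold `max_app`, and the `gated` sentinel are assumptions chosen for the example. The motion gate uses the 0.95 quantile of the chi-square distribution with 4 degrees of freedom, which fits a 4-dimensional measurement.

```python
import numpy as np

CHI2_95_4DOF = 9.4877  # 0.95 quantile of chi-square with 4 degrees of freedom


def mahalanobis_sq(mean, cov, measurements):
    """Squared Mahalanobis distance from one track's projected state to each measurement.

    mean: (4,), cov: (4, 4), measurements: (M, 4).
    """
    diff = measurements - mean
    sol = np.linalg.solve(cov, diff.T)  # shape (4, M)
    return np.sum(diff.T * sol, axis=0)


def cosine_distance(gallery, features):
    """Smallest cosine distance from any gallery embedding to each detection embedding.

    Both inputs are assumed to be L2-normalized: gallery (K, D), features (M, D).
    """
    return np.min(1.0 - gallery @ features.T, axis=0)


def blended_cost(d_motion, d_app, lam=0.5, max_app=0.2, gated=1e5):
    """Weighted sum of both cues; pairs failing either gate get a large cost."""
    cost = lam * d_motion + (1.0 - lam) * d_app
    admissible = (d_motion <= CHI2_95_4DOF) & (d_app <= max_app)
    return np.where(admissible, cost, gated)


# One track (projected mean and covariance over [cx, cy, aspect, height]) vs. two detections.
mean = np.array([50.0, 50.0, 0.5, 100.0])
cov = np.diag([4.0, 4.0, 0.01, 16.0])
measurements = np.array([[51.0, 50.0, 0.5, 100.0],
                         [80.0, 50.0, 0.5, 100.0]])
gallery = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy 2-D embeddings
features = np.array([[1.0, 0.0], [0.6, 0.8]])

d_motion = mahalanobis_sq(mean, cov, measurements)
d_app = cosine_distance(gallery, features)
cost = blended_cost(d_motion, d_app)
```

The first detection is close in both position and appearance and receives a small cost; the second is far outside the motion gate and is excluded regardless of its embedding.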
Why embeddings?
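IoU alone becomes ambiguous when people cross paths: overlapping boxes give similar costs for several tracks, so identities can swap. Appearance embeddings help recover the right identity after a brief overlap.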
Modern variants
StrongSORT, BoT-SORT, and OC-SORT refine the association step, and some add camera-motion compensation; check the MOTChallenge leaderboards to see how they compare.
Libraries
Reference repos implement full SORT/DeepSORT (track lifecycle, cascade matching). Ultralytics and BoxMOT-style packages integrate trackers with YOLO outputs. For learning, read the original papers then trace one clean Python implementation.
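When tracing an implementation, the track lifecycle is a good place to start. The sketch below shows one plausible shape for it, assuming the association step has already produced `(track_index, detection_index)` pairs. The `Track` fields, the `step` function, and the `max_age` parameter are hypothetical names for illustration and do not mirror any particular repository.

```python
from dataclasses import dataclass
from itertools import count

_next_id = count(1)  # monotonically increasing track IDs


@dataclass
class Track:
    box: list
    track_id: int
    hits: int = 1     # number of frames with a matched detection
    misses: int = 0   # consecutive frames without a match


def step(tracks, detections, matches, max_age=3):
    """Apply one frame of lifecycle rules and return the surviving tracks."""
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    # Matched tracks take the new box and reset their miss counter.
    for t, d in matches:
        tracks[t].box = detections[d]
        tracks[t].hits += 1
        tracks[t].misses = 0
    # Unmatched tracks age; they are dropped once misses exceeds max_age.
    for t, trk in enumerate(tracks):
        if t not in matched_t:
            trk.misses += 1
    survivors = [trk for trk in tracks if trk.misses <= max_age]
    # Unmatched detections spawn new tracks.
    for d, det in enumerate(detections):
        if d not in matched_d:
            survivors.append(Track(box=det, track_id=next(_next_id)))
    return survivors


# Frame 1: one detection, nothing to match against yet.
tracks = step([], [[0, 0, 10, 10]], matches=[])
# Frame 2: the first track is matched and a second object appears.
tracks = step(tracks, [[1, 0, 11, 10], [40, 40, 50, 50]], matches=[(0, 0)])
ids_after_frame2 = [t.track_id for t in tracks]
hits_after_frame2 = tracks[0].hits
# Two frames with no detections: with max_age=1 both tracks are dropped.
for _ in range(2):
    tracks = step(tracks, [], [], max_age=1)
```

Full implementations typically add a confirmation rule on top of this (a track is only reported after a minimum number of hits), which the `hits` counter is there to support.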
MOT metrics (names)
MOTA combines false positives, false negatives, and ID switches. IDF1 emphasizes identity consistency. HOTA unifies detection and association quality. Benchmarks: MOT17, MOT20, DanceTrack.
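For reference, the standard definitions behind these names can be written as small functions. The counts in the example are made-up numbers for illustration, not results from any benchmark, and the function names are hypothetical. The HOTA helper computes the score at a single localization threshold; the reported HOTA averages this value over a range of thresholds.

```python
import math


def mota(fn, fp, idsw, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    return 1.0 - (fn + fp + idsw) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN), from the identity-level matching."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


def hota_at_threshold(det_a, ass_a):
    """HOTA at one localization threshold: geometric mean of DetA and AssA."""
    return math.sqrt(det_a * ass_a)


# Made-up counts, purely to show the arithmetic.
example_mota = mota(fn=100, fp=50, idsw=10, num_gt=1000)
example_idf1 = idf1(idtp=800, idfp=100, idfn=200)
example_hota = hota_at_threshold(det_a=0.64, ass_a=0.81)
```

Note that MOTA can go negative when the total error count exceeds the number of ground-truth objects, whereas IDF1 and HOTA stay within [0, 1].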
Takeaways
- SORT = detector + Kalman + IoU + Hungarian + track management.
- DeepSORT adds ReID embeddings for harder association.
- Detector quality dominates overall MOT performance.