SORT and DeepSORT: CV guide

SORT pipeline

Detect: get bounding boxes and confidence scores from a detector such as YOLO or Faster R-CNN.
Predict: each active track advances its Kalman state one frame. In SORT the state holds the box center, scale (area), and aspect ratio, plus velocities for the center and scale.
Associate: build a cost matrix (often 1 − IoU between predicted boxes and detections) and solve the assignment with the Hungarian algorithm (linear sum assignment).
Update: matched tracks get a Kalman correction; unmatched detections spawn new tracks; unmatched tracks are dropped after T consecutive frames without a match.
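
The Predict and Update steps can be sketched with a constant-velocity Kalman filter. This is a simplified illustration, not the SORT reference implementation: it tracks only the box center with state [cx, cy, vx, vy], and the class name CenterKalman and the noise values q and r are assumptions chosen for the example.

```python
import numpy as np

class CenterKalman:
    """Constant-velocity Kalman filter over a box center (simplified sketch)."""

    def __init__(self, cx, cy, q=1e-2, r=1.0):
        self.x = np.array([cx, cy, 0.0, 0.0])            # position and velocity
        self.P = np.eye(4) * 10.0                        # initial uncertainty
        self.F = np.array([[1, 0, 1, 0],
                           [0, 1, 0, 1],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)   # one-frame time step
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # only position is observed
        self.Q = np.eye(4) * q                           # process noise (assumed)
        self.R = np.eye(2) * r                           # measurement noise (assumed)

    def predict(self):
        """Advance the state one frame and return the predicted center."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        """Correct the state with a matched detection center z = [cx, cy]."""
        y = np.asarray(z, dtype=float) - self.H @ self.x  # innovation
        S = self.H @ self.P @ self.H.T + self.R           # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)          # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

A tracker would call predict() on every track each frame and update() only on tracks that received a matched detection; unmatched tracks keep coasting on their predictions until they age out.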

IoU cost and assignment (sketch)

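import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1 = max(a[0], b[0]); y1 = max(a[1], b[1])
    x2 = min(a[2], b[2]); y2 = min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / (union + 1e-6)

# Made-up example inputs: pred_boxes[i], det_boxes[j] as [x1,y1,x2,y2]
pred_boxes = [[0, 0, 10, 10], [20, 20, 30, 30]]
det_boxes = [[21, 21, 31, 31], [1, 0, 11, 10]]

IOU_MIN = 0.1   # pairs with less overlap than this are gated out
GATED = 99.0    # large cost so the solver avoids gated pairs

# cost[i,j] = 1 - iou(pred_i, det_j), or GATED if iou <= IOU_MIN
cost = np.full((len(pred_boxes), len(det_boxes)), GATED)
for i, p in enumerate(pred_boxes):
    for j, d in enumerate(det_boxes):
        v = iou(p, d)
        if v > IOU_MIN:
            cost[i, j] = 1.0 - v

row_ind, col_ind = linear_sum_assignment(cost)
# The solver can still pair gated entries, so drop them afterwards.
matches = [(i, j) for i, j in zip(row_ind, col_ind) if cost[i, j] < GATED]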

Production code vectorizes the IoU computation and rejects assigned pairs whose IoU falls below a minimum threshold; the loop above only shows the idea.
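
A vectorized version might look like the following. It is a minimal sketch rather than any particular library's code: the function names iou_matrix and associate, and the default iou_min of 0.3, are assumptions made for this example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(preds, dets):
    """Pairwise IoU between (N, 4) and (M, 4) arrays of [x1, y1, x2, y2] boxes."""
    preds = np.asarray(preds, dtype=float).reshape(-1, 4)
    dets = np.asarray(dets, dtype=float).reshape(-1, 4)
    # Broadcasting gives the intersection corners for every (pred, det) pair.
    x1 = np.maximum(preds[:, None, 0], dets[None, :, 0])
    y1 = np.maximum(preds[:, None, 1], dets[None, :, 1])
    x2 = np.minimum(preds[:, None, 2], dets[None, :, 2])
    y2 = np.minimum(preds[:, None, 3], dets[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_p = (preds[:, 2] - preds[:, 0]) * (preds[:, 3] - preds[:, 1])
    area_d = (dets[:, 2] - dets[:, 0]) * (dets[:, 3] - dets[:, 1])
    union = area_p[:, None] + area_d[None, :] - inter
    return inter / (union + 1e-6)

def associate(preds, dets, iou_min=0.3):
    """Return (matches, unmatched_tracks, unmatched_dets) as index lists."""
    iou = iou_matrix(preds, dets)
    rows, cols = linear_sum_assignment(1.0 - iou)
    # Keep only assigned pairs that overlap enough; the rest count as unmatched.
    matches = [(int(r), int(c)) for r, c in zip(rows, cols) if iou[r, c] >= iou_min]
    used_r = {r for r, _ in matches}
    used_c = {c for _, c in matches}
    unmatched_tracks = [i for i in range(iou.shape[0]) if i not in used_r]
    unmatched_dets = [j for j in range(iou.shape[1]) if j not in used_c]
    return matches, unmatched_tracks, unmatched_dets
```

The three returned lists map directly onto the Update step: matches get a Kalman correction, unmatched detections spawn tracks, and unmatched tracks age.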

DeepSORT: motion + appearance

DeepSORT keeps a Kalman state defined in image space (box center, aspect ratio, and height, plus their velocities) and uses a Mahalanobis distance between each track's predicted measurement distribution and a detection to gate out implausible matches. The crop of each pedestrian bounding box is passed through a small CNN that outputs an L2-normalized embedding; the cosine distance between a detection's embedding and a track's gallery of recent embeddings provides a second association cue. The final cost blends motion and appearance, with tuned weights and gates.
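
The gating and blending for one track/detection pair can be sketched as below. This is an illustration under stated assumptions, not the reference DeepSORT code: the function names, the blend weight lam, the appearance threshold max_cos, and the sentinel cost big are all hypothetical values for the example. The motion gate uses the 95% chi-square quantile for a 4-D measurement.

```python
import numpy as np
from scipy.stats import chi2

GATE = chi2.ppf(0.95, df=4)  # about 9.49 for a 4-D measurement

def mahalanobis_sq(z, mean, cov):
    """Squared Mahalanobis distance of measurement z from N(mean, cov)."""
    d = np.asarray(z, dtype=float) - np.asarray(mean, dtype=float)
    return float(d @ np.linalg.solve(cov, d))

def cosine_distance(gallery, feat):
    """Smallest cosine distance between a detection feature and a track gallery.

    Assumes every vector is already L2-normalized, so the dot product
    equals the cosine similarity.
    """
    sims = np.asarray(gallery, dtype=float) @ np.asarray(feat, dtype=float)
    return float(np.min(1.0 - sims))

def blended_cost(z, mean, cov, gallery, feat, lam=0.5, max_cos=0.2, big=1e5):
    """Blend motion and appearance; return `big` if either gate rejects the pair."""
    d_motion = mahalanobis_sq(z, mean, cov)
    d_app = cosine_distance(gallery, feat)
    if d_motion > GATE or d_app > max_cos:
        return big  # inadmissible pair
    return lam * d_motion + (1.0 - lam) * d_app
```

Filling a track-by-detection matrix with blended_cost and passing it to a linear assignment solver gives the association; pairs that come back with cost big should be treated as unmatched. Note that the two distances live on different scales, which is one reason the weight needs tuning.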

Why embeddings?

IoU alone fails when people cross paths: their boxes overlap and the motion cue becomes ambiguous. Appearance embeddings help recover the correct identities after a brief overlap.

Modern variants

StrongSORT, BoT-SORT, and OC-SORT refine association and camera-motion handling; check the MOTChallenge leaderboards for current results.

Libraries

Reference repos implement full SORT/DeepSORT (track lifecycle, cascade matching). Ultralytics and BoxMOT-style packages integrate trackers with YOLO outputs. For learning, read the original papers then trace one clean Python implementation.

MOT metrics (names)

MOTA combines false positives, false negatives, and ID switches. IDF1 emphasizes identity consistency. HOTA unifies detection and association quality. Benchmarks: MOT17, MOT20, DanceTrack.
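
For reference, the standard definitions of MOTA and IDF1 reduce to simple ratios once the error counts are known. The sketch below only evaluates those formulas; the function names are assumptions, and producing the counts themselves (matching tracks to ground truth) is the hard part that evaluation toolkits handle.

```python
def mota(fn, fp, idsw, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    return 1.0 - (fn + fp + idsw) / num_gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2 * idtp / (2 * idtp + idfp + idfn)

# Invented counts, purely to show the arithmetic.
print(round(mota(fn=100, fp=50, idsw=10, num_gt=1000), 4))
print(round(idf1(idtp=800, idfp=100, idfn=200), 4))
```

MOTA can go negative when the total number of errors exceeds the number of ground-truth boxes, and because ID switches are usually a small share of the error sum, it mostly reflects detection quality. That is why IDF1 or HOTA is typically reported alongside it.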

Takeaways

SORT = detector + Kalman + IoU + Hungarian + track management.
DeepSORT adds ReID embeddings for harder association.
Detector quality dominates overall MOT performance.

Quick FAQ

ID switches still high?

Improve detector recall, tighten the Kalman noise settings, tune the IoU threshold, or upgrade to appearance-based association. Very fast motion needs a higher frame rate or a better motion model.

Track without running the detector every frame?

Run the detector sparsely and propagate tracks with Kalman or correlation trackers between runs. This trades accuracy for speed; re-sync the tracks whenever the detector fires.
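
One way to run the detector sparsely and coast between detections is sketched below. It is deliberately minimal and rests on several assumptions: a single object, a hypothetical detect(frame) callable that returns [x1, y1, x2, y2] or None, and plain constant-velocity extrapolation standing in for a Kalman or correlation tracker. The names track_sparse and stride are made up for the example.

```python
def track_sparse(frames, detect, stride=5):
    """Yield (index, box) per frame, calling detect() only every `stride` frames.

    `detect` is a hypothetical callable returning [x1, y1, x2, y2] or None.
    """
    box = None                 # current box estimate
    vel = [0.0] * 4            # per-frame change of each coordinate
    last_det, last_idx = None, None
    for idx, frame in enumerate(frames):
        det = detect(frame) if idx % stride == 0 else None
        if det is not None:
            if last_det is not None:
                gap = idx - last_idx
                # Estimate velocity from the last two detections.
                vel = [(d - p) / gap for d, p in zip(det, last_det)]
            # Re-sync the estimate to the fresh detection.
            box, last_det, last_idx = list(det), list(det), idx
        elif box is not None:
            # No detector this frame: coast on the velocity estimate.
            box = [b + v for b, v in zip(box, vel)]
        yield idx, box
```

With stride=5 the detector runs on one frame in five. The coasted boxes drift whenever the object accelerates, which is the accuracy cost of the speedup, and each detector call pulls the estimate back.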