Computer Vision Chapter 38

Action recognition

Action recognition assigns a label to a short video clip (e.g. “playing guitar”, “running”). Unlike single-image classification, models must aggregate information over time. Classic approaches: two-stream networks (RGB appearance + stacked optical flow), 3D convolutions (C3D, I3D, R(2+1)D) that extend 2D kernels across time, and modern video transformers / large multimodal encoders. Datasets such as Kinetics drive benchmarks. Below: the tensor shape for clips and torchvision 3D ResNet inference.

Clip tensor shape

Many 3D CNNs expect input [N, C, T, H, W]: batch, channels (usually 3), T RGB frames, height, width. Uniformly sample or stride frames across the segment, resize/crop to the model’s training resolution (often 112×112 or 224×224).

import torch

# Example: 16 frames, 3 channels, 112x112
clip = torch.randn(1, 3, 16, 112, 112)
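
The "uniformly sample or stride frames" step above can be sketched as follows. This is a minimal illustration, not a library API: `uniform_frame_indices` is a hypothetical helper, and the random tensor stands in for decoded video frames.

```python
import torch

def uniform_frame_indices(num_video_frames, t):
    # Pick t indices spread evenly across the video, one per equal-width segment.
    step = num_video_frames / t
    return [min(int(step * i + step / 2), num_video_frames - 1) for i in range(t)]

idx = uniform_frame_indices(300, 16)            # 16 of 300 decoded frames
frames = torch.randn(300, 3, 112, 112)          # stand-in for frames, [T_all, C, H, W]
clip = frames[idx].permute(1, 0, 2, 3).unsqueeze(0)  # -> [1, 3, 16, 112, 112]
```

Striding (every k-th frame) is the other common choice; uniform sampling covers the whole segment regardless of its length.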

torchvision: R3D-18 (Kinetics-400)

from torchvision.models.video import r3d_18, R3D_18_Weights

weights = R3D_18_Weights.KINETICS400_V1
model = r3d_18(weights=weights).eval()
preprocess = weights.transforms()

# clip must be [1, 3, T, H, W]; run raw frames through preprocess first
# so they match the normalization and resolution the weights were trained with.
with torch.no_grad():
    logits = model(clip)
probs = logits.softmax(1)
top = int(probs.argmax(1))
print(weights.meta["categories"][top])

Check weights.transforms() for required T and spatial size for your torchvision version.

Other torchvision video models

from torchvision.models.video import mc3_18, s3d, MC3_18_Weights, S3D_Weights

# Mixed convolution (MC3), Separable 3D (S3D) — same pattern as R3D: weights + transforms()
# Newer torchvision builds may also expose MViT-style video transformers; check the docs.

API names vary slightly by torchvision release; run dir(torchvision.models.video) locally to see what is available.

Two-stream idea

One stream ingests RGB frames for appearance; another ingests stacked optical flow volumes for motion. Late fusion averages the two streams’ logits, or learns a weighting to combine them. It remains a strong conceptual baseline from before end-to-end 3D nets came to dominate many benchmarks.
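
The late-fusion step can be sketched in a few lines. The random tensors below are stand-ins for the outputs of an RGB network and a flow network run on the same clip; this is an illustration of the fusion arithmetic, not a full two-stream implementation.

```python
import torch

num_classes = 400  # e.g. Kinetics-400

# Stand-ins for per-stream outputs on one clip.
rgb_logits = torch.randn(1, num_classes)
flow_logits = torch.randn(1, num_classes)

# Simple late fusion: average the logits (a learned weighted sum also works).
fused = (rgb_logits + flow_logits) / 2
pred = int(fused.argmax(1))
```
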

Takeaways

  • Temporal receptive field matters: short clips may miss context.
  • Pretrained Kinetics weights transfer to smaller datasets via fine-tuning.
  • Consider compute: 3D convs and transformers are heavier than 2D per-frame models.

Quick FAQ

Q: What if a video is longer than the model’s clip length T?
A: Sample a fixed T, run multiple clips and average the logits, or use a model with temporal pooling / attention over many frames.
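
The multi-clip strategy can be sketched as follows. The tiny `Sequential` model is a stand-in so the example is self-contained; `multi_clip_logits` is an illustrative helper, not a torchvision API.

```python
import torch

def multi_clip_logits(model, clips):
    # clips: list of [1, C, T, H, W] tensors sampled from one video.
    with torch.no_grad():
        return torch.stack([model(c) for c in clips]).mean(0)

# Toy stand-in for a video classifier (flatten + linear to 400 classes).
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 16 * 32 * 32, 400))
clips = [torch.randn(1, 3, 16, 32, 32) for _ in range(3)]
logits = multi_clip_logits(model, clips)  # [1, 400], averaged over the 3 clips
```
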

Q: How do I3D and R(2+1)D differ?
A: I3D inflates 2D Inception weights to 3D; R(2+1)D factorizes each 3D kernel into a 2D spatial convolution plus a 1D temporal convolution for efficiency. Both are standard 3D CNN families.
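
The R(2+1)D factorization can be sketched directly in PyTorch: a (1, 3, 3) spatial convolution followed by a (3, 1, 1) temporal one, with a nonlinearity in between. `Conv2Plus1D` is an illustrative module, and the intermediate width 45 is an arbitrary choice here (the paper picks it to roughly match the full 3D kernel’s parameter count).

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Sequential):
    # (2+1)D factorization of a 3x3x3 conv: 2D spatial, then 1D temporal.
    def __init__(self, in_ch, out_ch, mid_ch):
        super().__init__(
            nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),
            nn.BatchNorm3d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False),
        )

x = torch.randn(1, 3, 16, 112, 112)
y = Conv2Plus1D(3, 64, 45)(x)  # padding preserves T, H, W -> [1, 64, 16, 112, 112]
```

The extra nonlinearity between the two convolutions is part of why the factorized form can outperform a plain 3D kernel at similar parameter counts.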