Computer Vision Chapter 38

Action recognition

Action recognition assigns a label to a short video clip (e.g. “playing guitar”, “running”). Unlike single-image classification, models must aggregate information over time. Classic approaches: two-stream networks (RGB appearance + stacked optical flow), 3D convolutions (C3D, I3D, R(2+1)D) that extend 2D kernels across time, and modern video transformers / large multimodal encoders. Datasets such as Kinetics drive benchmarks. Below: the tensor shape for clips and torchvision 3D ResNet inference.

Clip tensor shape

Many 3D CNNs expect input [N, C, T, H, W]: batch, channels (usually 3), T RGB frames, height, width. Uniformly sample or stride frames across the segment, resize/crop to the model’s training resolution (often 112×112 or 224×224).

import torch

# Example: 16 frames, 3 channels, 112x112
clip = torch.randn(1, 3, 16, 112, 112)
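
The "uniformly sample or stride frames" step above can be sketched as follows. This is a minimal illustration, not a library API: `uniform_frame_indices` is a hypothetical helper, and the random tensor stands in for decoded video frames.

```python
import torch

def uniform_frame_indices(num_video_frames, t):
    # Pick t indices spread evenly across the video, one per equal-width segment.
    step = num_video_frames / t
    return [min(int(step * i + step / 2), num_video_frames - 1) for i in range(t)]

idx = uniform_frame_indices(300, 16)            # 16 of 300 decoded frames
frames = torch.randn(300, 3, 112, 112)          # stand-in for frames, [T_all, C, H, W]
clip = frames[idx].permute(1, 0, 2, 3).unsqueeze(0)  # -> [1, 3, 16, 112, 112]
```

Striding (every k-th frame) is the other common choice; uniform sampling covers the whole segment regardless of its length.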

torchvision: R3D-18 (Kinetics-400)

from torchvision.models.video import r3d_18, R3D_18_Weights

weights = R3D_18_Weights.KINETICS400_V1
model = r3d_18(weights=weights).eval()
preprocess = weights.transforms()

# clip must be [1, 3, T, H, W]; run raw frames through preprocess first
# so they match the normalization and resolution the weights were trained with.
with torch.no_grad():
    logits = model(clip)
probs = logits.softmax(1)
top = int(probs.argmax(1))
print(weights.meta["categories"][top])

Check weights.transforms() for required T and spatial size for your torchvision version.

Other torchvision video models

from torchvision.models.video import mc3_18, s3d, MC3_18_Weights, S3D_Weights

# Mixed convolution (MC3), Separable 3D (S3D) — same pattern as R3D: weights + transforms()
# Newer torchvision builds may also expose MViT-style video transformers; check the docs.

API names vary slightly by torchvision release; run dir(torchvision.models.video) locally to see what is available.

Two-stream idea

One stream ingests RGB frames for appearance; another ingests stacked optical flow volumes for motion. Late fusion averages the two streams’ logits, or learns a weighting to combine them. It remains a strong conceptual baseline from before end-to-end 3D nets came to dominate many benchmarks.
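
The late-fusion step can be sketched in a few lines. The random tensors below are stand-ins for the outputs of an RGB network and a flow network run on the same clip; this is an illustration of the fusion arithmetic, not a full two-stream implementation.

```python
import torch

num_classes = 400  # e.g. Kinetics-400

# Stand-ins for per-stream outputs on one clip.
rgb_logits = torch.randn(1, num_classes)
flow_logits = torch.randn(1, num_classes)

# Simple late fusion: average the logits (a learned weighted sum also works).
fused = (rgb_logits + flow_logits) / 2
pred = int(fused.argmax(1))
```
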

Takeaways

  • Temporal receptive field matters: short clips may miss context.
  • Pretrained Kinetics weights transfer to smaller datasets via fine-tuning.
  • Consider compute: 3D convs and transformers are heavier than 2D per-frame models.

Quick FAQ

Q: What if a video is longer than the model’s clip length T?
A: Sample a fixed T, run multiple clips and average the logits, or use a model with temporal pooling / attention over many frames.
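
The multi-clip strategy can be sketched as follows. The tiny `Sequential` model is a stand-in so the example is self-contained; `multi_clip_logits` is an illustrative helper, not a torchvision API.

```python
import torch

def multi_clip_logits(model, clips):
    # clips: list of [1, C, T, H, W] tensors sampled from one video.
    with torch.no_grad():
        return torch.stack([model(c) for c in clips]).mean(0)

# Toy stand-in for a video classifier (flatten + linear to 400 classes).
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 16 * 32 * 32, 400))
clips = [torch.randn(1, 3, 16, 32, 32) for _ in range(3)]
logits = multi_clip_logits(model, clips)  # [1, 400], averaged over the 3 clips
```
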

Q: How do I3D and R(2+1)D differ?
A: I3D inflates 2D Inception weights to 3D; R(2+1)D factorizes each 3D kernel into a 2D spatial convolution plus a 1D temporal convolution for efficiency. Both are standard 3D CNN families.
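
The R(2+1)D factorization can be sketched directly in PyTorch: a (1, 3, 3) spatial convolution followed by a (3, 1, 1) temporal one, with a nonlinearity in between. `Conv2Plus1D` is an illustrative module, and the intermediate width 45 is an arbitrary choice here (the paper picks it to roughly match the full 3D kernel’s parameter count).

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Sequential):
    # (2+1)D factorization of a 3x3x3 conv: 2D spatial, then 1D temporal.
    def __init__(self, in_ch, out_ch, mid_ch):
        super().__init__(
            nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),
            nn.BatchNorm3d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False),
        )

x = torch.randn(1, 3, 16, 112, 112)
y = Conv2Plus1D(3, 64, 45)(x)  # padding preserves T, H, W -> [1, 64, 16, 112, 112]
```

The extra nonlinearity between the two convolutions is part of why the factorized form can outperform a plain 3D kernel at similar parameter counts.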