Clip tensor shape
Many 3D CNNs expect input of shape [N, C, T, H, W]: batch, channels (usually 3 for RGB), T frames, height, and width. Uniformly sample or stride frames across the segment, then resize/crop to the model’s training resolution (often 112×112 or 224×224).
import torch
# Example: 16 frames, 3 channels, 112x112
clip = torch.randn(1, 3, 16, 112, 112)
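If the raw video has more frames than T, a common recipe is to pick evenly spaced frame indices across the segment. A minimal sketch, continuing from the import above, with purely illustrative lengths (300 decoded frames, 16-frame clip):
num_video_frames = 300  # hypothetical decoded video length
T = 16                  # frames per clip the model expects
# Evenly spaced indices = uniform temporal sampling across the segment
idx = torch.linspace(0, num_video_frames - 1, steps=T).long()
# frames[idx] would then be resized/cropped to the training resolution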
torchvision: R3D-18 (Kinetics-400)
from torchvision.models.video import r3d_18, R3D_18_Weights
weights = R3D_18_Weights.KINETICS400_V1
model = r3d_18(weights=weights).eval()
preprocess = weights.transforms()
# clip here is the random [1, 3, 16, 112, 112] tensor from above; on real video,
# apply `preprocess` to the decoded frames first (see the preprocessing sketch below)
with torch.no_grad():
    logits = model(clip)
probs = logits.softmax(1)
top = int(probs.argmax(1))
print(weights.meta["categories"][top])
Check weights.transforms() for the required clip length T and spatial size in your torchvision version.
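A short sketch of the real preprocessing path, assuming your torchvision’s video preset accepts uint8 frames laid out as [T, C, H, W] (the random frame tensor is only there to show shapes):
frames = torch.randint(0, 256, (16, 3, 128, 171), dtype=torch.uint8)  # decoded frames [T, C, H, W]
clip = preprocess(frames).unsqueeze(0)  # resize/crop/normalize and permute -> [1, C, T, H, W]
with torch.no_grad():
    logits = model(clip)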
Other torchvision video models
from torchvision.models.video import mc3_18, s3d, MC3_18_Weights, S3D_Weights
# Mixed convolution (MC3), Separable 3D (S3D) — same pattern as R3D: weights + transforms()
# Newer torchvision builds may also expose MViT-style video transformers; check the docs.
API names vary slightly by PyTorch release; use dir(torchvision.models.video) locally.
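For instance, S3D follows the same load-and-preprocess pattern (assuming S3D_Weights.KINETICS400_V1 exists in your build; its clip length and crop size differ from R3D-18, so rely on the bundled transforms rather than hard-coding sizes):
from torchvision.models.video import s3d, S3D_Weights
s3d_weights = S3D_Weights.KINETICS400_V1
s3d_model = s3d(weights=s3d_weights).eval()
s3d_preprocess = s3d_weights.transforms()  # use this, not the R3D-18 sizes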
Two-stream idea
One stream ingests RGB frames for appearance; another ingests stacked optical-flow volumes for motion. Late fusion averages the two streams' logits or learns to combine them. Two-stream networks were a strong baseline on many benchmarks before end-to-end 3D nets took over, and they remain a useful conceptual reference.
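A toy sketch of late fusion by score averaging; the random tensors stand in for the outputs of hypothetical RGB and flow networks:
import torch
num_classes = 400
rgb_logits = torch.randn(1, num_classes)   # from the appearance (RGB) stream
flow_logits = torch.randn(1, num_classes)  # from the motion (stacked optical flow) stream
# Late fusion: average class probabilities; a learned combiner is the other common choice
fused = 0.5 * rgb_logits.softmax(1) + 0.5 * flow_logits.softmax(1)
pred = int(fused.argmax(1))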
Takeaways
- Temporal receptive field matters: short clips may miss context.
- Pretrained Kinetics weights transfer to smaller datasets via fine-tuning (see the sketch after this list).
- Consider compute: 3D convs and transformers are heavier than 2D per-frame models.
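A minimal fine-tuning sketch for the second takeaway, assuming a hypothetical 10-class target dataset; only the head is swapped and trained here, and unfreezing more layers later is common:
import torch
from torch import nn
from torchvision.models.video import r3d_18, R3D_18_Weights

ft_model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
ft_model.fc = nn.Linear(ft_model.fc.in_features, 10)  # replace the 400-way Kinetics head
# Freeze the backbone; train only the new classifier head
for name, p in ft_model.named_parameters():
    p.requires_grad = name.startswith("fc")
optimizer = torch.optim.AdamW((p for p in ft_model.parameters() if p.requires_grad), lr=1e-3)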
Quick FAQ
What if the video is much longer than the clip length T? Sample multiple clips and average their logits (sketched below), or use models with temporal pooling / attention over many frames.
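A sketch of the multi-clip option, reusing model and preprocess from the R3D-18 example above (the 300-frame random video and the choice of 5 clips are arbitrary):
import torch
T = 16
video = torch.randint(0, 256, (300, 3, 128, 171), dtype=torch.uint8)  # hypothetical decoded video [T_total, C, H, W]
starts = torch.linspace(0, video.shape[0] - T, steps=5).long().tolist()
with torch.no_grad():
    per_clip = [model(preprocess(video[s:s + T]).unsqueeze(0)) for s in starts]
avg_logits = torch.stack(per_clip).mean(dim=0)  # average logits over clips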