Computer Vision Interview: 20 Essential Q&A (Updated 2026)

Action Recognition: 20 Essential Q&A

Classify what happens in a clip—appearance, motion, and long-context modeling.

~12 min read · 20 questions · Advanced
Tags: Kinetics · two-stream · I3D · SlowFast
1 What is action recognition? ⚡ easy
Answer: Predict human action label(s) from a video clip—may include temporal localization and multi-person context.
2 Two-stream networks? 📊 medium
Answer: RGB CNN for appearance + flow CNN for motion; fuse scores—strong baseline before full 3D models.
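Late score fusion can be sketched in a few lines. This is a minimal sketch: the function name and the flow weight are illustrative choices, not values fixed by the two-stream paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_scores(rgb_logits, flow_logits, w_flow=1.5):
    """Late fusion: weighted sum of per-stream class probabilities.

    w_flow > 1 upweights the motion stream; treat it as a tunable
    hyperparameter, not a canonical value.
    """
    p = softmax(rgb_logits) + w_flow * softmax(flow_logits)
    return p.argmax(axis=-1)

rgb = np.array([[2.0, 1.0, 0.0]])   # appearance stream favors class 0
flow = np.array([[0.0, 1.0, 3.0]])  # motion stream strongly favors class 2
pred = fuse_scores(rgb, flow)
```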
3 C3D? 📊 medium
Answer: VGG-style 3×3×3 conv stacks on short clips—learns spatiotemporal filters end-to-end.
4 I3D? 🔥 hard
Answer: Inflate 2D ImageNet filters to 3D for initialization—bootstrap video nets from image pretraining, big jump on Kinetics.
# I3D: inflate 2D k×k filters to k×k×k, bootstrap from ImageNet
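The inflation trick in one function (a sketch; the function name is mine): repeat the pretrained 2D kernel along a new temporal axis and rescale by 1/t, so a static video initially produces the same activations the image model produced on a single frame.

```python
import numpy as np

def inflate_2d_to_3d(w2d, t):
    """Inflate a 2D conv kernel (out, in, k, k) to 3D (out, in, t, k, k).

    Repeating t copies and dividing by t preserves the response on
    temporally constant input, which is what makes ImageNet weights a
    sane initialization for a video network.
    """
    w3d = np.repeat(w2d[:, :, None, :, :], t, axis=2)
    return w3d / t

w2d = np.random.rand(8, 3, 3, 3)   # (out, in, k, k), e.g. 3x3 spatial kernel
w3d = inflate_2d_to_3d(w2d, t=3)   # (out, in, 3, k, k)
```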
5 TSN? 📊 medium
Answer: Sparse sampling of frames/segments across video; aggregate (avg/max) after shared CNN—covers long videos cheaply.
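The sparse sampling scheme is easy to sketch: split the video into equal segments and take one frame per segment (here the segment midpoint; TSN samples randomly within each segment at training time).

```python
def tsn_sample(num_frames, num_segments):
    """Return one frame index from the middle of each equal-length segment."""
    seg = num_frames / num_segments
    return [int(seg * i + seg / 2) for i in range(num_segments)]

# tsn_sample(300, 3) -> [50, 150, 250]
indices = tsn_sample(300, 3)
```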
6 SlowFast? 🔥 hard
Answer: Dual pathways with different frame rates and widths—efficient motion + semantics; widely used backbone.
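The two frame rates can be sketched as index selection (parameter names are mine; alpha=8 with a fast stride of 2 matches a typical SlowFast configuration, but both are tunable):

```python
def slowfast_indices(num_frames, alpha=8, fast_stride=2):
    """Frame indices for the two pathways.

    The fast pathway samples densely (every fast_stride frames); the slow
    pathway samples alpha times fewer frames but is given more channels.
    """
    fast = list(range(0, num_frames, fast_stride))
    slow = list(range(0, num_frames, fast_stride * alpha))
    return slow, fast

slow, fast = slowfast_indices(64)
```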
7 X3D? 📊 medium
Answer: Expand dimensions (depth, width, resolution, frames) systematically for accuracy–compute Pareto.
8 Kinetics? ⚡ easy
Answer: Large-scale YouTube action dataset (400/600/700 classes)—ImageNet moment for video recognition.
9 Something-Something? 📊 medium
Answer: Emphasizes fine-grained motion and object interactions—tests temporal reasoning more than appearance.
10 Early vs late fusion? 🔥 hard
Answer: Early fusion stacks frames/channels in the first layers; late fusion extracts per-frame features and aggregates them afterward—middle fusion via 3D convs is a common compromise.
11 LSTM over CNN features? 📊 medium
Answer: Encode temporal order over per-frame CNN features for variable-length clips—lighter than 3D convs, but can underfit complex motion compared with Transformers.
12 Space-time attention? 🔥 hard
Answer: Factorized or joint attention over patches and time (TimeSformer, Video Swin)—scalable with data.
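A shapes-only sketch of divided (factorized) space-time attention, assuming tokens are laid out frame-major as (batch, T*N, dim); the function name is mine. Temporal attention groups tokens by spatial location, spatial attention groups them by frame:

```python
import numpy as np

def divided_attention_views(x, T, N):
    """Reshape a token sequence for TimeSformer-style divided attention.

    x: (B, T*N, D). Returns:
      time_tokens:  (B*N, T, D) -- attend across time per spatial location
      space_tokens: (B*T, N, D) -- attend across space per frame
    Each view costs O(T^2) or O(N^2) instead of O((T*N)^2) for joint attention.
    """
    B, _, D = x.shape
    grid = x.reshape(B, T, N, D)
    time_tokens = grid.transpose(0, 2, 1, 3).reshape(B * N, T, D)
    space_tokens = grid.reshape(B * T, N, D)
    return time_tokens, space_tokens

x = np.zeros((2, 4 * 9, 8))  # 2 clips, 4 frames, 9 patches, dim 8
tt, st = divided_attention_views(x, T=4, N=9)
```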
13 Skeleton-based? 📊 medium
Answer: Graph convolutions on body joints—privacy-friendly, works with pose estimators; less appearance detail.
14 Long videos? 🔥 hard
Answer: Segment clips, hierarchical models, or memory—full movies need different pipelines than 10s Kinetics clips.
15 Temporal localization? 📊 medium
Answer: Detect action start/end (SSN, BMN) or per-frame labels—needed for untrimmed video.
16 Multi-label? 📊 medium
Answer: Sigmoid + BCE when several actions co-occur (cooking + talking)—different from single softmax.
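A minimal multi-label loss sketch: independent per-class sigmoids with binary cross-entropy, instead of one softmax over mutually exclusive classes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(logits, targets, eps=1e-7):
    """Mean binary cross-entropy over independent per-class sigmoids.

    targets is a {0,1} matrix where multiple columns may be 1 per clip
    (e.g. cooking AND talking). eps clips probabilities for numerical
    stability in the log.
    """
    p = np.clip(sigmoid(logits), eps, 1 - eps)
    return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))

good = multilabel_bce(np.array([[10.0, -10.0]]), np.array([[1.0, 0.0]]))
bad = multilabel_bce(np.array([[10.0, -10.0]]), np.array([[0.0, 1.0]]))
```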
17 Weak supervision? 🔥 hard
Answer: Train with video tags only (MIL) or narration—reduces frame-level annotation cost.
18 Real-time? ⚡ easy
Answer: Smaller X3D, MobileNet-3D, or keyframe-only inference—latency budgets for edge cameras.
19 Transfer learning? 📊 medium
Answer: ImageNet → inflate → Kinetics finetune → downstream (surgery, sports)—standard recipe.
20 Metrics? ⚡ easy
Answer: Top-1/5 accuracy; mAP for detection/localization; sometimes calibrated confidence for safety apps.
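Top-k accuracy is a few lines (a sketch; function name mine): the prediction counts as correct if the true label is among the k highest-scoring classes.

```python
import numpy as np

def topk_accuracy(logits, labels, k=1):
    """Fraction of clips whose true label is in the top-k predicted classes."""
    topk = np.argsort(-logits, axis=1)[:, :k]  # k best class indices per clip
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

logits = np.array([[0.1, 0.9, 0.2],
                   [0.8, 0.1, 0.05]])
acc_top1 = topk_accuracy(logits, [1, 0], k=1)
acc_miss = topk_accuracy(logits, [2, 0], k=1)
```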

Action Recognition Cheat Sheet

Classic
  • Two-stream
  • TSN
3D
  • I3D
  • SlowFast
Data
  • Kinetics

💡 Pro tip: Appearance + motion; I3D inflation; SlowFast dual paths.
