Computer Vision Interview · 20 Essential Q&A · Updated 2026
Action Recognition: 20 Essential Q&A
Classify what happens in a clip—appearance, motion, and long-context modeling.
~12 min read
20 questions
Advanced
Tags: Kinetics · two-stream · I3D · SlowFast
1
What is action recognition?
⚡ easy
Answer: Predict human action label(s) from a video clip—may include temporal localization and multi-person context.
2
Two-stream networks?
📊 medium
Answer: RGB CNN for appearance + flow CNN for motion; fuse scores—strong baseline before full 3D models.
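The late score fusion described above can be sketched in a few lines; the logits and the equal 0.5 weighting are illustrative assumptions, not values from any specific two-stream paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical per-class logits from the RGB (appearance) and flow (motion) streams
rgb_logits = np.array([2.0, 0.5, 1.0])
flow_logits = np.array([0.2, 2.5, 0.3])

# Late fusion: average the per-stream class probabilities, then take the argmax
fused = 0.5 * (softmax(rgb_logits) + softmax(flow_logits))
pred = int(np.argmax(fused))  # flow evidence flips the decision to class 1
```

Here the motion stream's confidence outweighs appearance; weighted averaging (e.g. giving flow a higher weight) is a common tuning knob.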
3
C3D?
📊 medium
Answer: VGG-style 3×3×3 conv stacks on short clips—learns spatiotemporal filters end-to-end.
4
I3D?
🔥 hard
Answer: Inflate 2D ImageNet filters to 3D for initialization—bootstrap video nets from image pretraining, big jump on Kinetics.
# I3D: inflate 2D k×k filters to k×k×k, bootstrap from ImageNet
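A minimal NumPy sketch of the inflation trick: tile the pretrained 2D filter along a new time axis and rescale by 1/t, so a temporally constant input yields the same activation the 2D net would produce. The filter values and temporal depth are toy assumptions.

```python
import numpy as np

def inflate_2d_filter(w2d, t):
    """Tile a 2D conv filter t times along a new time axis and divide by t,
    preserving the response on a video whose frames are all identical."""
    return np.repeat(w2d[np.newaxis, ...], t, axis=0) / t

w2d = np.arange(9.0).reshape(3, 3)   # toy 3x3 "ImageNet-pretrained" filter
w3d = inflate_2d_filter(w2d, t=3)    # -> shape (3, 3, 3)
```

Summing the inflated filter over time recovers the original 2D filter exactly, which is what makes this a sound initialization rather than a random one.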
5
TSN?
📊 medium
Answer: Sparse sampling of frames/segments across video; aggregate (avg/max) after shared CNN—covers long videos cheaply.
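The sparse sampling step can be sketched as below; this deterministic center-of-segment variant is a simplification (TSN samples randomly within each segment at train time), and the frame counts are illustrative.

```python
def tsn_sample(num_frames, num_segments):
    """Split the video into num_segments equal segments and return the
    center frame index of each, covering the whole video with few frames."""
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]

indices = tsn_sample(num_frames=300, num_segments=3)  # -> [50, 150, 250]
```

Each sampled frame goes through the shared CNN, and the per-segment scores are averaged or max-pooled into one video-level prediction.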
6
SlowFast?
🔥 hard
Answer: Dual pathways with different frame rates and widths—efficient motion + semantics; widely used backbone.
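The dual-rate sampling can be sketched as follows; the stride values and the ratio alpha are illustrative defaults, not the exact SlowFast configuration.

```python
def slowfast_indices(num_frames, fast_stride=2, alpha=4):
    """Fast pathway samples densely (every fast_stride frames) for motion;
    the slow pathway samples alpha times more sparsely for semantics."""
    fast = list(range(0, num_frames, fast_stride))
    slow = list(range(0, num_frames, fast_stride * alpha))
    return slow, fast

slow, fast = slowfast_indices(num_frames=32)  # 4 slow frames vs 16 fast frames
```

In the real model the fast pathway compensates for its higher frame rate with a much smaller channel width, and lateral connections fuse the two.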
7
X3D?
📊 medium
Answer: Expand dimensions (depth, width, resolution, frames) systematically for accuracy–compute Pareto.
8
Kinetics?
⚡ easy
Answer: Large-scale YouTube action dataset (400/600/700 classes)—ImageNet moment for video recognition.
9
Something-Something?
📊 medium
Answer: Emphasizes fine-grained motion and object interactions—tests temporal reasoning more than appearance.
10
Early vs late fusion?
🔥 hard
Answer: Early: stack frames/channels in first layers. Late: per-frame features then aggregate—middle fusion via 3D conv is common compromise.
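The shape bookkeeping behind the two fusion styles can be sketched as below; all dimensions and the per-frame encoder are hypothetical stand-ins.

```python
import numpy as np

T, C, H, W, D = 8, 3, 4, 4, 16           # frames, channels, spatial size, feature dim
clip = np.random.rand(T, C, H, W)

# Early fusion: stack frames into the channel axis before the first conv layer
early_in = clip.reshape(T * C, H, W)      # -> (24, 4, 4), a single 2D-conv input

# Late fusion: run a (hypothetical) per-frame encoder, then aggregate over time
frame_feats = np.random.rand(T, D)        # stand-in for per-frame CNN features
late_out = frame_feats.mean(axis=0)       # -> (16,), temporal average pooling
```

Early fusion lets the first layers see motion but fixes the clip length; late fusion handles variable length but only mixes temporal information at the very end, which is why middle fusion via 3D conv is a common compromise.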
11
LSTM over CNN features?
📊 medium
Answer: Encode temporal order over per-frame CNN features for variable-length clips; lighter than full 3D conv, but can underfit complex motion compared with Transformers.
12
Space-time attention?
🔥 hard
Answer: Factorized or joint attention over patches and time (TimeSformer, Video Swin)—scalable with data.
13
Skeleton-based?
📊 medium
Answer: Graph convolutions on body joints—privacy-friendly, works with pose estimators; less appearance detail.
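One graph-convolution step over a skeleton can be sketched as below, using the standard symmetric normalization with self-loops; the 3-joint chain and random features are toy assumptions, and real skeleton models (e.g. ST-GCN) add temporal convolutions on top.

```python
import numpy as np

def gcn_layer(X, A, W):
    """One graph-conv step over J joints: normalize adjacency A (with
    self-loops) symmetrically, then propagate joint features X (J x C)
    through a learned weight matrix W (C x C')."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # toy 3-joint chain
X = np.random.rand(3, 2)                                      # 2-d joint features
W = np.random.rand(2, 2)
out = gcn_layer(X, A, W)                                      # -> shape (3, 2)
```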
14
Long videos?
🔥 hard
Answer: Segment clips, hierarchical models, or memory—full movies need different pipelines than 10s Kinetics clips.
15
Temporal localization?
📊 medium
Answer: Detect action start/end (SSN, BMN) or per-frame labels—needed for untrimmed video.
16
Multi-label?
📊 medium
Answer: Sigmoid + BCE when several actions co-occur (cooking + talking)—different from single softmax.
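The sigmoid + BCE setup can be sketched as below; the class names, logits, and targets are illustrative, and a framework loss (e.g. a binary-cross-entropy-with-logits op) would be used in practice for numerical stability.

```python
import numpy as np

def multilabel_bce(logits, targets):
    """Sigmoid + binary cross-entropy per class, averaged: each action is an
    independent yes/no decision, unlike one softmax over exclusive classes."""
    p = 1.0 / (1.0 + np.exp(-logits))
    p = np.clip(p, 1e-7, 1 - 1e-7)        # guard the logs
    return float(-(targets * np.log(p) + (1 - targets) * np.log(1 - p)).mean())

logits = np.array([3.0, -2.0, 0.5])       # e.g. cooking, running, talking
targets = np.array([1.0, 0.0, 1.0])       # cooking and talking co-occur
loss = multilabel_bce(logits, targets)
```

A softmax head would force the probabilities to sum to one and could never score both co-occurring actions highly at once.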
17
Weak supervision?
🔥 hard
Answer: Train with video tags only (MIL) or narration—reduces frame-level annotation cost.
18
Real-time?
⚡ easy
Answer: Smaller X3D, MobileNet-3D, or keyframe-only inference—latency budgets for edge cameras.
19
Transfer learning?
📊 medium
Answer: ImageNet → inflate → Kinetics finetune → downstream (surgery, sports)—standard recipe.
20
Metrics?
⚡ easy
Answer: Top-1/5 accuracy; mAP for detection/localization; sometimes calibrated confidence for safety apps.
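Top-k accuracy can be sketched in a few lines; the score matrix and labels are toy values for illustration.

```python
import numpy as np

def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(scores, axis=1)[:, -k:]
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.2, 0.3]])
labels = np.array([1, 2])
top1 = topk_accuracy(scores, labels, k=1)  # -> 0.5 (second sample misses)
top2 = topk_accuracy(scores, labels, k=2)  # -> 1.0 (its label is 2nd-ranked)
```

For localization, mAP additionally requires matching predicted intervals to ground truth at a temporal-IoU threshold before scoring.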
Action Recognition Cheat Sheet
Classic
- Two-stream
- TSN
3D
- I3D
- SlowFast
Data
- Kinetics
💡 Pro tip: Appearance + motion; I3D inflation; SlowFast dual paths.