Computer Vision Interview: 20 Essential Q&A (Updated 2026)

Action Recognition: 20 Essential Q&A

Classify what happens in a clip—appearance, motion, and long-context modeling.

~12 min read · 20 questions · Advanced
Tags: Kinetics · two-stream · I3D · SlowFast
1 What is action recognition? ⚡ easy
Answer: Predict human action label(s) from a video clip—may include temporal localization and multi-person context.
2 Two-stream networks? 📊 medium
Answer: RGB CNN for appearance + flow CNN for motion; fuse scores—strong baseline before full 3D models.
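Late score fusion can be sketched in a few lines. This is a minimal sketch: the function name and the flow weight are illustrative choices, not values fixed by the two-stream paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_scores(rgb_logits, flow_logits, w_flow=1.5):
    """Late fusion: weighted sum of per-stream class probabilities.

    w_flow > 1 upweights the motion stream; treat it as a tunable
    hyperparameter, not a canonical value.
    """
    p = softmax(rgb_logits) + w_flow * softmax(flow_logits)
    return p.argmax(axis=-1)

rgb = np.array([[2.0, 1.0, 0.0]])   # appearance stream favors class 0
flow = np.array([[0.0, 1.0, 3.0]])  # motion stream strongly favors class 2
pred = fuse_scores(rgb, flow)
```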
3 C3D? 📊 medium
Answer: VGG-style 3×3×3 conv stacks on short clips—learns spatiotemporal filters end-to-end.
4 I3D? 🔥 hard
Answer: Inflate 2D ImageNet filters to 3D for initialization—bootstrap video nets from image pretraining, big jump on Kinetics.
# I3D: inflate 2D k×k filters to k×k×k, bootstrap from ImageNet
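The inflation trick in one function (a sketch; the function name is mine): repeat the pretrained 2D kernel along a new temporal axis and rescale by 1/t, so a static video initially produces the same activations the image model produced on a single frame.

```python
import numpy as np

def inflate_2d_to_3d(w2d, t):
    """Inflate a 2D conv kernel (out, in, k, k) to 3D (out, in, t, k, k).

    Repeating t copies and dividing by t preserves the response on
    temporally constant input, which is what makes ImageNet weights a
    sane initialization for a video network.
    """
    w3d = np.repeat(w2d[:, :, None, :, :], t, axis=2)
    return w3d / t

w2d = np.random.rand(8, 3, 3, 3)   # (out, in, k, k), e.g. 3x3 spatial kernel
w3d = inflate_2d_to_3d(w2d, t=3)   # (out, in, 3, k, k)
```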
5 TSN? 📊 medium
Answer: Sparse sampling of frames/segments across video; aggregate (avg/max) after shared CNN—covers long videos cheaply.
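The sparse sampling scheme is easy to sketch: split the video into equal segments and take one frame per segment (here the segment midpoint; TSN samples randomly within each segment at training time).

```python
def tsn_sample(num_frames, num_segments):
    """Return one frame index from the middle of each equal-length segment."""
    seg = num_frames / num_segments
    return [int(seg * i + seg / 2) for i in range(num_segments)]

# tsn_sample(300, 3) -> [50, 150, 250]
indices = tsn_sample(300, 3)
```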
6 SlowFast? 🔥 hard
Answer: Dual pathways with different frame rates and widths—efficient motion + semantics; widely used backbone.
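The two frame rates can be sketched as index selection (parameter names are mine; alpha=8 with a fast stride of 2 matches a typical SlowFast configuration, but both are tunable):

```python
def slowfast_indices(num_frames, alpha=8, fast_stride=2):
    """Frame indices for the two pathways.

    The fast pathway samples densely (every fast_stride frames); the slow
    pathway samples alpha times fewer frames but is given more channels.
    """
    fast = list(range(0, num_frames, fast_stride))
    slow = list(range(0, num_frames, fast_stride * alpha))
    return slow, fast

slow, fast = slowfast_indices(64)
```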
7 X3D? 📊 medium
Answer: Expand dimensions (depth, width, resolution, frames) systematically for accuracy–compute Pareto.
8 Kinetics? ⚡ easy
Answer: Large-scale YouTube action dataset (400/600/700 classes)—ImageNet moment for video recognition.
9 Something-Something? 📊 medium
Answer: Emphasizes fine-grained motion and object interactions—tests temporal reasoning more than appearance.
10 Early vs late fusion? 🔥 hard
Answer: Early fusion stacks frames/channels in the first layers; late fusion extracts per-frame features and aggregates them afterward—middle fusion via 3D convs is a common compromise.
11 LSTM over CNN features? 📊 medium
Answer: Encode temporal order over per-frame CNN features for variable-length clips—lighter than 3D convs, but can underfit complex motion compared with Transformers.
12 Space-time attention? 🔥 hard
Answer: Factorized or joint attention over patches and time (TimeSformer, Video Swin)—scalable with data.
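A shapes-only sketch of divided (factorized) space-time attention, assuming tokens are laid out frame-major as (batch, T*N, dim); the function name is mine. Temporal attention groups tokens by spatial location, spatial attention groups them by frame:

```python
import numpy as np

def divided_attention_views(x, T, N):
    """Reshape a token sequence for TimeSformer-style divided attention.

    x: (B, T*N, D). Returns:
      time_tokens:  (B*N, T, D) -- attend across time per spatial location
      space_tokens: (B*T, N, D) -- attend across space per frame
    Each view costs O(T^2) or O(N^2) instead of O((T*N)^2) for joint attention.
    """
    B, _, D = x.shape
    grid = x.reshape(B, T, N, D)
    time_tokens = grid.transpose(0, 2, 1, 3).reshape(B * N, T, D)
    space_tokens = grid.reshape(B * T, N, D)
    return time_tokens, space_tokens

x = np.zeros((2, 4 * 9, 8))  # 2 clips, 4 frames, 9 patches, dim 8
tt, st = divided_attention_views(x, T=4, N=9)
```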
13 Skeleton-based? 📊 medium
Answer: Graph convolutions on body joints—privacy-friendly, works with pose estimators; less appearance detail.
14 Long videos? 🔥 hard
Answer: Segment clips, hierarchical models, or memory—full movies need different pipelines than 10s Kinetics clips.
15 Temporal localization? 📊 medium
Answer: Detect action start/end (SSN, BMN) or per-frame labels—needed for untrimmed video.
16 Multi-label? 📊 medium
Answer: Sigmoid + BCE when several actions co-occur (cooking + talking)—different from single softmax.
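A minimal multi-label loss sketch: independent per-class sigmoids with binary cross-entropy, instead of one softmax over mutually exclusive classes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(logits, targets, eps=1e-7):
    """Mean binary cross-entropy over independent per-class sigmoids.

    targets is a {0,1} matrix where multiple columns may be 1 per clip
    (e.g. cooking AND talking). eps clips probabilities for numerical
    stability in the log.
    """
    p = np.clip(sigmoid(logits), eps, 1 - eps)
    return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))

good = multilabel_bce(np.array([[10.0, -10.0]]), np.array([[1.0, 0.0]]))
bad = multilabel_bce(np.array([[10.0, -10.0]]), np.array([[0.0, 1.0]]))
```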
17 Weak supervision? 🔥 hard
Answer: Train with video tags only (MIL) or narration—reduces frame-level annotation cost.
18 Real-time? ⚡ easy
Answer: Smaller X3D, MobileNet-3D, or keyframe-only inference—latency budgets for edge cameras.
19 Transfer learning? 📊 medium
Answer: ImageNet → inflate → Kinetics finetune → downstream (surgery, sports)—standard recipe.
20 Metrics? ⚡ easy
Answer: Top-1/5 accuracy; mAP for detection/localization; sometimes calibrated confidence for safety apps.
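Top-k accuracy is a few lines (a sketch; function name mine): the prediction counts as correct if the true label is among the k highest-scoring classes.

```python
import numpy as np

def topk_accuracy(logits, labels, k=1):
    """Fraction of clips whose true label is in the top-k predicted classes."""
    topk = np.argsort(-logits, axis=1)[:, :k]  # k best class indices per clip
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

logits = np.array([[0.1, 0.9, 0.2],
                   [0.8, 0.1, 0.05]])
acc_top1 = topk_accuracy(logits, [1, 0], k=1)
acc_miss = topk_accuracy(logits, [2, 0], k=1)
```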

Action Recognition Cheat Sheet

Classic
  • Two-stream
  • TSN
3D
  • I3D
  • SlowFast
Data
  • Kinetics

💡 Pro tip: Appearance + motion; I3D inflation; SlowFast dual paths.
