Action Recognition MCQ 15 Questions
Time: ~25 mins, Level: Advanced

Classify what is happening in a clip by combining appearance and motion cues over time.

Easy: 5 Q, Medium: 6 Q, Hard: 4 Q
Human action

Classes

Clip

Spatiotemporal

Two-stream

RGB + flow

I3D

Inflated 3D

Actions in video

Action recognition assigns a label (e.g. diving, waving) to a short video clip. Methods range from per-frame CNNs with temporal pooling, through two-stream networks that fuse RGB and optical-flow predictions and 3D convolutional networks (C3D, I3D), to transformers over space-time tokens. Large labeled datasets such as Kinetics make supervised pretraining practical.

Why motion matters

Static frames can be ambiguous; temporal patterns distinguish many action classes.
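
A toy illustration of this point, assuming made-up per-frame feature vectors (the random features and the names `clip` and `reversed_clip` are hypothetical): a clip and its time-reversed copy, such as a door opening versus closing, contain the same frames, so an order-agnostic average cannot tell them apart, while a simple frame-to-frame difference can.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame feature vectors for an 8-frame clip (T=8, D=4).
clip = rng.normal(size=(8, 4))
reversed_clip = clip[::-1]  # same frames, opposite temporal order

# Order-agnostic summary: averaging over time discards frame order.
same_appearance = np.allclose(clip.mean(axis=0), reversed_clip.mean(axis=0))

# Order-aware summary: the mean frame-to-frame difference flips sign on reversal.
motion = np.diff(clip, axis=0).mean(axis=0)
motion_reversed = np.diff(reversed_clip, axis=0).mean(axis=0)
opposite_motion = np.allclose(motion, -motion_reversed)

print(same_appearance, opposite_motion)  # True True
```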

Key ideas

Clip input

A fixed-length segment of frames sampled from a longer video.
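
Clip sampling can be sketched as picking a start frame and then taking evenly spaced indices. The function name `sample_clip_indices` and its defaults (16 frames, stride 2) are illustrative assumptions, not values prescribed by any particular model.

```python
import numpy as np

def sample_clip_indices(num_frames, clip_len=16, stride=2, start=None, rng=None):
    """Return clip_len frame indices spaced by stride, clamped to the video."""
    span = (clip_len - 1) * stride + 1  # number of frames one clip covers
    if start is None:
        if rng is None:
            rng = np.random.default_rng()
        max_start = max(num_frames - span, 0)
        start = int(rng.integers(0, max_start + 1))  # random start (training-style)
    idx = start + stride * np.arange(clip_len)
    # Short videos: repeat the last frame instead of indexing out of range.
    return np.minimum(idx, num_frames - 1)

print(sample_clip_indices(300, clip_len=4, stride=2, start=10))  # [10 12 14 16]
print(sample_clip_indices(5, clip_len=4, stride=2, start=0))     # [0 2 4 4]
```

Passing a fixed `start` gives deterministic clips for evaluation; leaving it as `None` draws a random start, which is one common way to augment training data.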

Two-stream

Separate networks process appearance (RGB frames) and motion (optical flow); their outputs are then fused.
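
One simple way to fuse the two streams is late fusion: convert each stream's logits to class probabilities and take a weighted average. The sketch below assumes that scheme; the logits, the `late_fusion` helper, and the `flow_weight` parameter are made up for illustration, and other fusion strategies exist.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(rgb_logits, flow_logits, flow_weight=0.5):
    """Weighted average of per-stream class probabilities."""
    p_rgb = softmax(np.asarray(rgb_logits, dtype=float))
    p_flow = softmax(np.asarray(flow_logits, dtype=float))
    return (1.0 - flow_weight) * p_rgb + flow_weight * p_flow

# Hypothetical logits for 3 classes: the RGB stream is undecided between
# classes 0 and 1, while the flow stream clearly favours class 1.
rgb = [2.0, 2.0, 0.0]
flow = [0.0, 3.0, 0.0]
fused = late_fusion(rgb, flow)
print(fused.argmax())  # 1
```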

3D CNN

Convolution kernels span both space and time, so the filters can learn motion patterns directly from stacked frames.
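
To make the idea of a spatiotemporal filter concrete, here is a naive NumPy sketch. The kernel is set by hand to a temporal difference (a trained 3D CNN would learn its kernels), and the synthetic `static` and `moving` clips are illustrative assumptions. The filter gives no response to a square that stays put and a nonzero response to one that moves.

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D cross-correlation over a (T, H, W) clip."""
    kt, kh, kw = kernel.shape
    T, H, W = clip.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[t, y, x] = np.sum(clip[t:t + kt, y:y + kh, x:x + kw] * kernel)
    return out

# Hand-set kernel: (next frame) minus (current frame), summed over a 3x3 window.
kernel = np.zeros((2, 3, 3))
kernel[0] = -1.0
kernel[1] = 1.0

static = np.zeros((4, 6, 6))
static[:, 2:4, 2:4] = 1.0            # square stays in place

moving = np.zeros((4, 6, 6))
for t in range(4):
    moving[t, 2:4, t:t + 2] = 1.0    # square shifts right one pixel per frame

static_resp = np.abs(conv3d_valid(static, kernel)).max()
moving_resp = np.abs(conv3d_valid(moving, kernel)).max()
print(static_resp, moving_resp)
```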

Kinetics

A large-scale dataset of labeled clips, widely used for pretraining.

Typical head

backbone features → temporal aggregate → softmax over action classes
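
The pipeline above can be sketched in a few lines, assuming the backbone has already produced one feature vector per frame and that the temporal aggregate is a plain mean. The `action_head` function, the shapes, and the random weights are placeholders for illustration.

```python
import numpy as np

def action_head(frame_features, weights, bias):
    """frame_features: (T, D); weights: (D, C); bias: (C,). Returns (C,) probabilities."""
    clip_feature = frame_features.mean(axis=0)   # temporal aggregate (mean pooling)
    logits = clip_feature @ weights + bias       # linear classifier
    logits = logits - logits.max()               # stabilise the exponent
    probs = np.exp(logits)
    return probs / probs.sum()                   # softmax over action classes

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 32))    # stand-in for backbone output: 16 frames, 32-dim
W = rng.normal(size=(32, 5)) * 0.1   # 5 hypothetical action classes
b = np.zeros(5)
probs = action_head(feats, W, b)
print(probs.shape, probs.argmax())
```

Mean pooling is only one choice of aggregate; max pooling, recurrent layers, or attention can take its place without changing the rest of the head.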

Pro tip: Use test-time augmentation. Averaging predictions over multiple crops and scales of a clip typically improves robustness.
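
A minimal sketch of multi-crop test-time augmentation, assuming predictions are combined by averaging class probabilities. `toy_model` is a hypothetical stand-in for a trained network, and `three_crops` is an illustrative helper, not a library function.

```python
import numpy as np

def three_crops(clip, size):
    """Left, center and right crops of width `size` from a (T, H, W) clip."""
    width = clip.shape[2]
    starts = [0, (width - size) // 2, width - size]
    return [clip[:, :, s:s + size] for s in starts]

def toy_model(view):
    # Stand-in classifier (hypothetical): two-class scores from mean brightness.
    m = view.mean()
    logits = np.array([m, 1.0 - m])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def predict_with_tta(model, views):
    """Average class probabilities over several views of the same video."""
    probs = np.stack([model(v) for v in views])
    return probs.mean(axis=0)

rng = np.random.default_rng(0)
clip = rng.random(size=(8, 32, 48))       # synthetic (T, H, W) clip
views = three_crops(clip, size=32)
averaged = predict_with_tta(toy_model, views)
print(len(views), averaged.shape)  # 3 (2,)
```

The same `predict_with_tta` call works for multi-scale evaluation: resize the clip to several resolutions and pass those views instead of, or alongside, the crops.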