Computer Vision Interview: 20 Essential Q&A (Updated 2026)

Video Processing: 20 Essential Q&A

Beyond single frames—time, motion cues, efficient decoding, and architectures that see short clips.

~11 min read · 20 questions · Intermediate
temporal · FPS · 3D conv · two-stream
1 How is video different from images? ⚡ easy
Answer: Adds a time axis—motion, temporal context, and much higher data rate; models must aggregate or align frames.
2 What is FPS? ⚡ easy
Answer: Frames per second—higher FPS preserves motion detail but increases compute; downsample for recognition if action is slow.
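The FPS trade-off above usually reduces to picking a frame stride. A minimal sketch (the function name and defaults are illustrative, not from any library):

```python
def sampling_stride(src_fps: float, target_fps: float) -> int:
    """Keep every k-th frame so the effective rate is roughly target_fps."""
    if target_fps >= src_fps:
        return 1  # skipping frames can only downsample; keep every frame
    return max(1, round(src_fps / target_fps))

# 30 fps source, 10 fps target for slow-action recognition → keep every 3rd frame
stride = sampling_stride(30, 10)  # → 3
```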
3 Temporal redundancy? 📊 medium
Answer: Adjacent frames are highly correlated—exploited by codecs (motion compensation) and by sparse sampling for recognition.
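Temporal redundancy is easy to measure directly: on a synthetic drifting "video", the mean absolute difference between adjacent frames is far smaller than between distant ones, which is exactly what motion compensation exploits.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "video": each 64×64 frame drifts slightly from the previous one
frames = [rng.random((64, 64))]
for _ in range(9):
    frames.append(np.clip(frames[-1] + 0.01 * rng.standard_normal((64, 64)), 0, 1))
frames = np.stack(frames)

adjacent = np.abs(frames[1] - frames[0]).mean()  # small: frames are correlated
distant = np.abs(frames[9] - frames[0]).mean()   # larger: drift accumulates
```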
4 Frame sampling strategies? 📊 medium
Answer: Uniform stride, random jitter, clip cropping, or dense every-frame sampling—trade temporal coverage vs GPU memory.
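Uniform and jittered sampling can be sketched in a few lines (segment-based sampling in the spirit of TSN; the function name and segment logic are illustrative):

```python
import numpy as np

def sample_indices(n_frames, n_samples, jitter=False, rng=None):
    """Pick n_samples frame indices: split the video into equal segments,
    take each segment's center (uniform) or a random index inside it (jitter)."""
    edges = np.linspace(0, n_frames, n_samples + 1)
    if jitter:
        rng = rng or np.random.default_rng()
        return np.array([rng.integers(int(lo), max(int(lo) + 1, int(hi)))
                         for lo, hi in zip(edges[:-1], edges[1:])])
    return ((edges[:-1] + edges[1:]) / 2).astype(int)

idx = sample_indices(300, 8)  # → [ 18  56  93 131 168 206 243 281]
```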
5 3D convolutions? 📊 medium
Answer: Kernel spans H×W×T—learns motion patterns directly; expensive vs 2D+temporal modules.
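A naive NumPy implementation makes the H×W×T claim concrete (loops for clarity, not speed; real models use cuDNN-backed `Conv3d`). A kernel that subtracts one frame window from the next responds to motion and ignores static appearance:

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive 'valid' 3D convolution over a (T, H, W) video with a
    (kt, kh, kw) kernel → output (T-kt+1, H-kh+1, W-kw+1)."""
    kt, kh, kw = kernel.shape
    T, H, W = video.shape
    out = np.empty((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = (video[t:t+kt, i:i+kh, j:j+kw] * kernel).sum()
    return out

# Temporal-difference kernel: +mean of next frame window, -mean of current one
kernel = np.zeros((2, 3, 3))
kernel[0], kernel[1] = -1/9, 1/9
```

On a static video this kernel outputs all zeros; any moving content produces a nonzero response.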
6 Two-stream architecture? 📊 medium
Answer: One stream on RGB (appearance), one on stacked optical flow (motion), fused late—the classic action-recognition baseline.
7 CNN + LSTM? 📊 medium
Answer: Per-frame CNN embeddings fed to LSTM/GRU to model sequence—flexible length, order-sensitive.
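The recurrent half of the pattern, sketched with a plain NumPy RNN as a stand-in for an LSTM/GRU (weights and dimensions are made up for illustration); note the final state depends on frame order:

```python
import numpy as np

def rnn_over_frames(frame_embeddings, Wx, Wh, b):
    """Vanilla RNN over per-frame CNN embeddings:
    h_t = tanh(Wx @ x_t + Wh @ h_{t-1} + b)."""
    h = np.zeros(Wh.shape[0])
    for x in frame_embeddings:      # order matters → sequence-sensitive
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h                        # final state summarizes the clip

rng = np.random.default_rng(0)
T, d, hdim = 16, 32, 64             # 16 frames, 32-d embeddings, 64-d state
emb = rng.standard_normal((T, d))   # stand-in for per-frame CNN features
Wx = rng.standard_normal((hdim, d))
Wh = 0.1 * rng.standard_normal((hdim, hdim))
h = rnn_over_frames(emb, Wx, Wh, np.zeros(hdim))   # shape (64,)
```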
8 Codec vs container? ⚡ easy
Answer: Codec (H.264/HEVC/AV1) compresses pixels; container (MP4/MKV) wraps streams and metadata—decode to raw frames for training.
9 Decode pipeline? 📊 medium
Answer: VideoCapture → seek/keyframe → decode to tensors—hardware decoders (NVDEC) critical for throughput at scale.
10 Background subtraction? 📊 medium
Answer: MOG2, KNN—foreground mask for surveillance; sensitive to lighting and camera jitter.
cap = cv2.VideoCapture("clip.mp4")  # requires `import cv2`
ret, frame = cap.read()  # ret is False once the stream ends; frame is a BGR ndarray
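The idea behind MOG2-style subtraction can be sketched without OpenCV: maintain a slowly updated background estimate and flag pixels that deviate from it (a running average is a deliberate simplification of the per-pixel Gaussian mixtures `cv2.createBackgroundSubtractorMOG2` actually fits; the learning rate and threshold are illustrative):

```python
import numpy as np

def foreground_masks(frames, lr=0.05, thresh=0.1):
    """Running-average background model: pixels far from the slowly
    updated background estimate are flagged as foreground."""
    bg = frames[0].astype(float)
    masks = []
    for f in frames[1:]:
        f = f.astype(float)
        masks.append(np.abs(f - bg) > thresh)  # foreground where change is large
        bg = (1 - lr) * bg + lr * f            # slowly absorb the scene
    return masks

# Static scene with a bright square that shifts right each frame
frames = [np.zeros((32, 32)) for _ in range(5)]
for t, f in enumerate(frames[1:], 1):
    f[10:14, 4 + 4*t : 8 + 4*t] = 1.0
masks = foreground_masks(frames)  # masks[0] is True only on the square
```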
11 Stabilization? 🔥 hard
Answer: Estimate global motion, warp frames to smooth path—used in consumer video; different from optical flow per pixel.
12 Tracking in video? 📊 medium
Answer: Detection per frame + association (SORT, DeepSORT)—links identities across time for analytics.
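The association step can be sketched as greedy IoU matching (SORT proper adds Kalman-filter motion prediction and Hungarian assignment; this greedy variant is a simplification):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter)
    return inter / union if union else 0.0

def associate(tracks, detections, min_iou=0.3):
    """Greedily link each track to its best-overlapping detection."""
    pairs = sorted(((iou(t, d), ti, di) for ti, t in enumerate(tracks)
                    for di, d in enumerate(detections)), reverse=True)
    used_t, used_d, out = set(), set(), []
    for score, ti, di in pairs:
        if score >= min_iou and ti not in used_t and di not in used_d:
            out.append((ti, di)); used_t.add(ti); used_d.add(di)
    return out

matches = associate([(0, 0, 10, 10), (50, 50, 60, 60)],
                    [(52, 51, 61, 60), (1, 0, 11, 10)])  # → [(0, 1), (1, 0)]
```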
13 Long videos? 🔥 hard
Answer: Clip sampling, hierarchical pooling, memory transformers, or sparse attention—full-length movies don’t fit one forward pass.
14 Augmentation? 📊 medium
Answer: Temporal jitter, random clip length, mixup on clips, cutout—must preserve label semantics (temporal order for actions).
15 Example datasets? ⚡ easy
Answer: Kinetics, Something-Something, UCF101, AVA—vary in duration, labels, and multi-label structure.
16 Real-time? 📊 medium
Answer: Batch size 1, TensorRT, smaller backbone, ROI processing—latency budget drives architecture choice.
17 SlowFast pathways? 🔥 hard
Answer: Slow branch: low frame rate, high channel capacity; Fast branch: high FPS, lightweight—captures semantics and motion efficiently.
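The two pathways amount to sampling the same clip at two rates; a sketch of the index selection (strides and the α ratio here are illustrative, not the paper's exact config):

```python
def slowfast_indices(n_frames, alpha=4, slow_stride=16):
    """Frame indices for the two pathways: the Fast path samples
    alpha× more densely than the Slow path."""
    slow = list(range(0, n_frames, slow_stride))           # sparse, semantic
    fast = list(range(0, n_frames, slow_stride // alpha))  # dense, motion
    return slow, fast

slow, fast = slowfast_indices(64)  # 4 slow frames, 16 fast frames
```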
18 Video ViT? 🔥 hard
Answer: Patchify across space-time or factorized space/time attention—scalable with large data (TimeSformer, Video Swin).
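Space-time patchification is just a reshape: carve the clip into "tubelet" tokens spanning a few frames and a spatial patch (patch sizes here are illustrative; real models then project each token to the embedding dimension):

```python
import numpy as np

def tubelet_patchify(video, pt=2, ph=16, pw=16):
    """Split a (T, H, W, C) video into flattened pt×ph×pw space-time
    tokens — the first step of tubelet-based video ViTs."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)      # group the patch dims together
    return x.reshape(-1, pt * ph * pw * C)    # (num_tokens, token_dim)

tokens = tubelet_patchify(np.zeros((8, 64, 64, 3)))  # (64 tokens, 1536 dims)
```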
19 Anomaly detection? 📊 medium
Answer: Reconstruction or prediction error in latent space across frames—surveillance and manufacturing QC.
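The scoring logic can be shown with the crudest possible predictor, "copy the last frame" (real systems use an autoencoder or a future-frame prediction network, but the principle is identical: large error → anomaly):

```python
import numpy as np

def anomaly_scores(frames):
    """Per-frame error of the naive copy-last-frame predictor."""
    frames = np.asarray(frames, dtype=float)
    return np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2))

# Smooth sequence with one abrupt frame
frames = [np.full((16, 16), 0.5) for _ in range(6)]
frames[3] = np.ones((16, 16))        # sudden bright frame = anomaly
scores = anomaly_scores(frames)      # peaks at the transition into frame 3
```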
20 Deployment? ⚡ easy
Answer: Stream ingestion (RTSP), frame sync, GPU decode, model serving with batching policies for variable FPS.

Video Cheat Sheet

Time
  • Sampling
  • Redundancy
Models
  • 3D conv
  • Two-stream
IO
  • Decode path

💡 Pro tip: Always mention time axis, sampling, and memory tradeoffs.

Full tutorial track

Go deeper with the matching tutorial chapter and code examples.