Computer Vision Interview: 20 Essential Q&A (Updated 2026)

Video Processing: 20 Essential Q&A

Beyond single frames—time, motion cues, efficient decoding, and architectures that see short clips.

~11 min read · 20 questions · Intermediate
temporal · FPS · 3D conv · two-stream
1 How is video different from images? ⚡ easy
Answer: Adds a time axis—motion, temporal context, and much higher data rate; models must aggregate or align frames.
2 What is FPS? ⚡ easy
Answer: Frames per second—higher FPS preserves motion detail but increases compute; downsample for recognition if action is slow.
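The FPS trade-off above usually reduces to picking a frame stride. A minimal sketch (the function name and defaults are illustrative, not from any library):

```python
def sampling_stride(src_fps: float, target_fps: float) -> int:
    """Keep every k-th frame so the effective rate is roughly target_fps."""
    if target_fps >= src_fps:
        return 1  # skipping frames can only downsample; keep every frame
    return max(1, round(src_fps / target_fps))

# 30 fps source, 10 fps target for slow-action recognition → keep every 3rd frame
stride = sampling_stride(30, 10)  # → 3
```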
3 Temporal redundancy? 📊 medium
Answer: Adjacent frames are highly correlated—exploited by codecs (motion compensation) and by sparse sampling for recognition.
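Temporal redundancy is easy to measure directly: on a synthetic drifting "video", the mean absolute difference between adjacent frames is far smaller than between distant ones, which is exactly what motion compensation exploits.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "video": each 64×64 frame drifts slightly from the previous one
frames = [rng.random((64, 64))]
for _ in range(9):
    frames.append(np.clip(frames[-1] + 0.01 * rng.standard_normal((64, 64)), 0, 1))
frames = np.stack(frames)

adjacent = np.abs(frames[1] - frames[0]).mean()  # small: frames are correlated
distant = np.abs(frames[9] - frames[0]).mean()   # larger: drift accumulates
```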
4 Frame sampling strategies? 📊 medium
Answer: Uniform stride, random jitter, clip cropping, or dense every-frame sampling—trade temporal coverage vs GPU memory.
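Uniform and jittered sampling can be sketched in a few lines (segment-based sampling in the spirit of TSN; the function name and segment logic are illustrative):

```python
import numpy as np

def sample_indices(n_frames, n_samples, jitter=False, rng=None):
    """Pick n_samples frame indices: split the video into equal segments,
    take each segment's center (uniform) or a random index inside it (jitter)."""
    edges = np.linspace(0, n_frames, n_samples + 1)
    if jitter:
        rng = rng or np.random.default_rng()
        return np.array([rng.integers(int(lo), max(int(lo) + 1, int(hi)))
                         for lo, hi in zip(edges[:-1], edges[1:])])
    return ((edges[:-1] + edges[1:]) / 2).astype(int)

idx = sample_indices(300, 8)  # → [ 18  56  93 131 168 206 243 281]
```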
5 3D convolutions? 📊 medium
Answer: Kernel spans H×W×T—learns motion patterns directly; expensive vs 2D+temporal modules.
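A naive NumPy implementation makes the H×W×T claim concrete (loops for clarity, not speed; real models use cuDNN-backed `Conv3d`). A kernel that subtracts one frame window from the next responds to motion and ignores static appearance:

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive 'valid' 3D convolution over a (T, H, W) video with a
    (kt, kh, kw) kernel → output (T-kt+1, H-kh+1, W-kw+1)."""
    kt, kh, kw = kernel.shape
    T, H, W = video.shape
    out = np.empty((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = (video[t:t+kt, i:i+kh, j:j+kw] * kernel).sum()
    return out

# Temporal-difference kernel: +mean of next frame window, -mean of current one
kernel = np.zeros((2, 3, 3))
kernel[0], kernel[1] = -1/9, 1/9
```

On a static video this kernel outputs all zeros; any moving content produces a nonzero response.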
6 Two-stream architecture? 📊 medium
Answer: One stream on RGB (appearance), one on stacked optical flow (motion), fused late—the classic action-recognition baseline.
7 CNN + LSTM? 📊 medium
Answer: Per-frame CNN embeddings fed to LSTM/GRU to model sequence—flexible length, order-sensitive.
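The recurrent half of the pattern, sketched with a plain NumPy RNN as a stand-in for an LSTM/GRU (weights and dimensions are made up for illustration); note the final state depends on frame order:

```python
import numpy as np

def rnn_over_frames(frame_embeddings, Wx, Wh, b):
    """Vanilla RNN over per-frame CNN embeddings:
    h_t = tanh(Wx @ x_t + Wh @ h_{t-1} + b)."""
    h = np.zeros(Wh.shape[0])
    for x in frame_embeddings:      # order matters → sequence-sensitive
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h                        # final state summarizes the clip

rng = np.random.default_rng(0)
T, d, hdim = 16, 32, 64             # 16 frames, 32-d embeddings, 64-d state
emb = rng.standard_normal((T, d))   # stand-in for per-frame CNN features
Wx = rng.standard_normal((hdim, d))
Wh = 0.1 * rng.standard_normal((hdim, hdim))
h = rnn_over_frames(emb, Wx, Wh, np.zeros(hdim))   # shape (64,)
```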
8 Codec vs container? ⚡ easy
Answer: Codec (H.264/HEVC/AV1) compresses pixels; container (MP4/MKV) wraps streams and metadata—decode to raw frames for training.
9 Decode pipeline? 📊 medium
Answer: VideoCapture → seek/keyframe → decode to tensors—hardware decoders (NVDEC) critical for throughput at scale.
10 Background subtraction? 📊 medium
Answer: MOG2, KNN—foreground mask for surveillance; sensitive to lighting and camera jitter.
cap = cv2.VideoCapture("clip.mp4")  # requires `import cv2`
ret, frame = cap.read()  # ret is False once the stream ends; frame is a BGR ndarray
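The idea behind MOG2-style subtraction can be sketched without OpenCV: maintain a slowly updated background estimate and flag pixels that deviate from it (a running average is a deliberate simplification of the per-pixel Gaussian mixtures `cv2.createBackgroundSubtractorMOG2` actually fits; the learning rate and threshold are illustrative):

```python
import numpy as np

def foreground_masks(frames, lr=0.05, thresh=0.1):
    """Running-average background model: pixels far from the slowly
    updated background estimate are flagged as foreground."""
    bg = frames[0].astype(float)
    masks = []
    for f in frames[1:]:
        f = f.astype(float)
        masks.append(np.abs(f - bg) > thresh)  # foreground where change is large
        bg = (1 - lr) * bg + lr * f            # slowly absorb the scene
    return masks

# Static scene with a bright square that shifts right each frame
frames = [np.zeros((32, 32)) for _ in range(5)]
for t, f in enumerate(frames[1:], 1):
    f[10:14, 4 + 4*t : 8 + 4*t] = 1.0
masks = foreground_masks(frames)  # masks[0] is True only on the square
```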
11 Stabilization? 🔥 hard
Answer: Estimate global motion, warp frames to smooth path—used in consumer video; different from optical flow per pixel.
12 Tracking in video? 📊 medium
Answer: Detection per frame + association (SORT, DeepSORT)—links identities across time for analytics.
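The association step can be sketched as greedy IoU matching (SORT proper adds Kalman-filter motion prediction and Hungarian assignment; this greedy variant is a simplification):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter)
    return inter / union if union else 0.0

def associate(tracks, detections, min_iou=0.3):
    """Greedily link each track to its best-overlapping detection."""
    pairs = sorted(((iou(t, d), ti, di) for ti, t in enumerate(tracks)
                    for di, d in enumerate(detections)), reverse=True)
    used_t, used_d, out = set(), set(), []
    for score, ti, di in pairs:
        if score >= min_iou and ti not in used_t and di not in used_d:
            out.append((ti, di)); used_t.add(ti); used_d.add(di)
    return out

matches = associate([(0, 0, 10, 10), (50, 50, 60, 60)],
                    [(52, 51, 61, 60), (1, 0, 11, 10)])  # → [(0, 1), (1, 0)]
```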
13 Long videos? 🔥 hard
Answer: Clip sampling, hierarchical pooling, memory transformers, or sparse attention—full-length movies don’t fit one forward pass.
14 Augmentation? 📊 medium
Answer: Temporal jitter, random clip length, mixup on clips, cutout—must preserve label semantics (temporal order for actions).
15 Example datasets? ⚡ easy
Answer: Kinetics, Something-Something, UCF101, AVA—vary in duration, labels, and multi-label structure.
16 Real-time? 📊 medium
Answer: Batch size 1, TensorRT, smaller backbone, ROI processing—latency budget drives architecture choice.
17 SlowFast pathways? 🔥 hard
Answer: Slow branch: low frame rate, high channel capacity; Fast branch: high FPS, lightweight—captures semantics and motion efficiently.
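The two pathways amount to sampling the same clip at two rates; a sketch of the index selection (strides and the α ratio here are illustrative, not the paper's exact config):

```python
def slowfast_indices(n_frames, alpha=4, slow_stride=16):
    """Frame indices for the two pathways: the Fast path samples
    alpha× more densely than the Slow path."""
    slow = list(range(0, n_frames, slow_stride))           # sparse, semantic
    fast = list(range(0, n_frames, slow_stride // alpha))  # dense, motion
    return slow, fast

slow, fast = slowfast_indices(64)  # 4 slow frames, 16 fast frames
```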
18 Video ViT? 🔥 hard
Answer: Patchify across space-time or factorized space/time attention—scalable with large data (TimeSformer, Video Swin).
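Space-time patchification is just a reshape: carve the clip into "tubelet" tokens spanning a few frames and a spatial patch (patch sizes here are illustrative; real models then project each token to the embedding dimension):

```python
import numpy as np

def tubelet_patchify(video, pt=2, ph=16, pw=16):
    """Split a (T, H, W, C) video into flattened pt×ph×pw space-time
    tokens — the first step of tubelet-based video ViTs."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)      # group the patch dims together
    return x.reshape(-1, pt * ph * pw * C)    # (num_tokens, token_dim)

tokens = tubelet_patchify(np.zeros((8, 64, 64, 3)))  # (64 tokens, 1536 dims)
```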
19 Anomaly detection? 📊 medium
Answer: Reconstruction or prediction error in latent space across frames—surveillance and manufacturing QC.
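The scoring logic can be shown with the crudest possible predictor, "copy the last frame" (real systems use an autoencoder or a future-frame prediction network, but the principle is identical: large error → anomaly):

```python
import numpy as np

def anomaly_scores(frames):
    """Per-frame error of the naive copy-last-frame predictor."""
    frames = np.asarray(frames, dtype=float)
    return np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2))

# Smooth sequence with one abrupt frame
frames = [np.full((16, 16), 0.5) for _ in range(6)]
frames[3] = np.ones((16, 16))        # sudden bright frame = anomaly
scores = anomaly_scores(frames)      # peaks at the transition into frame 3
```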
20 Deployment? ⚡ easy
Answer: Stream ingestion (RTSP), frame sync, GPU decode, model serving with batching policies for variable FPS.

Video Cheat Sheet

Time
  • Sampling
  • Redundancy
Models
  • 3D conv
  • Two-stream
IO
  • Decode path

💡 Pro tip: Always mention time axis, sampling, and memory tradeoffs.

Full tutorial track

Go deeper with the matching tutorial chapter and code examples.