Video Processing: 20 Essential Q&A
Beyond single frames—time, motion cues, efficient decoding, and architectures that see short clips.
~11 min read
20 questions
Intermediate
Updated 2026
Tags: temporal, FPS, 3D conv, two-stream
1
How is video different from images?
⚡ easy
Answer: Adds a time axis—motion, temporal context, and much higher data rate; models must aggregate or align frames.
2
What is FPS?
⚡ easy
Answer: Frames per second—higher FPS preserves motion detail but increases compute; downsample for recognition if action is slow.
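A minimal sketch of FPS-aware downsampling with OpenCV; the file path and the 8 FPS budget are placeholder assumptions.
import cv2
cap = cv2.VideoCapture("clip.mp4")               # placeholder path
native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # some containers report 0; fall back
stride = max(1, round(native_fps / 8))           # keep roughly 8 FPS for recognition
frames, idx = [], 0
ret, frame = cap.read()
while ret:
    if idx % stride == 0:
        frames.append(frame)                     # keep every stride-th frame
    idx += 1
    ret, frame = cap.read()
cap.release()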
3
Temporal redundancy?
📊 medium
Answer: Adjacent frames are highly correlated—exploited by codecs (motion compensation) and by sparse sampling for recognition.
4
Frame sampling strategies?
📊 medium
Answer: Uniform stride, random jitter, clip cropping, or dense every-frame sampling—trade temporal coverage against GPU memory.
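A minimal sketch of two common index-sampling schemes (uniform stride vs per-segment random jitter); N and T are illustrative values.
import numpy as np
N, T = 300, 8                                                  # clip length and frames to keep (illustrative)
uniform = np.linspace(0, N - 1, T).astype(int)                 # uniform stride across the clip
segments = np.array_split(np.arange(N), T)                     # split the clip into T segments
jittered = np.array([np.random.choice(s) for s in segments])   # one random frame per segment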
5
3D convolutions?
📊 medium
Answer: Kernel spans H×W×T—learns motion patterns directly; expensive vs 2D+temporal modules.
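A minimal PyTorch sketch; the clip shape and kernel sizes are illustrative, not prescribed.
import torch
import torch.nn as nn
clip = torch.randn(2, 3, 16, 112, 112)            # (batch, channels, T, H, W): 16-frame RGB clips
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3))
features = conv3d(clip)                           # kernel spans T as well as H, W -> (2, 64, 16, 56, 56)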
6
Two-stream architecture?
📊 medium
Answer: One stream on RGB (appearance), one on stacked optical flow (motion)—late fusion of the two; a classic action-recognition baseline.
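A minimal sketch of what feeds the motion stream: dense optical flow between consecutive frames. The file path is a placeholder and the Farneback parameters are common defaults, not values from the original paper.
import cv2
cap = cv2.VideoCapture("clip.mp4")                # placeholder path
_, f0 = cap.read()
_, f1 = cap.read()
prev = cv2.cvtColor(f0, cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(f1, cv2.COLOR_BGR2GRAY)
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
# flow has shape (H, W, 2); stacking several such fields forms the motion-stream input
cap.release()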
7
CNN + LSTM?
📊 medium
Answer: Per-frame CNN embeddings fed to LSTM/GRU to model sequence—flexible length, order-sensitive.
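A minimal PyTorch sketch of the pattern, with a shared per-frame CNN followed by an LSTM over the embeddings; the toy backbone and all sizes are illustrative.
import torch
import torch.nn as nn
class CNNLSTM(nn.Module):
    def __init__(self, embed_dim=128, hidden=256, num_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim))
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)
    def forward(self, clip):                            # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)   # shared CNN on every frame
        out, _ = self.lstm(feats)                       # sequence model over frame embeddings
        return self.head(out[:, -1])                    # classify from the final hidden state
logits = CNNLSTM()(torch.randn(2, 8, 3, 64, 64))        # -> (2, 10)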
8
Codec vs container?
⚡ easy
Answer: Codec (H.264/HEVC/AV1) compresses pixels; container (MP4/MKV) wraps streams and metadata—decode to raw frames for training.
9
Decode pipeline?
📊 medium
Answer: VideoCapture → seek/keyframe → decode to tensors—hardware decoders (NVDEC) critical for throughput at scale.
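A minimal OpenCV sketch of seek-then-decode; exact seek accuracy depends on the backend and keyframe spacing, and the path and frame index are placeholders.
import cv2
cap = cv2.VideoCapture("clip.mp4")                # placeholder path
cap.set(cv2.CAP_PROP_POS_FRAMES, 120)             # backend seeks near the requested frame (keyframe-dependent)
ret, frame = cap.read()                           # decode from there; convert to tensors downstream
cap.release()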
10
Background subtraction?
📊 medium
Answer: MOG2, KNN—foreground mask for surveillance; sensitive to lighting and camera jitter.
import cv2
cap = cv2.VideoCapture("clip.mp4")   # open the container and its video stream
ret, frame = cap.read()              # ret is False once no frame can be decoded
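Building on that snippet, a minimal sketch of the MOG2 route; the parameter values shown are OpenCV defaults, used here as placeholders.
import cv2
cap = cv2.VideoCapture("clip.mp4")
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    fg_mask = subtractor.apply(frame)            # 255 = foreground, 127 = shadow, 0 = background
cap.release()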
11
Stabilization?
🔥 hard
Answer: Estimate global camera motion and warp frames onto a smoothed path—used in consumer video; distinct from per-pixel optical flow.
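A minimal two-frame sketch of the idea (track corners, fit a global similarity transform, warp); a real stabilizer also smooths the transform trajectory over time, and the path and parameters here are placeholders.
import cv2
cap = cv2.VideoCapture("clip.mp4")                # placeholder path
_, f0 = cap.read()
_, f1 = cap.read()
g0 = cv2.cvtColor(f0, cv2.COLOR_BGR2GRAY)
g1 = cv2.cvtColor(f1, cv2.COLOR_BGR2GRAY)
pts0 = cv2.goodFeaturesToTrack(g0, maxCorners=200, qualityLevel=0.01, minDistance=20)
pts1, status, _ = cv2.calcOpticalFlowPyrLK(g0, g1, pts0, None)
good0, good1 = pts0[status.flatten() == 1], pts1[status.flatten() == 1]
M, _ = cv2.estimateAffinePartial2D(good1, good0)  # global motion from frame 1 back to frame 0
aligned = cv2.warpAffine(f1, M, (f1.shape[1], f1.shape[0]))  # warp frame 1 toward the reference
cap.release()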
12
Tracking in video?
📊 medium
Answer: Detection per frame + association (SORT, DeepSORT)—links identities across time for analytics.
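A minimal sketch of the association step only, using greedy IoU matching on made-up boxes; SORT proper adds a Kalman motion model and Hungarian assignment, and DeepSORT adds appearance features.
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
tracks = {1: [100, 100, 150, 180]}                # id -> last box, made-up values
detections = [[104, 102, 156, 183], [300, 40, 340, 90]]
for det in detections:
    best = max(tracks, key=lambda tid: iou(tracks[tid], det), default=None)
    if best is not None and iou(tracks[best], det) > 0.3:
        tracks[best] = det                        # extend an existing identity
    else:
        tracks[max(tracks, default=0) + 1] = det  # spawn a new identity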
13
Long videos?
🔥 hard
Answer: Clip sampling, hierarchical pooling, memory transformers, or sparse attention—full-length movies don’t fit one forward pass.
14
Augmentation?
📊 medium
Answer: Temporal jitter, random clip length, mixup on clips, cutout—must preserve label semantics (temporal order for actions).
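A minimal sketch of temporal jitter as index sampling; clip_len and the frame count are illustrative.
import numpy as np
def sample_clip(num_frames, clip_len=16, jitter=True):
    start_max = max(0, num_frames - clip_len)
    start = np.random.randint(0, start_max + 1) if jitter else start_max // 2
    return np.arange(start, start + clip_len).clip(0, num_frames - 1)
train_idx = sample_clip(300)                  # random temporal crop for training
val_idx = sample_clip(300, jitter=False)      # deterministic center crop for evaluation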
15
Example datasets?
⚡ easy
Answer: Kinetics, Something-Something, UCF101, AVA—vary in duration, labels, and multi-label structure.
16
Real-time?
📊 medium
Answer: Batch size 1, TensorRT, smaller backbone, ROI processing—latency budget drives architecture choice.
17
SlowFast pathways?
🔥 hard
Answer: Slow pathway: low frame rate, high channel capacity; Fast pathway: high frame rate, lightweight—together they capture semantics and motion efficiently.
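A minimal sketch of the dual-rate sampling idea only (the two pathways and their lateral connections are omitted); the ratio alpha is illustrative.
import numpy as np
num_frames, alpha = 64, 8                  # alpha = fast/slow frame-rate ratio (illustrative)
fast_idx = np.arange(0, num_frames, 2)     # dense sampling for the lightweight Fast pathway
slow_idx = fast_idx[::alpha]               # 1/alpha of those frames for the heavy Slow pathway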
18
Video ViT?
🔥 hard
Answer: Patchify across space-time or factorized space/time attention—scalable with large data (TimeSformer, Video Swin).
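A minimal PyTorch sketch of tubelet patchification into space-time tokens; the clip shape, patch size, and embedding width are illustrative.
import torch
import torch.nn as nn
clip = torch.randn(1, 3, 16, 224, 224)                  # (B, C, T, H, W)
tubelet = nn.Conv3d(3, 768, kernel_size=(2, 16, 16), stride=(2, 16, 16))
tokens = tubelet(clip).flatten(2).transpose(1, 2)       # -> (1, 8*14*14, 768) space-time tokens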
19
Anomaly detection?
📊 medium
Answer: Reconstruction or prediction error in latent space across frames—surveillance and manufacturing QC.
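A minimal sketch of the scoring step, assuming a hypothetical autoencoder already trained on normal footage; high per-frame error flags anomalies.
import torch
def anomaly_scores(frames, autoencoder):
    # frames: (T, C, H, W) tensor; autoencoder: hypothetical model trained on normal footage
    with torch.no_grad():
        recon = autoencoder(frames)
    return ((frames - recon) ** 2).mean(dim=(1, 2, 3))  # per-frame MSE; larger = more anomalous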
20
Deployment?
⚡ easy
Answer: Stream ingestion (RTSP), frame sync, GPU decode, model serving with batching policies for variable FPS.
Video Cheat Sheet
Time
- Sampling
- Redundancy
Models
- 3D conv
- Two-stream
IO
- Decode path
💡 Pro tip: Always mention time axis, sampling, and memory tradeoffs.
Full tutorial track
Go deeper with the matching tutorial chapter and code examples.