Computer Vision Chapter 36

Video processing

A video is a time-ordered sequence of frames (2D images). FPS (frames per second), resolution (width × height), and codec define how pixels are stored and decoded. Pipelines either run per frame (object detection) or model time explicitly (optical flow, 3D CNNs, transformers). OpenCV’s VideoCapture is the usual entry point for files and cameras; VideoWriter encodes output. Below: metadata, seeking, writing, and optional torchvision loading for tensor workflows.

OpenCV: read a file

import cv2

cap = cv2.VideoCapture("clip.mp4")
if not cap.isOpened():
    raise RuntimeError("cannot open video")

fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
print(fps, w, h, n)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # frame: BGR uint8, shape (h, w, 3)

cap.release()

Always call release() (or use a context-style pattern) so file handles and camera devices are freed.
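One context-style pattern is a small wrapper that guarantees release() on exit even if an exception is raised mid-loop. A minimal sketch; the `releasing` helper is mine, not an OpenCV API:

```python
from contextlib import contextmanager

@contextmanager
def releasing(resource):
    """Yield any object exposing .release() and release it on exit.
    Works for cv2.VideoCapture and cv2.VideoWriter alike."""
    try:
        yield resource
    finally:
        resource.release()

# usage (sketch):
# with releasing(cv2.VideoCapture("clip.mp4")) as cap:
#     ok, frame = cap.read()
```

Because it only relies on a `.release()` method, the same wrapper covers cameras, file readers, and writers.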

Frame index and seek

cap.set(cv2.CAP_PROP_POS_FRAMES, 120)  # jump to frame 120
ok, frame = cap.read()
ms = cap.get(cv2.CAP_PROP_POS_MSEC)     # position in ms (if available)

Seek accuracy depends on codec and container: many backends seek only to the nearest keyframe, so the decoded frame may not be exactly the one requested. For frame-accurate access, seek to an earlier keyframe and read forward to the target index.
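Converting between timestamps and frame indices comes up constantly when seeking; a minimal sketch (helper names are mine, not OpenCV's):

```python
def time_to_frame(seconds: float, fps: float) -> int:
    """Nearest frame index for a timestamp."""
    return int(round(seconds * fps))

def frame_to_msec(index: int, fps: float) -> float:
    """Timestamp in milliseconds of a frame index."""
    return index * 1000.0 / fps

# e.g. at 25 fps, t = 4.8 s corresponds to frame 120:
# cap.set(cv2.CAP_PROP_POS_FRAMES, time_to_frame(4.8, 25.0))
```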

Webcam and backend

cap = cv2.VideoCapture(0, cv2.CAP_DSHOW)  # Windows: DirectShow backend (optional)
if not cap.isOpened():
    raise RuntimeError("camera not available")
# resolution requests are best-effort; the driver may ignore them
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

Write MP4 (example)

fourcc = cv2.VideoWriter_fourcc(*"mp4v")
out = cv2.VideoWriter("out.mp4", fourcc, 30.0, (640, 480))  # (width, height)
if not out.isOpened():
    raise RuntimeError("cannot open writer (codec unsupported?)")
# ... for each BGR frame, sized exactly 640x480:
# out.write(frame)
out.release()

Codec fourcc must match what your OpenCV build supports; on some systems avc1 or H264 works better than mp4v.
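A FOURCC is nothing more than four ASCII characters packed little-endian into a 32-bit integer; a pure-Python sketch of the packing that cv2.VideoWriter_fourcc performs:

```python
def fourcc(code: str) -> int:
    """Pack a 4-character codec tag into the 32-bit int VideoWriter expects."""
    assert len(code) == 4
    return sum(ord(c) << (8 * i) for i, c in enumerate(code))

def fourcc_to_str(value: int) -> str:
    """Decode the tag back, e.g. to inspect cap.get(cv2.CAP_PROP_FOURCC)."""
    return "".join(chr((value >> (8 * i)) & 0xFF) for i in range(4))
```

Decoding CAP_PROP_FOURCC this way is a quick check of which codec a file was actually opened with.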

torchvision: read_video

from torchvision.io import read_video

video, audio, info = read_video("clip.mp4", start_pts=0, end_pts=4, pts_unit="sec")
# video: (T, H, W, C) uint8 in RGB order
print(video.shape, info)

Useful for loading short training clips. Note that read_video decodes the whole requested range into memory; for long files prefer a streaming decoder (e.g. torchvision.io.VideoReader) that yields frames incrementally to limit RAM.
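Whichever decoder you use, training pipelines typically sample fixed-length clips from a longer video. A sketch of evenly spaced clip starts (helper name and scheme are my assumptions, not a torchvision API):

```python
def clip_starts(n_frames: int, clip_len: int, n_clips: int) -> list[int]:
    """Evenly spaced start indices for n_clips windows of clip_len frames.
    Assumes n_frames >= clip_len."""
    last = n_frames - clip_len          # latest valid start index
    if n_clips == 1:
        return [last // 2]              # single clip: take the center
    step = last / (n_clips - 1)
    return [round(i * step) for i in range(n_clips)]

# 100 frames, 16-frame clips, 4 clips -> [0, 28, 56, 84]
```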

Takeaways

  • BGR in OpenCV vs RGB in many deep models—convert with cv2.cvtColor when needed.
  • Temporal methods need consistent FPS or explicit timestamps.
  • Next: optical flow ties neighboring frames through motion fields.
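The BGR-to-RGB conversion in the first bullet is just a reversal of the last axis; a minimal NumPy sketch, equivalent in result to cv2.cvtColor with COLOR_BGR2RGB:

```python
import numpy as np

def bgr_to_rgb(frame: np.ndarray) -> np.ndarray:
    """Reverse the channel axis of an (H, W, 3) frame.
    copy() makes the result contiguous for downstream libraries."""
    return frame[..., ::-1].copy()
```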

Quick FAQ

Q: Why is video reading slow, and what helps?
A: Decoding and disk I/O dominate; use a smaller resolution, skip frames, use GPU decoders, or extract frames offline.
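Frame skipping can be done by mapping source indices onto a lower target rate and keeping a frame only when the target index advances; a pure-Python sketch (assumed helper, not a cv2 API):

```python
def keep_frame(i: int, src_fps: float, dst_fps: float) -> bool:
    """True if frame i of a src_fps stream should be kept to emulate dst_fps."""
    if i == 0:
        return True
    return int(i * dst_fps / src_fps) != int((i - 1) * dst_fps / src_fps)

# 30 fps -> 10 fps keeps every third frame:
kept = [i for i in range(9) if keep_frame(i, 30.0, 10.0)]  # -> [0, 3, 6]
```

Combine this with cap.grab() (decode-free advance) for skipped frames to save further work.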

Q: Can I get grayscale (or raw) frames directly?
A: Some backends honor cap.set(cv2.CAP_PROP_CONVERT_RGB, 0) to skip color conversion (you then get the device's raw format); otherwise convert after reading with cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).