Computer Vision Chapter 40

Pose estimation

2D pose estimation predicts keypoints (joints) of the human body in the image plane, e.g. nose, shoulders, elbows, hips, knees, ankles. The COCO person format uses 17 points with a standard skeleton for drawing limbs. Top-down methods detect people first, then run a pose head on each crop; bottom-up methods predict all joints in the image, then group them into instances (OpenPose-style part affinity fields, or PAFs). HRNet maintains high-resolution feature maps throughout the network instead of recovering them from low-resolution ones; many mobile demos use MediaPipe Pose. Below: COCO-style overview and a short MediaPipe example.

COCO-17 keypoints (idea)

The order is nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles, with left before right for each paired joint. Each predicted point has (x, y) and often a confidence; a low confidence usually indicates an occluded or out-of-frame joint. Connect pairs with a fixed edge list to render a skeleton.
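The keypoint list and edge list described above can be sketched in plain Python. The names follow the COCO index order; the edge list and the 0.3 confidence threshold are illustrative assumptions (toolkits differ slightly in which edges they draw), and visible_edges is a hypothetical helper, not part of any library.

```python
# COCO-17 keypoint names in their standard index order.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# A 0-indexed edge list for drawing limbs (illustrative; toolkits vary).
SKELETON = [
    (0, 1), (0, 2), (1, 3), (2, 4),           # face
    (5, 6), (5, 7), (7, 9), (6, 8), (8, 10),  # shoulders and arms
    (5, 11), (6, 12), (11, 12),               # torso
    (11, 13), (13, 15), (12, 14), (14, 16),   # legs
]


def visible_edges(keypoints, min_conf=0.3):
    """Return drawable line segments for a single person.

    keypoints: list of 17 (x, y, confidence) tuples in COCO order.
    An edge is kept only if both endpoints reach min_conf
    (an assumed threshold; tune it per model).
    """
    segments = []
    for a, b in SKELETON:
        xa, ya, ca = keypoints[a]
        xb, yb, cb = keypoints[b]
        if ca >= min_conf and cb >= min_conf:
            segments.append(((xa, ya), (xb, yb)))
    return segments
```

Each returned segment can be passed to a line-drawing call such as cv2.line after rounding the coordinates to integers.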

MediaPipe Pose (Python)

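# pip install mediapipe opencv-python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_draw = mp.solutions.drawing_utils

# "person.jpg" is a placeholder path; OpenCV loads images as BGR.
img_bgr = cv2.imread("person.jpg")
if img_bgr is None:
    raise FileNotFoundError("could not read person.jpg")

# MediaPipe expects RGB input.
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
with mp_pose.Pose(static_image_mode=True) as pose:
    res = pose.process(img_rgb)
    # pose_landmarks is None when no person is detected.
    if res.pose_landmarks:
        # Draw the landmarks and skeleton onto the original BGR image.
        mp_draw.draw_landmarks(
            img_bgr, res.pose_landmarks, mp_pose.POSE_CONNECTIONS)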

For video, set static_image_mode=False and reuse the same Pose instance across frames for smoother tracking.
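MediaPipe Pose reports each landmark with x and y normalized to [0, 1] by image width and height, plus a visibility score, so drawing or measuring in pixels needs a conversion step. A minimal sketch follows; landmarks_to_pixels, FakeLandmark, and the 0.5 visibility threshold are hypothetical names and values introduced here for illustration. With real results you would pass res.pose_landmarks.landmark.

```python
from collections import namedtuple


def landmarks_to_pixels(landmarks, width, height, min_visibility=0.5):
    """Convert normalized landmarks to pixel coordinates.

    landmarks: iterable of objects with .x, .y (normalized to [0, 1])
    and .visibility, e.g. res.pose_landmarks.landmark.
    Returns one entry per landmark: (px, py), or None when the landmark
    is below min_visibility (assumed threshold) or outside the image.
    """
    points = []
    for lm in landmarks:
        inside = 0.0 <= lm.x <= 1.0 and 0.0 <= lm.y <= 1.0
        if lm.visibility < min_visibility or not inside:
            points.append(None)
            continue
        # Clamp so that x == 1.0 maps to the last valid pixel column/row.
        px = min(int(lm.x * width), width - 1)
        py = min(int(lm.y * height), height - 1)
        points.append((px, py))
    return points


# Stand-in for MediaPipe's landmark objects, used only for this demo.
FakeLandmark = namedtuple("FakeLandmark", "x y visibility")
demo = [
    FakeLandmark(0.5, 0.25, 0.9),  # visible and inside the frame
    FakeLandmark(0.1, 0.1, 0.2),   # low visibility
    FakeLandmark(1.2, 0.5, 0.9),   # predicted outside the frame
]
print(landmarks_to_pixels(demo, width=640, height=480))
# prints [(320, 120), None, None]
```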

OpenCV DNN (OpenPose-style)

OpenCV samples load Caffe/ONNX multi-branch models that output heatmaps and part affinity fields. You download the model files from the OpenCV GitHub wiki, run net.forward, then decode peaks and associate limbs—more code than MediaPipe but fully offline and customizable.
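The "decode peaks" step can be illustrated for the simplest case: one person and one heatmap channel per joint. The sketch below is dependency-free and assumes the channel has already been extracted as a 2D list of floats; decode_heatmap_peak and the 0.1 score threshold are hypothetical. Multi-person decoding additionally needs non-maximum suppression to find several peaks per channel and PAF scoring to associate them into limbs, which this sketch does not cover.

```python
def decode_heatmap_peak(heatmap, image_width, image_height, min_score=0.1):
    """Locate the strongest response in one joint heatmap (single person).

    heatmap: 2D list of floats (rows x cols), one channel of the output.
    Returns (x, y, score) in image pixels, or None if the peak score is
    below min_score (an assumed threshold).
    """
    rows, cols = len(heatmap), len(heatmap[0])
    # Global maximum over all cells, remembering its row and column.
    score, r, c = max(
        (heatmap[r][c], r, c) for r in range(rows) for c in range(cols)
    )
    if score < min_score:
        return None
    # Map the center of the winning cell back to image coordinates.
    x = (c + 0.5) * image_width / cols
    y = (r + 0.5) * image_height / rows
    return (x, y, score)


# Tiny 4x4 heatmap with its peak at row 1, column 2.
demo_heatmap = [
    [0.0, 0.1, 0.2, 0.0],
    [0.0, 0.3, 0.8, 0.1],
    [0.0, 0.1, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(decode_heatmap_peak(demo_heatmap, image_width=64, image_height=32))
# prints (40.0, 12.0, 0.8)
```

In practice the same argmax would run on the array returned by net.forward, one channel per joint, with the scaling taken from the original image size.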

3D pose

3D pose estimation extends the 2D task to camera-centered 3D joint coordinates, obtained by monocular lifting, multi-view fusion, or depth sensors. It is often paired with biomechanics or AR applications.
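For the depth-sensor route, a 2D keypoint plus its depth reading can be back-projected to camera-centered 3D with the pinhole camera model. This is a minimal sketch under stated assumptions: the depth map is aligned to the color image, lens distortion is ignored, and the intrinsics (fx, fy, cx, cy) come from calibration. The function names and the example intrinsics are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth.

    Returns camera-centered (X, Y, Z) in the same unit as depth.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_keypoints(keypoints_2d, depths, intrinsics):
    """Lift 2D keypoints to 3D using per-joint depth readings.

    keypoints_2d: list of (u, v) pixel coordinates.
    depths: list of depth values, one per keypoint; values <= 0 are
    treated as missing (a common convention for invalid depth pixels).
    intrinsics: (fx, fy, cx, cy).
    """
    fx, fy, cx, cy = intrinsics
    return [
        backproject(u, v, z, fx, fy, cx, cy) if z > 0 else None
        for (u, v), z in zip(keypoints_2d, depths)
    ]


# Example intrinsics for a 640x480 camera (made-up values).
joints_3d = lift_keypoints(
    [(320, 240), (420, 240), (100, 100)],
    [2.0, 2.0, 0.0],
    intrinsics=(500.0, 500.0, 320.0, 240.0),
)
```

A joint at the principal point maps to X = Y = 0, and the third joint comes back as None because its depth reading is missing.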

Takeaways

  • Normalize crops and augment data for robustness to scale and clothing.
  • Multi-person scenes need association (top-down boxes or bottom-up grouping).
  • Ethics: pose in public spaces raises consent and surveillance concerns.

Quick FAQ

For jittery keypoints in video, temporal smoothing (Kalman, exponential moving average) or higher input resolution often helps.
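An exponential moving average is the simplest of these smoothers. The sketch below assumes a single tracked person with a fixed keypoint order across frames; KeypointEMA and the default alpha of 0.5 are hypothetical choices. Smaller alpha values smooth more but add lag.

```python
class KeypointEMA:
    """Exponential moving average over per-frame keypoints."""

    def __init__(self, alpha=0.5):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.state = None  # smoothed keypoints from the previous frame

    def update(self, keypoints):
        """keypoints: list of (x, y) for the current frame.

        Returns the smoothed list of (x, y).
        """
        if self.state is None:
            # First frame: nothing to blend with yet.
            self.state = [(float(x), float(y)) for x, y in keypoints]
        else:
            a = self.alpha
            self.state = [
                (a * x + (1 - a) * sx, a * y + (1 - a) * sy)
                for (x, y), (sx, sy) in zip(keypoints, self.state)
            ]
        return self.state


ema = KeypointEMA(alpha=0.5)
ema.update([(0, 0)])
print(ema.update([(10, 20)]))
# prints [(5.0, 10.0)]
```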

Models may hallucinate hidden joints; use confidence thresholds and temporal consistency checks.
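One way to combine a confidence threshold with a temporal consistency check is to reject joints that are either low-confidence or have moved implausibly far since the previous frame. This is a sketch under assumptions: filter_unreliable is a hypothetical helper, and the 0.3 confidence and 50-pixel jump thresholds are placeholders that depend on the model, frame rate, and image size.

```python
import math


def filter_unreliable(prev, curr, min_conf=0.3, max_jump=50.0):
    """Drop joints that are low-confidence or jump too far between frames.

    prev: list of (x, y) or None from the previous frame.
    curr: list of (x, y, confidence) from the current frame.
    Returns a list of (x, y) or None, one entry per joint.
    """
    accepted = []
    for p, (x, y, conf) in zip(prev, curr):
        if conf < min_conf:
            accepted.append(None)  # likely occluded or out of frame
            continue
        if p is not None and math.hypot(x - p[0], y - p[1]) > max_jump:
            accepted.append(None)  # implausible jump, likely a bad guess
            continue
        accepted.append((x, y))
    return accepted


prev_frame = [(100, 100), (200, 200), None]
curr_frame = [(105, 100, 0.9), (400, 200, 0.9), (50, 50, 0.2)]
print(filter_unreliable(prev_frame, curr_frame))
# prints [(105, 100), None, None]
```

A rejected joint has no previous position on the next frame, so this version accepts whatever confident prediction arrives next; keeping the last accepted position instead is a reasonable variant.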