Semantic Segmentation: 20 Essential Q&A

Pixel-wise class labels, encoder–decoder designs, and how we score dense prediction.

FCN · U-Net · mIoU · Dice

1 What is semantic segmentation? ⚡ easy

Answer: Assigning a class label to every pixel (road, sky, person)—no distinction between different instances of the same class.

2 How does it differ from classification? ⚡ easy

Answer: Classification: one label per image. Semantic segmentation: dense spatial map of labels—requires localization and context.
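The difference in output shape can be made concrete with a minimal NumPy sketch. The logits here are random stand-ins for a network's output, and the names `seg_logits` and `cls_logits` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, height, width = 3, 4, 5

# Hypothetical segmentation logits: one score per class at every pixel.
seg_logits = rng.normal(size=(num_classes, height, width))
label_map = seg_logits.argmax(axis=0)      # shape (H, W): one class id per pixel

# Hypothetical classification logits: one score per class for the whole image.
cls_logits = rng.normal(size=(num_classes,))
image_label = int(cls_logits.argmax())     # a single class id

print(label_map.shape, image_label)
```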

3 What did FCN change? 📊 medium

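Answer: Replaced the fully connected layers with convolutions (the later ones become 1×1), so arbitrary input sizes work and the output is a spatial score map; learnable upsampling (transposed convolution, often called deconvolution) recovers resolution, and skip connections from shallower layers sharpen the result.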
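The point about arbitrary input sizes can be illustrated with a 1×1 convolution written as a per-pixel linear map. This is a sketch in NumPy, not FCN itself; the channel counts (64 in, 21 out) and the helper name `conv1x1` are assumptions chosen for illustration.

```python
import numpy as np

def conv1x1(features, weight, bias):
    """Apply a 1x1 convolution: features (C_in, H, W) -> (C_out, H, W)."""
    # The same linear map is applied independently at every pixel.
    return np.einsum("oc,chw->ohw", weight, features) + bias[:, None, None]

rng = np.random.default_rng(0)
weight = rng.normal(size=(21, 64))   # 21 output classes, 64 input channels (arbitrary)
bias = rng.normal(size=21)

# The same weights handle different spatial sizes, unlike a fixed-size dense layer.
small = conv1x1(rng.normal(size=(64, 8, 8)), weight, bias)
large = conv1x1(rng.normal(size=(64, 15, 20)), weight, bias)
print(small.shape, large.shape)
```

Because the weights never see the spatial dimensions, the output score map simply grows or shrinks with the input.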

4 Why does U-Net use skip connections? 📊 medium

Answer: The encoder downsamples for context and the decoder upsamples; skip connections fuse fine detail from shallow layers with deep semantic features, which yields sharper boundaries.
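The data flow of a single skip connection can be sketched with plain array operations. This is a shape-level illustration only: a real U-Net has learned convolutions at every stage and more channels in the deeper layers, and the helpers `avg_pool2x` and `upsample2x` are hypothetical stand-ins.

```python
import numpy as np

def avg_pool2x(x):
    """Downsample (C, H, W) by 2 with average pooling; H and W must be even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2x(x):
    """Nearest-neighbour upsampling of (C, H, W) by 2."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
shallow = rng.normal(size=(16, 32, 32))   # high-resolution encoder features
deep = avg_pool2x(shallow)                # stand-in for deeper, coarser features
decoded = upsample2x(deep)                # decoder restores the resolution

# Skip connection: concatenate shallow detail with upsampled deep features.
fused = np.concatenate([shallow, decoded], axis=0)
print(deep.shape, fused.shape)
```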

5 Common upsampling methods? 📊 medium

Answer: Transposed convolution, bilinear upsample + conv, sub-pixel convolution (pixel shuffle)—each trades artifacts, parameters, and speed differently.
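Of the three, the sub-pixel rearrangement is the least obvious, so here is a minimal NumPy sketch of it. It assumes the channel ordering used by common deep learning frameworks, where output pixel (h·r + i, w·r + j) of channel c comes from input channel c·r² + i·r + j; the function name `pixel_shuffle` is chosen for illustration.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C * r^2, H, W) into (C, H * r, W * r)."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # axes: (c, i, j, h, w)
    x = x.transpose(0, 3, 1, 4, 2)    # axes: (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)

channels = np.arange(16).reshape(4, 2, 2)   # 4 channels of a 2x2 map
upsampled = pixel_shuffle(channels, 2)      # 1 channel of a 4x4 map
print(upsampled.shape)
```

The operation itself has no parameters; in practice a convolution before it produces the C·r² channels, which is where the learning happens.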

6 What is mIoU? 📊 medium

Answer: Intersection over Union computed per class, IoU = TP / (TP + FP + FN), then averaged over classes; it measures the overlap between predicted and ground-truth masks and is the standard benchmark metric.
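A common way to compute this is from a confusion matrix. The sketch below assumes integer label maps with values in [0, num_classes) and skips classes that appear in neither map; it does not handle an ignore label, which benchmark code usually needs.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou(pred, target, num_classes):
    cm = confusion_matrix(pred, target, num_classes)
    intersection = np.diag(cm)                              # TP per class
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection  # TP + FP + FN
    present = union > 0                                     # skip absent classes
    return (intersection[present] / union[present]).mean()

target = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
print(mean_iou(pred, target, num_classes=3))  # per-class IoU: 1/3, 2/3, 1/2 -> 0.5
```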

7 What is Dice coefficient? 📊 medium

Answer: 2|A∩B|/(|A|+|B|), which equals the F1 score for binary masks; a soft (differentiable) version is a common loss for medical segmentation when the foreground is tiny.
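The formula and its link to F1 can be checked numerically on binary masks. This is a minimal sketch; the small `eps` term is an assumption added to avoid division by zero when both masks are empty.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2|A ∩ B| / (|A| + |B|) for boolean masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def f1_score(pred, target):
    """F1 = 2TP / (2TP + FP + FN) for boolean masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    return 2.0 * tp / (2.0 * tp + fp + fn)

target = np.array([[1, 1, 0], [0, 0, 0]])
pred = np.array([[1, 0, 0], [1, 0, 0]])
print(dice_coefficient(pred, target), f1_score(pred, target))
```

As a training loss, the hard masks are typically replaced by predicted probabilities and the loss is taken as 1 minus the Dice value.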

8 Standard loss? ⚡ easy

Answer: Per-pixel cross-entropy (softmax over classes); can weight rare classes or use focal variants for hard pixels.
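A minimal NumPy sketch of per-pixel cross-entropy with optional class weights follows. Normalising the weighted loss by the sum of the weights is one common convention, assumed here; the function name `pixel_cross_entropy` is illustrative.

```python
import numpy as np

def pixel_cross_entropy(logits, target, class_weights=None):
    """logits: (C, H, W); target: (H, W) integer labels. Returns a scalar loss."""
    shifted = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    rows, cols = np.indices(target.shape)
    nll = -log_probs[target, rows, cols]                    # (H, W) per-pixel loss
    if class_weights is None:
        return nll.mean()
    w = np.asarray(class_weights, dtype=float)[target]      # weight of each pixel's class
    return (w * nll).sum() / w.sum()                        # weighted mean

logits = np.zeros((3, 2, 2))                 # uniform prediction over 3 classes
target = np.array([[0, 1], [2, 0]])
print(pixel_cross_entropy(logits, target))   # log(3) for a uniform prediction
```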

9 Why are boundaries hard? 🔥 hard

Answer: Edges are ambiguous and thin structures disappear at low resolution. Fixes: deep supervision, boundary-aware losses, high-resolution branches, or larger input crops.

10 Handle class imbalance? 📊 medium

Answer: Weighted CE, oversampling rare classes, focal loss, dice loss, or balanced sampling in batches.
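Two of these remedies are easy to sketch in NumPy: inverse-frequency class weights and a focal loss. Both are simplified illustrations; the normalisation of the weights and the default `gamma=2.0` are assumptions, and the helper names are hypothetical.

```python
import numpy as np

def inverse_frequency_weights(target, num_classes):
    """Weight each class by 1 / pixel frequency, normalised to sum to 1."""
    counts = np.bincount(target.ravel(), minlength=num_classes).astype(float)
    freq = counts / counts.sum()
    weights = np.where(counts > 0, 1.0 / np.maximum(freq, 1e-12), 0.0)
    return weights / weights.sum()

def focal_loss(probs, target, gamma=2.0):
    """probs: (C, H, W) softmax outputs; target: (H, W) integer labels."""
    rows, cols = np.indices(target.shape)
    p_true = probs[target, rows, cols]          # probability of the correct class
    # (1 - p)^gamma shrinks the loss of easy, confidently correct pixels.
    return (-((1.0 - p_true) ** gamma) * np.log(p_true + 1e-12)).mean()

target = np.array([[0, 0, 0], [0, 0, 1]])       # class 1 is rare
weights = inverse_frequency_weights(target, num_classes=2)
probs = np.full((2, 2, 3), 0.5)
print(weights, focal_loss(probs, target))
```

With `gamma=0` the focal loss reduces to ordinary cross-entropy, which is a handy sanity check.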

11 What is ASPP? 🔥 hard

Answer: Atrous spatial pyramid pooling—parallel dilated convs at multiple rates capture multi-scale context without losing resolution (DeepLab family).
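The idea of parallel dilated branches can be sketched with a single-channel 3×3 convolution in NumPy. This is a simplification: a full ASPP module works on many channels and also includes a 1×1 branch, image-level pooling, and a fusing convolution. The rates (1, 6, 12, 18) are illustrative values.

```python
import numpy as np

def dilated_conv3x3(image, kernel, rate):
    """Single-channel 3x3 convolution with dilation `rate` and zero 'same' padding."""
    h, w = image.shape
    padded = np.pad(image, rate)
    out = np.zeros((h, w), dtype=float)
    for i in range(3):
        for j in range(3):
            # Tap (i, j) samples the input at offset ((i - 1) * rate, (j - 1) * rate).
            out += kernel[i, j] * padded[i * rate:i * rate + h, j * rate:j * rate + w]
    return out

def aspp_sketch(image, kernels, rates=(1, 6, 12, 18)):
    """Stack parallel dilated branches; the output keeps the input resolution."""
    return np.stack([dilated_conv3x3(image, k, r) for k, r in zip(kernels, rates)])

image = np.zeros((41, 41))
image[20, 20] = 1.0                              # single impulse in the centre
kernels = [np.ones((3, 3)) for _ in range(4)]
branches = aspp_sketch(image, kernels)
print(branches.shape)
```

Each branch still uses only nine weights, but the taps are spread further apart as the rate grows, so the receptive field widens without any downsampling.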

12 What is the PSPNet idea? 📊 medium

Answer: Pool the feature map at several grid sizes, upsample each pooled map, and concatenate them with the original features, giving every pixel rich global scene context.
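The pooling and concatenation steps can be sketched in NumPy as follows. Bin sizes (1, 2, 3, 6) follow the PSPNet paper; the nearest-neighbour upsampling and the absence of 1×1 channel-reduction convolutions are simplifications assumed for brevity.

```python
import numpy as np

def adaptive_avg_pool(x, bins):
    """Average-pool (C, H, W) into a (C, bins, bins) grid."""
    c, h, w = x.shape
    out = np.empty((c, bins, bins))
    for i in range(bins):
        r0, r1 = (i * h) // bins, -((-(i + 1) * h) // bins)   # floor start, ceil end
        for j in range(bins):
            c0, c1 = (j * w) // bins, -((-(j + 1) * w) // bins)
            out[:, i, j] = x[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out

def resize_nearest(x, h, w):
    """Nearest-neighbour resize of (C, h0, w0) to (C, h, w)."""
    rows = (np.arange(h) * x.shape[1]) // h
    cols = (np.arange(w) * x.shape[2]) // w
    return x[:, rows][:, :, cols]

def pyramid_pooling(x, bin_sizes=(1, 2, 3, 6)):
    """Concatenate the input with pooled-and-upsampled context maps."""
    _, h, w = x.shape
    branches = [resize_nearest(adaptive_avg_pool(x, b), h, w) for b in bin_sizes]
    return np.concatenate([x] + branches, axis=0)

features = np.random.default_rng(0).normal(size=(4, 12, 12))
context = pyramid_pooling(features)
print(context.shape)
```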

13 Multi-scale inference? 📊 medium

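Answer: Run the network on several scales and flipped inputs, map each prediction back to the original geometry, and average the logits (or probabilities); this usually boosts mIoU at extra inference cost.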
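The flip half of this recipe is short enough to sketch in NumPy. `toy_model` is a hypothetical stand-in for a trained network; multi-scale averaging follows the same pattern, with a resize before the forward pass and a resize back afterwards.

```python
import numpy as np

def flip_tta(model, image):
    """Average logits from the original and the horizontally flipped image."""
    logits = model(image)
    # Flip the input, predict, then flip the prediction back before averaging.
    flipped_logits = model(image[..., ::-1])[..., ::-1]
    return 0.5 * (logits + flipped_logits)

def toy_model(image):
    """Hypothetical network: (3, H, W) image -> (2, H, W) logits."""
    brightness = image.mean(axis=0)
    left_bias = np.linspace(1.0, 0.0, image.shape[-1])   # deliberately not flip-symmetric
    return np.stack([brightness + left_bias, 1.0 - brightness])

image = np.random.default_rng(0).random((3, 4, 6))
averaged = flip_tta(toy_model, image)
labels = averaged.argmax(axis=0)
print(averaged.shape, labels.shape)
```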

14 Weakly supervised segmentation? 🔥 hard

Answer: Train from image tags, scribbles, or bounding boxes using constraints (e.g. MIL, GrabCut-style seeds), so fewer pixel-level labels are needed.

15 Link to panoptic? 📊 medium

Answer: Panoptic segmentation gives every pixel a class label and, for countable “things”, an instance ID; “stuff” classes get only the semantic label. Semantic segmentation is therefore one component of full scene parsing.

16 Use CRF post-processing? 📊 medium

Answer: Dense CRFs were historically used to refine CNN outputs with pairwise smoothness terms; they are less common now that architectures are stronger, but the topic still comes up in interviews.

17 Can semantic segmentation separate two people? ⚡ easy

Answer: No—both get label “person”; need instance segmentation for separate masks.

18 Why is data expensive? ⚡ easy

Answer: Every image needs a pixel-accurate mask, which takes far longer to draw than bounding boxes; semi-automatic labeling tools and synthetic data help.

19 Transformers for segmentation? 🔥 hard

Answer: SegFormer, Mask2Former, Segmenter—global attention and mask queries compete with CNN encoders on benchmarks.

20 Real-time models? 📊 medium

Answer: Lightweight backbones (MobileNet), BiSeNet, Fast-SCNN—trade mIoU for FPS on edge devices.

Semantic Segmentation Cheat Sheet

Architecture

Encoder–decoder
Skips (U-Net)

Metric

mIoU
Dice (medical)

Context

ASPP / PSP
Multi-scale test

💡 Pro tip: Dense per-pixel labels; same class shares one semantic mask.
