Instance Segmentation: 20 Essential Q&A

Separate masks per object instance—Mask R-CNN and the overlap problem.

~12 min read · 20 questions · Advanced

Mask R-CNN, RoIAlign, mask AP, FCOS


1 What is instance segmentation? ⚡ easy

Answer: Each object instance gets its own binary mask and class label. Two pixels labeled “person” belong to different instances if they lie on different people.

2 Semantic vs instance? 📊 medium

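Answer: Semantic segmentation assigns every pixel a class, so all objects of one class share a single mask. Instance segmentation produces N masks for N objects, even when they share a class, and keeps touching or overlapping objects apart through distinct instance IDs.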
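
To make the difference concrete, here is a minimal sketch that collapses instance masks into a semantic map. The function name `instances_to_semantic`, the class id 1 for "person", and the toy 2×4 image are illustrative assumptions; plain lists are used to keep the example dependency-free.

```python
def instances_to_semantic(instances, height, width, background=0):
    """Collapse (class_id, mask) pairs into a single class-id map."""
    semantic = [[background] * width for _ in range(height)]
    for class_id, mask in instances:
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    semantic[y][x] = class_id
    return semantic

# Two hypothetical "person" instances (class id 1) side by side.
person_a = [[1, 1, 0, 0],
            [1, 1, 0, 0]]
person_b = [[0, 0, 1, 1],
            [0, 0, 1, 1]]
instances = [(1, person_a), (1, person_b)]

semantic = instances_to_semantic(instances, height=2, width=4)
print(len(instances))  # instance view: two separate masks
print(semantic)        # semantic view: one "person" region, identities lost
```

The conversion only goes one way: the semantic map cannot tell you where one person ends and the next begins.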

3 How does Mask R-CNN extend Faster R-CNN? 📊 medium

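Answer: It adds a mask head in parallel with the box and class heads: a small FCN applied to each RoI predicts an m×m binary mask for each of the K classes, and the network is trained multi-task on box, class, and mask losses.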

4 Why RoIAlign? 🔥 hard

Answer: RoIPool quantizes RoI coordinates to the feature grid, which misaligns the pooled features with the object. RoIAlign skips the rounding and uses bilinear sampling at exact float locations, which is critical for pixel-accurate masks.
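
The effect of quantization can be sketched on a single sample point. This is a simplified illustration, not the full RoIAlign operator (which averages several sampled points per output bin); `bilinear_sample` and `quantized_sample` are hypothetical helper names, and integer truncation stands in for RoIPool's rounding step.

```python
import math

def bilinear_sample(feature, y, x):
    """Sample a 2D feature map at a float (y, x) location."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def quantized_sample(feature, y, x):
    """Simplified RoIPool-style lookup: snap to an integer cell first."""
    return feature[int(y)][int(x)]

feature = [[0.0, 10.0],
           [20.0, 30.0]]
aligned = bilinear_sample(feature, 0.5, 0.5)   # blends all four neighbors
snapped = quantized_sample(feature, 0.5, 0.5)  # reads cell (0, 0) only
print(aligned, snapped)
```

The snapped lookup ignores the fractional part of the coordinate, which is the source of the misalignment described above.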

5 Mask branch output? 📊 medium

Answer: Typically 28×28 logits per RoI from a lightweight FCN. At inference they pass through a sigmoid, are resized to the RoI box size, and are binarized with a threshold (commonly 0.5).
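
A rough sketch of that post-processing step follows. It assumes nearest-neighbor resizing for brevity (real implementations typically interpolate bilinearly), uses a toy 2×2 grid in place of 28×28, and `paste_mask` is a hypothetical name.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def paste_mask(logits, box_h, box_w, threshold=0.5):
    """Resize a small logit grid to the box size and binarize it."""
    m_h, m_w = len(logits), len(logits[0])
    out = []
    for y in range(box_h):
        # Map the center of each output pixel back to the logit grid.
        src_y = min(int((y + 0.5) * m_h / box_h), m_h - 1)
        row = []
        for x in range(box_w):
            src_x = min(int((x + 0.5) * m_w / box_w), m_w - 1)
            prob = sigmoid(logits[src_y][src_x])
            row.append(1 if prob > threshold else 0)
        out.append(row)
    return out

logits = [[4.0, -4.0],
          [-4.0, 4.0]]  # toy stand-in for a 28x28 mask head output
binary = paste_mask(logits, box_h=4, box_w=4)
for row in binary:
    print(row)
```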

6 Loss on masks? 📊 medium

Answer: Per-pixel sigmoid + BCE on the target class mask only (not softmax over all classes per pixel in the classic formulation).
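
As a minimal sketch of this loss, the function below averages binary cross-entropy over the pixels of one channel only, the channel of the ground-truth class. The name `mask_bce_loss` and the toy inputs are assumptions for illustration.

```python
import math

def mask_bce_loss(class_logits, target_mask, target_class):
    """Mean per-pixel BCE on the ground-truth class channel only."""
    logits = class_logits[target_class]
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, target_mask):
        for z, t in zip(logit_row, target_row):
            # Numerically stable BCE with logits.
            total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count

target = [[1, 0]]
# Channel 0 is the ground-truth class; channel 1 belongs to another class.
logits_a = [[[0.0, 0.0]], [[50.0, 50.0]]]
logits_b = [[[0.0, 0.0]], [[-50.0, 7.0]]]

loss_a = mask_bce_loss(logits_a, target, target_class=0)
loss_b = mask_bce_loss(logits_b, target, target_class=0)
print(loss_a, loss_b)  # identical: other class channels do not contribute
```

Because the other channels never enter the loss, classes do not compete per pixel, which is the point of avoiding a softmax here.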

7 Can two instance masks overlap in GT? ⚡ easy

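Answer: Yes, in datasets that annotate it, for example where one object stands in front of another. A model can either predict independent masks per instance, which may overlap, or resolve an ordering so that each pixel goes to a single instance.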

8 Panoptic segmentation? 📊 medium

Answer: Unifies semantic “stuff” and instance “things” with non-overlapping full-scene labeling—each pixel has one label + optional instance id.
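
One common heuristic for getting from overlapping instance masks to a non-overlapping labeling is to paint instances in order of confidence, so the higher-scoring instance keeps contested pixels. The sketch below assumes that heuristic; `merge_to_panoptic`, the segment ids, and the scores are hypothetical.

```python
def merge_to_panoptic(instances, height, width):
    """instances: list of (score, segment_id, mask); 0 means unassigned."""
    panoptic = [[0] * width for _ in range(height)]
    # Highest score first, so it claims overlapping pixels.
    for score, segment_id, mask in sorted(instances, key=lambda i: -i[0]):
        for y in range(height):
            for x in range(width):
                if mask[y][x] and panoptic[y][x] == 0:
                    panoptic[y][x] = segment_id
    return panoptic

front = (0.9, 1, [[1, 1, 0]])
back = (0.6, 2, [[0, 1, 1]])  # overlaps `front` at the middle pixel
panoptic = merge_to_panoptic([back, front], height=1, width=3)
print(panoptic)  # the middle pixel goes to the higher-scoring segment
```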

9 What is YOLACT? 📊 medium

Answer: One-stage: combines prototype masks with per-instance coefficients for fast instance segmentation—speed-quality tradeoff.
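
The assembly step can be sketched as a linear combination of prototypes followed by a sigmoid. This is a simplified illustration: it omits the box crop and thresholding that follow in practice, and `assemble_mask` plus the two toy prototypes are assumptions.

```python
import math

def assemble_mask(prototypes, coefficients):
    """Weighted sum of prototype masks, squashed to per-pixel probabilities."""
    h, w = len(prototypes[0]), len(prototypes[0][0])
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            s = sum(c * p[y][x] for c, p in zip(coefficients, prototypes))
            row.append(1.0 / (1.0 + math.exp(-s)))
        mask.append(row)
    return mask

prototypes = [
    [[1.0, 0.0], [1.0, 0.0]],  # responds to the left half
    [[0.0, 1.0], [0.0, 1.0]],  # responds to the right half
]
# An instance on the left: positive weight on prototype 0, negative on 1.
left = assemble_mask(prototypes, [5.0, -5.0])
print(left)
```

The prototypes are shared by the whole image, so each extra instance costs only a small coefficient vector and one weighted sum.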

10 SOLO / SOLOv2 idea? 🔥 hard

Answer: Define instance by grid location and scale—predict category and mask for each grid cell without anchors in the traditional sense.
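
A simplified sketch of the location-based assignment: the object's center picks one cell of an S×S grid, and that cell's index picks the mask channel. Actual SOLO assigns a small center region and splits objects across FPN levels by scale, which is omitted here; `solo_cell` is a hypothetical helper.

```python
def solo_cell(center_x, center_y, img_w, img_h, grid_size):
    """Return (row, col, mask_channel) for an object center."""
    col = min(int(center_x / img_w * grid_size), grid_size - 1)
    row = min(int(center_y / img_h * grid_size), grid_size - 1)
    return row, col, row * grid_size + col

# Object centered at (300, 100) in a 400x200 image with a 4x4 grid.
row, col, channel = solo_cell(300, 100, img_w=400, img_h=200, grid_size=4)
print(row, col, channel)
```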

11 DETR for masks? 🔥 hard

Answer: Set prediction with mask head or panoptic head—queries attend to image features to produce instance masks end-to-end.

12 What is mask AP? 📊 medium

Answer: Average precision computed with mask IoU instead of box IoU. It is COCO's primary instance segmentation metric, averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05.
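
The building block of mask AP is the IoU between a predicted and a ground-truth mask. The sketch below shows only that piece; full AP also ranks detections by score and integrates a precision-recall curve, which is not shown. `mask_iou` and the toy masks are illustrative.

```python
def mask_iou(a, b):
    """Intersection over union of two binary masks of equal shape."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    return inter / union if union else 0.0

pred = [[1, 1, 1, 0]]
gt = [[1, 1, 1, 1]]
iou = mask_iou(pred, gt)

# COCO-style thresholds 0.50, 0.55, ..., 0.95.
thresholds = [t / 100 for t in range(50, 100, 5)]
matched = sum(iou >= t for t in thresholds)
print(iou, matched)  # this prediction counts as a match at `matched` thresholds
```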

13 Polygon vs raster? ⚡ easy

Answer: Datasets may store masks as polygons or as run-length encoding (RLE), as COCO does. Training usually rasterizes them to fixed-resolution binary masks for the loss.
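
A minimal sketch of uncompressed run-length encoding, assuming the column-major, zeros-first convention used by COCO's uncompressed RLE. The functions `rle_encode` and `rle_decode` are written here for illustration and are not the pycocotools API.

```python
def rle_encode(mask):
    """Encode a binary mask as alternating run lengths, starting with zeros."""
    h, w = len(mask), len(mask[0])
    flat = [mask[y][x] for x in range(w) for y in range(h)]  # column-major
    counts, current, run = [], 0, 0
    for v in flat:
        if v == current:
            run += 1
        else:
            counts.append(run)
            current, run = v, 1
    counts.append(run)
    return {"size": [h, w], "counts": counts}

def rle_decode(rle):
    """Invert rle_encode back to a list-of-rows binary mask."""
    h, w = rle["size"]
    flat, value = [], 0
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = 1 - value
    return [[flat[x * h + y] for x in range(w)] for y in range(h)]

mask = [[0, 1, 1],
        [0, 1, 0]]
rle = rle_encode(mask)
print(rle)
restored = rle_decode(rle)
```

A mask whose first pixel is foreground starts with a zero-length run, so the alternation always begins with background.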

14 COCO stuff vs things? 📊 medium

Answer: Things are countable instances; stuff is amorphous (grass, sky)—panoptic benchmark merges both.

15 Small instances? 📊 medium

Answer: High-res FPN levels, copy-paste augmentation, and specialized heads help—same challenges as object detection.

16 Why slower than detection? ⚡ easy

Answer: Extra per-RoI mask computation and higher memory—one-stage mask methods aim to close the gap.

17 Role of FPN? 📊 medium

Answer: FPN supplies multi-scale proposals and features, so each RoI is pooled from a pyramid level that matches its size and both small and large instances get good mask features.
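
The RoI-to-level mapping from the FPN paper, k = floor(k0 + log2(sqrt(w·h) / 224)) with k0 = 4, can be sketched as below. The clamp to levels 2 through 5 and the name `fpn_level` are assumptions for this example; frameworks differ in the exact level range.

```python
import math

def fpn_level(box_w, box_h, k0=4, canonical=224, k_min=2, k_max=5):
    """Pick the pyramid level an RoI of this size is pooled from."""
    k = math.floor(k0 + math.log2(math.sqrt(box_w * box_h) / canonical))
    return max(k_min, min(k_max, k))

for size in (32, 112, 224, 448):
    print(size, fpn_level(size, size))  # smaller boxes map to finer levels
```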

18 HTC / Cascade? 🔥 hard

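Answer: Hybrid Task Cascade (HTC) and Cascade Mask R-CNN refine boxes and masks iteratively across cascaded stages; HTC also interleaves the box and mask branches and fuses information between tasks. These models were among the strongest on COCO leaderboards of their era.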

19 Refine boundaries? 🔥 hard

Answer: Methods like PointRend adaptively sample points on uncertain boundaries for fine mask prediction—better edges.
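
The point selection idea can be sketched as picking the pixels whose foreground probability is closest to 0.5. This covers only the selection step, not the point head that re-predicts those pixels; `most_uncertain_points` and the toy probabilities are assumptions.

```python
def most_uncertain_points(prob_mask, num_points):
    """Return (y, x) of the pixels whose probability is closest to 0.5."""
    scored = []
    for y, row in enumerate(prob_mask):
        for x, p in enumerate(row):
            scored.append((abs(p - 0.5), y, x))
    scored.sort()
    return [(y, x) for _, y, x in scored[:num_points]]

# Confident foreground on the left, background on the right,
# and an uncertain boundary column in the middle.
prob = [[0.99, 0.60, 0.02],
        [0.98, 0.45, 0.01]]
points = most_uncertain_points(prob, num_points=2)
print(points)
```

Spending the extra computation only on these points is what keeps boundary refinement cheap compared with predicting a dense high-resolution mask.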

20 Annotation? ⚡ easy

Answer: Instance masks are among the most expensive labels to produce. Interactive annotation tools, synthetic data, and weak supervision are active research areas for reducing that cost.

Instance Segmentation Cheat Sheet

Key model: Mask R-CNN, RoIAlign

Metric: Mask AP

Fast: YOLACT, Query-based

💡 Pro tip: RoIAlign avoids the coordinate rounding of RoIPool; the resulting sub-pixel misalignment is what hurts masks.

