Computer Vision Interview: 20 Essential Q&A (Updated 2026)

MS COCO: 20 Essential Q&A

Detection, segmentation, captions, and person keypoints—benchmark details interviewers expect.

~11 min read · 20 questions · Intermediate
Tags: instances · stuff · captions · OKS
1 What is MS COCO? ⚡ easy
Answer: Common Objects in Context—benchmark for detection, segmentation, captions, and keypoints with rich scene images.
2 Which tasks? 📊 medium
Answer: Object detection (bbox), instance segmentation, panoptic segmentation (things + stuff), image captioning, and person keypoints; each task has its own annotation format and evaluation metric.
3 JSON annotations? 📊 medium
Answer: COCO format lists images, categories, annotations with bbox [x,y,w,h], segmentation polygons/RLE, area, iscrowd flag.
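A minimal sketch of one annotation entry, written as a Python dict; all values here are illustrative:
ann = {
    "id": 1,                                   # annotation id (made-up value)
    "image_id": 42,                            # links to an entry in the "images" list
    "category_id": 18,                         # links to the "categories" list
    "bbox": [473.1, 88.9, 64.0, 128.5],        # [x, y, width, height] in pixels
    "segmentation": [[510.7, 90.2, 473.1, 217.4, 537.1, 216.9]],  # polygon(s) or RLE dict
    "area": 4123.5,                            # mask area in pixels
    "iscrowd": 0,                              # 1 for crowd regions stored as RLE
}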
4 Bbox format? ⚡ easy
Answer: Top-left x,y plus width,height in pixels—convert carefully vs xyxy conventions in codebases.
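A tiny helper for the conversion many frameworks need (the helper name is an illustration, not part of any COCO API):
def xywh_to_xyxy(box):
    # COCO [x, y, w, h] (top-left corner plus size) -> [x1, y1, x2, y2] corner format
    x, y, w, h = box
    return [x, y, x + w, y + h]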
5 Instance masks? 📊 medium
Answer: Stored as polygon lists for single instances (iscrowd=0) and compressed RLE for crowd regions (iscrowd=1); Mask R-CNN-style training decodes both to per-instance binary masks.
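Decoding with pycocotools, which handles both polygon and RLE storage; the annotation file path is illustrative:
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")    # illustrative path
ann = coco.loadAnns(coco.getAnnIds())[0]             # grab any annotation
binary_mask = coco.annToMask(ann)                    # H x W uint8 array, 1 inside the object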
6 Panoptic on COCO? 🔥 hard
Answer: Unifies semantic “stuff” and instance “things” with PQ metric—requires non-overlapping label assignment per pixel.
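A toy sketch of the PQ formula, assuming segment matching (true positives require IoU > 0.5) has already been done:
def panoptic_quality(matched_ious, num_fp, num_fn):
    # matched_ious: IoU of each matched (predicted, ground-truth) segment pair
    tp = len(matched_ious)
    return sum(matched_ious) / (tp + 0.5 * num_fp + 0.5 * num_fn)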
7 Captions? 📊 medium
Answer: Five human-written reference captions per image; evaluated with BLEU/CIDEr/SPICE, which reward descriptive, human-like captions.
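Caption annotations are deliberately simple; a sketch of one entry with made-up values:
caption_ann = {
    "id": 987,                                 # annotation id (made-up value)
    "image_id": 42,                            # the image being described
    "caption": "A dog catching a frisbee in a park.",
}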
8 Keypoints? 📊 medium
Answer: 17 body joints for person instances—AP computed with OKS instead of IoU for matching.
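A simplified sketch of OKS for one person; the per-keypoint constants k come from the COCO keypoint eval and are passed in here as an argument:
import numpy as np

def oks(pred, gt, vis, area, k):
    # pred, gt: (17, 2) keypoint coordinates; vis: (17,) ground-truth visibility flags
    # area: object segment area; k: (17,) per-keypoint falloff constants
    d2 = np.sum((pred - gt) ** 2, axis=1)             # squared distance per joint
    e = np.exp(-d2 / (2 * area * k ** 2 + np.spacing(1)))
    labeled = vis > 0
    return e[labeled].mean()                           # average over labeled joints only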
9 Detection mAP on COCO? 🔥 hard
Answer: Primary AP averaged over IoU thresholds 0.5:0.05:0.95 (AP@[.5:.95]) plus AP50, AP75—stricter than VOC AP50 only.
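Standard evaluation loop with pycocotools; file paths are illustrative, and iouType="segm" or "keypoints" gives mask AP and keypoint AP from the same code:
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")   # ground truth (illustrative path)
coco_dt = coco_gt.loadRes("detections.json")           # predictions in COCO results format
ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate(); ev.accumulate(); ev.summarize()         # prints AP@[.5:.95], AP50, AP75, AP_S/M/L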
10 Mask AP? 📊 medium
Answer: AP computed on mask IoU instead of box IoU—segmentation quality can differ from bbox AP ranking.
11 80 classes? ⚡ easy
Answer: 80 "thing" categories for detection, plus stuff classes in the panoptic/stuff annotations; category ids run 1-90 with gaps, so don't confuse them with the 91-category legacy list in some code.
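Most training code remaps the sparse ids to contiguous labels; a sketch (path illustrative):
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")       # illustrative path
cat_ids = sorted(coco.getCatIds())                      # 80 ids drawn from 1-90, with gaps
cat_to_label = {cid: i for i, cid in enumerate(cat_ids)}        # contiguous 0-79 labels for training
label_to_cat = {lbl: cid for cid, lbl in cat_to_label.items()}  # map back before writing results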
12 train/val/test? 📊 medium
Answer: train2017, val2017 public; test-dev hidden for leaderboard—papers report val metrics for fair comparison.
13 pycocotools? 📊 medium
Answer: Official eval code for mAP, mask IoU, RLE decode—implementations should match to reproduce leaderboard numbers.
from pycocotools.coco import COCO
coco = COCO("annotations.json")  # loads the file and builds image/annotation/category indexes
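A typical lookup once the index is built (which image gets picked here is arbitrary):
img = coco.loadImgs(coco.getImgIds()[0])[0]             # metadata dict: file_name, height, width, ...
anns = coco.loadAnns(coco.getAnnIds(imgIds=img["id"]))  # all annotations for that image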
14 Small objects? 📊 medium
Answer: COCO reports AP_S/AP_M/AP_L by object area (small < 32², medium 32²–96², large > 96² pixels); models struggle most on small objects, which anchor and FPN designs target.
15 iscrowd flag? 🔥 hard
Answer: Marks group/crowd regions where separating instances is ambiguous; they are stored as RLE and treated as ignore regions in evaluation, so detections matched to them are not counted as false positives.
16 Eval servers? ⚡ easy
Answer: Upload predictions for test sets—prevents test overfitting; val is for iteration.
17 Relation to LVIS? 📊 medium
Answer: Long-tail vocabulary extension—similar tooling, different frequency spectrum; training often joint with COCO.
18 Image source? ⚡ easy
Answer: Flickr-licensed photos of everyday scenes—more contextual clutter than ImageNet object-centric photos.
19 Why default benchmark? 📊 medium
Answer: Challenging scale, multi-task labels, standardized API—dominant for comparing detectors and instance seg models.
20 Common pitfalls? 📊 medium
Answer: Wrong bbox convention, ignoring iscrowd, different NMS thresholds, or not using official eval—numbers won’t match papers.
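Predictions go to the evaluator as a flat JSON list; a sketch of one detection entry with made-up values:
detection = {
    "image_id": 42,
    "category_id": 18,                      # original COCO id, not a contiguous training label
    "bbox": [473.1, 88.9, 64.0, 128.5],     # [x, y, w, h], not [x1, y1, x2, y2]
    "score": 0.93,
}
# dump a list of such dicts to JSON, then load with coco_gt.loadRes(...) and run COCOeval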

COCO Cheat Sheet

Det: AP@[.5:.95]
Seg: RLE masks
Tools: pycocotools

💡 Pro tip: COCO mAP is mean over IoU 0.5–0.95—not just AP50.

Full tutorial track

Go deeper with the matching tutorial chapter and code examples.
