Semantic and instance segmentation — Glossary Aria Research

Extended definition

Semantic and instance segmentation are fundamental computer vision tasks that operate at pixel granularity, in contrast to classification (which assigns one label to the whole image) and detection (which localizes objects with bounding boxes).

Semantic segmentation assigns a class label to every pixel of the image without distinguishing individual instances: all “car” pixels receive the same label, whether they belong to one car or to three distinct cars. Long, Shelhamer, and Darrell (2015, CVPR) introduced Fully Convolutional Networks (FCN), which replace fully connected layers with convolutional ones and enable end-to-end training for segmentation. The dominant architectures have evolved since then: U-Net (Ronneberger et al., 2015, originally for medical images), SegNet, DeepLab (Chen et al., 2017, with atrous convolutions and a CRF), and Mask2Former (2022, transformer-based).

Instance segmentation goes further and distinguishes individual objects of the same class: three cars receive three distinct masks. He et al. (2017, ICCV) introduced Mask R-CNN, which extends Faster R-CNN with a mask head that runs in parallel with classification and bounding-box regression; it has been the area’s standard reference since then. Panoptic segmentation unifies both tasks (Kirillov et al., 2019).

The standard metric for semantic segmentation is mean Intersection over Union (mIoU), computed per class and then averaged:

$$\text{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{|P_c \cap G_c|}{|P_c \cup G_c|}$$

where $C$ is the number of classes, $P_c$ is the set of pixels predicted as class $c$, and $G_c$ is the set of ground-truth pixels of class $c$. For instance segmentation, the standard metric is Average Precision (AP) computed with mask IoU, as in the COCO instance segmentation benchmark.
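
The mIoU definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not a benchmark implementation: the function names `per_class_iou` and `mean_iou` and the toy label maps are assumptions made for the example, the score is computed on a single image (benchmarks typically accumulate intersections and unions over the whole dataset before dividing), and classes absent from both prediction and ground truth are skipped rather than counted as zero.

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    """Return one IoU per class; NaN where the class appears in neither map."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        p = pred == c            # P_c: pixels predicted as class c
        g = target == c          # G_c: ground-truth pixels of class c
        union = np.logical_or(p, g).sum()
        if union > 0:
            ious[c] = np.logical_and(p, g).sum() / union
    return ious

def mean_iou(pred, target, num_classes):
    """Average the per-class IoU values, ignoring absent classes."""
    return float(np.nanmean(per_class_iou(pred, target, num_classes)))

# Toy 4x4 label maps with three classes (0, 1, 2); values are illustrative.
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 0, 2, 2],
                   [0, 0, 2, 2]])
pred = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [0, 0, 0, 2],
                 [0, 0, 0, 2]])

ious = per_class_iou(pred, target, num_classes=3)
miou = mean_iou(pred, target, num_classes=3)
print("per-class IoU:", ious)   # class 0: 0.8, class 1: 1.0, class 2: 0.5
print("mIoU:", round(miou, 4))  # 0.7667
```

In this toy case the average of about 0.77 hides the fact that class 2 reaches only 0.5, which is why the per-class values are worth printing alongside the mean.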

When it applies

Semantic segmentation applies to tasks that require precise per-pixel delineation of regions: medical image analysis (tumor segmentation in MRI and CT; histopathological tissue segmentation), remote sensing (land-use mapping, deforestation detection, glacier monitoring), precision agriculture (plant segmentation by plot), robotics (scene understanding for navigation), and augmented reality (person/object segmentation for overlays). Instance segmentation applies when distinguishing individual objects matters: precise cell counting in microscopy, identification of individual animals in camera-trap images, fruit cluster analysis for robotic harvesting. Panoptic segmentation is appropriate when the complete scene matters (for example, urban scenes for autonomous vehicles).

When it does not apply

It does not apply when a bounding box is sufficient: a detector such as YOLO is much cheaper and adequate. It does not apply when whole-image classification is the objective: classifiers such as ResNet or ViT are appropriate. It does not apply directly without pixel-annotated training data: annotation is expensive (10-30 minutes per complex image). Transfer learning from models pretrained on COCO, Cityscapes, or ADE20K, followed by fine-tuning, is the standard strategy, but very divergent domains (specialized radiology) may still require extensive annotation. It does not apply to video with strong temporal coherence requirements without extension (video segmentation has its own literature). It does not suffice as a single tool in tasks that require deeper semantic understanding (relations between objects, actions); vision-language models can complement it there.
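
As a rough illustration of the annotation cost mentioned above, the per-image figure of 10-30 minutes can be turned into a project-level estimate. The dataset size of 1,000 images and the helper name `annotation_hours` are assumptions for the example; real budgets also depend on the number of classes, annotator expertise, and review passes, none of which are modeled here.

```python
def annotation_hours(num_images, minutes_per_image):
    """Convert a per-image annotation time into total hours."""
    return num_images * minutes_per_image / 60.0

num_images = 1000                       # hypothetical dataset size
low = annotation_hours(num_images, 10)  # optimistic: 10 minutes per image
high = annotation_hours(num_images, 30) # pessimistic: 30 minutes per image

print(f"Estimated annotation effort: {low:.0f} to {high:.0f} hours")
# prints "Estimated annotation effort: 167 to 500 hours"
```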

Applications by field

- Medical imaging: tumor segmentation, organ segmentation, histopathology; benchmarks like BraTS, KiTS.
- Remote sensing: land cover segmentation (Sentinel-2, Landsat); environmental monitoring.
- Autonomous vehicles: semantic segmentation of urban scenes (Cityscapes); perception pipeline.
- Biological research: cell counting in confocal microscopy; plant species identification in the field.

Common pitfalls

The first pitfall is confusing semantic and instance segmentation in counting projects: if the objective is to count individual cells, semantic segmentation returns only the total area, so instance segmentation is needed. The second is training on COCO and applying the model directly to a specialized domain (microscopy, radiology): generic pretraining helps initialization, but fine-tuning on domain data is practically mandatory. The third is reporting only global mIoU while ignoring rare classes: a minority class can have a low IoU that is masked by the average, so reporting IoU per class is good reporting practice. The fourth is neglecting annotation cost in the project proposal: a dataset of 1,000 images with 20 classes can require 200-500 hours of specialized annotation. The fifth is assuming that a fine mask equates to semantic understanding: a model can segment precisely without capturing the semantic relation relevant to the scientific hypothesis.
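
The first pitfall can be made concrete with a toy example. The sketch below assumes a hypothetical `instance_map` in which 0 is background and 1 to 3 are three cells, two of which touch. Collapsing it to a binary mask mimics what a semantic model provides; counting connected components in that mask (here with a small hand-written flood fill, `count_components`, used only for illustration) undercounts because touching cells merge into one region.

```python
from collections import deque

import numpy as np

def count_components(mask):
    """Count 4-connected foreground regions with a breadth-first flood fill."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros(mask.shape, dtype=bool)
    height, width = mask.shape
    count = 0
    for r in range(height):
        for c in range(width):
            if mask[r, c] and not seen[r, c]:
                count += 1                      # new, unvisited region found
                seen[r, c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            queue.append((ny, nx))
    return count

# Hypothetical instance labels: cells 2 and 3 share a border.
instance_map = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 2, 2, 3],
    [0, 0, 0, 2, 2, 3],
    [0, 0, 0, 0, 0, 0],
])
semantic_mask = instance_map > 0                # "cell" vs. background only

area = int(semantic_mask.sum())                 # total cell area in pixels
components = count_components(semantic_mask)    # regions visible in the semantic mask
instances = len(np.unique(instance_map)) - 1    # distinct instance ids, minus background

print(area, components, instances)              # prints "10 2 3"
```

The semantic mask yields the correct area (10 pixels) but only two regions, while the instance labels recover all three cells.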