YOLO (You Only Look Once) — Glossary Aria Research

Extended definition

YOLO (You Only Look Once) is a family of real-time object detection models that reformulated detection as a direct regression problem: in a single forward pass of the image through the network, the model simultaneously predicts bounding boxes, confidence scores, and object classes. Redmon et al. (2016, CVPR) introduced the approach in contrast to earlier two-stage pipelines (R-CNN, Fast R-CNN, Faster R-CNN), which first propose candidate regions and then classify them. The central metric is Intersection over Union (IoU) between the predicted and ground-truth bounding boxes:

$$\text{IoU} = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ is the predicted box and $B$ the ground truth. A detection is typically counted as positive when IoU > 0.5. mAP (mean Average Precision) is the standard aggregate metric: AP is computed per class and then averaged over classes. mAP@0.5 and mAP@0.5:0.95 (averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05) are the figures usually reported in CVPR/ICCV papers.

Family evolution: YOLOv1 (2016), YOLOv2/v3 (Redmon, 2017-2018), YOLOv4 (Bochkovskiy et al., 2020, arXiv 2004.10934), YOLOv5 (Ultralytics, 2020), YOLOv7-v9 (2022-2024), YOLOv11 (2024); each version targets a better speed-accuracy trade-off. Dominant implementations: Ultralytics YOLO (PyTorch) and the original Darknet. It is widely deployed in production: surveillance, autonomous vehicles, robotics, retail, agriculture.
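The IoU definition and the IoU > 0.5 rule given in this section can be sketched in a few lines of Python. This is a minimal illustration, not code from any YOLO release: the corner format `(x_min, y_min, x_max, y_max)` and the helper names `box_area`, `iou`, and `is_positive` are assumptions chosen for the example.

```python
from typing import Tuple

# Assumed convention: axis-aligned box as (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; empty or inverted boxes count as zero."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def iou(pred: Box, truth: Box) -> float:
    """Intersection over Union of predicted box A and ground-truth box B."""
    # Corners of the overlap rectangle (zero area if the boxes are disjoint).
    overlap = (
        max(pred[0], truth[0]),
        max(pred[1], truth[1]),
        min(pred[2], truth[2]),
        min(pred[3], truth[3]),
    )
    intersection = box_area(overlap)
    union = box_area(pred) + box_area(truth) - intersection
    return intersection / union if union > 0 else 0.0


def is_positive(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """Count a detection as correct when IoU exceeds the threshold."""
    return iou(pred, truth) > threshold
```

For example, a prediction `(0, 0, 10, 10)` against a ground truth `(0, 0, 10, 8)` overlaps on 80 of 100 units of area, so its IoU is 0.8 and it counts as positive at the 0.5 threshold.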

When it applies

YOLO applies to any object detection task that combines real-time or high-throughput constraints with the need for bounding boxes. It is widely used in surveillance and security systems, in autonomous vehicles for pedestrian, vehicle, and traffic-sign detection, in robotics for identifying manipulable objects, in automated retail (Amazon Go), in precision agriculture (pest and fruit detection), in sports (player and ball tracking), and in scientific research (cell counting in microscopy, animal identification in camera traps, drone-based ecological monitoring). It also fits into pipelines with further stages, for example YOLO detection followed by CNN or CLIP recognition in multimodal systems.
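The detection-plus-recognition pipeline described here can be outlined as follows. This is only a structural sketch: `toy_detector` and `toy_recognizer` are hypothetical stand-ins for a real detector and a real recognition model, and the image is a toy nested list rather than a real frame.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # assumed (x_min, y_min, x_max, y_max) in pixels


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by a box out of the image."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]


def detect_then_recognize(
    image: Image,
    detector: Callable[[Image], List[Box]],
    recognizer: Callable[[Image], str],
) -> List[Tuple[Box, str]]:
    """Stage 1 proposes boxes; stage 2 labels each cropped region."""
    return [(box, recognizer(crop(image, box))) for box in detector(image)]


# Hypothetical stand-in for a detector: always returns two fixed boxes.
def toy_detector(image: Image) -> List[Box]:
    return [(0, 0, 2, 2), (2, 0, 4, 2)]


# Hypothetical stand-in for a recognition model: labels a patch by brightness.
def toy_recognizer(patch: Image) -> str:
    pixels = [value for row in patch for value in row]
    return "bright" if sum(pixels) / len(pixels) > 127 else "dark"


image = [
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
labelled = detect_then_recognize(image, toy_detector, toy_recognizer)
```

Keeping the two stages behind plain callables means the detector can be swapped or fine-tuned without touching the recognition stage.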

When it does not apply

It does not apply directly to pixel-level segmentation: bounding boxes are too coarse, and Mask R-CNN or DeepLab are more appropriate. It does not apply to objects that are very small relative to the image (drones against the sky, tiny cells in a wide-field micrograph) without architectural adjustment; small scales are a chronic challenge in all versions. It does not apply directly to video detection with strong temporal dependencies between frames: YOLO works frame by frame, so a tracker (DeepSORT, ByteTrack) has to complement it. It does not apply to domains with heavily overlapping objects where a single box does not capture the geometry (dense crowds, fruit in clusters) without specific post-processing. It does not apply as a standalone solution in domains that lack appropriate training data: even COCO-pretrained models generally need fine-tuning for a specific domain.
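Part of the difficulty with heavily overlapping objects comes from non-maximum suppression (NMS), the duplicate-removal step that most YOLO pipelines run after prediction. The sketch below is a plain greedy NMS, not the implementation of any particular YOLO release; the box format, the `greedy_nms` name, and the toy boxes and scores are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max) with x_max >= x_min, y_max >= y_min.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float], iou_threshold: float) -> List[int]:
    """Return the indices of the boxes that survive, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


# Toy scene: boxes 0 and 1 are two distinct objects standing close together
# (IoU of about 0.54); box 2 is an isolated object.
boxes = [(0, 0, 10, 20), (3, 0, 13, 20), (40, 0, 50, 20)]
scores = [0.9, 0.8, 0.7]

strict = greedy_nms(boxes, scores, iou_threshold=0.5)  # [0, 2]: object 1 is lost
loose = greedy_nms(boxes, scores, iou_threshold=0.7)   # [0, 1, 2]: all three kept
```

Raising the threshold recovers the second object here, but it also lets more true duplicates through, which is why dense scenes usually need post-processing tuned to the domain.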

Applications by field

- Autonomous vehicles: pedestrian, vehicle, and sign detection; YOLO is common in perception stacks.
- Surveillance and security: real-time monitoring with IP cameras; integration with tracking.
- Scientific research: automated counting in microscopy, ecology (camera traps), precision agriculture.
- Robotics and industrial automation: object identification for manipulation; visual quality control.

Common pitfalls

The first pitfall is reporting only mAP without considering the speed-accuracy trade-off: YOLO's main strength is its position on the speed-accuracy frontier, so a comparison with Faster R-CNN on mAP alone is incomplete. The second is training on generic COCO and applying the model directly to a specialized domain: pretraining is a starting point, and fine-tuning is almost always necessary. The third is using the IoU = 0.5 threshold uncritically: applications that require precise localization (robotic surgery, fine handling) need a stricter threshold, such as IoU > 0.7. The fourth is failing to audit training-dataset bias: COCO has documented representational bias (common objects from Western cultural contexts are over-represented), and models inherit it. The fifth is confusing versions: YOLOv5/v7/v8/v11 have distinct communities, formats, and licenses, and mixing up their documentation produces subtle errors in fine-tuning.
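To make the third pitfall concrete, the sketch below computes Average Precision for one class on one image at two IoU thresholds. It is a simplified illustration under stated assumptions: greedy matching by descending confidence, all-point interpolation of the precision-recall curve, and the helper name `average_precision` are choices made for this example, and benchmark toolkits differ in such details.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_w * overlap_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0


def average_precision(
    preds: List[Tuple[Box, float]], truths: List[Box], iou_threshold: float
) -> float:
    """AP for one class: preds are (box, confidence) pairs, truths are boxes."""
    if not truths:
        return 0.0
    matched = [False] * len(truths)
    hits: List[bool] = []
    for box, _confidence in sorted(preds, key=lambda p: p[1], reverse=True):
        # Find the best still-unmatched ground-truth box for this prediction.
        best_iou, best_j = 0.0, -1
        for j, truth in enumerate(truths):
            overlap = iou(box, truth)
            if not matched[j] and overlap > best_iou:
                best_iou, best_j = overlap, j
        is_hit = best_j >= 0 and best_iou > iou_threshold
        if is_hit:
            matched[best_j] = True
        hits.append(is_hit)

    # Precision and recall after each prediction in the ranked list.
    precisions: List[float] = []
    recalls: List[float] = []
    true_positives = 0
    for rank, hit in enumerate(hits, start=1):
        true_positives += int(hit)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(truths))

    # Make precision non-increasing from right to left, then sum area over recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


truths = [(0, 0, 10, 10), (20, 0, 30, 10)]
preds = [((0, 0, 10, 10), 0.9),   # exact match, IoU 1.0
         ((20, 0, 30, 6), 0.8)]   # loose match, IoU 0.6

ap_50 = average_precision(preds, truths, iou_threshold=0.5)   # 1.0
ap_75 = average_precision(preds, truths, iou_threshold=0.75)  # 0.5
```

The same two predictions score a perfect AP at the 0.5 threshold and only half that at 0.75, because the loosely localized box stops counting as a detection.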