Human annotation and inter-annotator agreement

Extended definition

Human annotation is the manual labeling of data (text, image, audio, video) by human annotators following explicit instructions; it is the basis of practically every supervised dataset in ML. Inter-annotator agreement (IAA, also called inter-rater reliability) measures concordance between two or more annotators applying the same scheme, and so diagnoses how consistent the task definition is and how good the instructions are. Central metrics: simple percent agreement (sensitive to class base rates and inflated when classes are unbalanced); Cohen’s kappa (1960, Educational and Psychological Measurement) for two annotators on a nominal scale, correcting for chance agreement; Fleiss’ kappa for more than two annotators; Krippendorff’s alpha, which handles several variable types (nominal, ordinal, interval) and missing data. Artstein and Poesio (2008, Computational Linguistics) offered the canonical technical review for NLP. A common rule of thumb for interpretation: kappa < 0.40 weak; 0.40-0.60 moderate; 0.60-0.80 substantial; > 0.80 near perfect. These cut-offs are conventions, not statistical thresholds. In critical datasets (health, justice), a threshold above 0.70 is often required. Adversarial annotation and consensus annotation (labels agreed after discussion) are variants for complex topics.
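The chance-corrected metrics above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names (percent_agreement, cohen_kappa, fleiss_kappa) and the toy labels are assumptions made for this example, and the Fleiss version assumes every item was labeled by the same number of annotators.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Share of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators on a nominal scale."""
    n = len(labels_a)
    p_o = percent_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label when each
    # annotator labels at random according to their own label frequencies.
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both used one identical label: kappa is undefined
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(item_labels):
    """Fleiss' kappa; item_labels holds one list of labels per item.

    Assumes every item has the same number of labels (raters).
    """
    n_items = len(item_labels)
    n_raters = len(item_labels[0]) if item_labels else 0
    if n_raters < 2 or any(len(labels) != n_raters for labels in item_labels):
        raise ValueError("every item needs the same number (>= 2) of labels")
    category_totals = Counter()
    agreement_sum = 0.0
    for labels in item_labels:
        counts = Counter(labels)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_items
    total = n_items * n_raters
    p_e = sum((t / total) ** 2 for t in category_totals.values())
    if p_e == 1.0:
        return float("nan")  # only one category was ever used
    return (p_bar - p_e) / (1 - p_e)


# Toy labels, invented for illustration only.
a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(percent_agreement(a, b), cohen_kappa(a, b))
```

Returning NaN when only one label is ever used is a design choice in this sketch: the statistic is undefined in that case, and some tools report it differently.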

When it applies

Human annotation applies to the creation of any supervised dataset: image classification, NER (named entity recognition), sentiment analysis, topic classification, medical image segmentation, audio transcription. IAA applies whenever annotated data will be used to train or evaluate a model: reporting IAA for published datasets is widely expected at top-tier ML/NLP venues (NeurIPS, ACL, EMNLP). It applies in coded qualitative research (categorized interviews, discourse analysis), where IAA brings rigor to categorization. It applies in systematic reviews, where two researchers independently screen titles and abstracts and measure concordance before resolving disagreements. It applies in health research involving clinical interpretation (kappa between radiologists or between pathologists).
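The systematic-review case can be sketched as follows: two reviewers screen the same records independently, agreement is measured, and only the conflicting records go forward to discussion. The function name screening_agreement, the record ids, and the "include"/"exclude" vocabulary are assumptions made for this illustration, not a standard protocol.

```python
from collections import Counter

VALID = {"include", "exclude"}


def screening_agreement(decisions_a, decisions_b):
    """Compare two reviewers' screening decisions (dicts: record id -> decision).

    Returns percent agreement, Cohen's kappa, and the ids to resolve.
    """
    shared = sorted(set(decisions_a) & set(decisions_b))
    if not shared:
        raise ValueError("the reviewers have no records in common")
    table = Counter()   # 2x2 contingency table keyed by (decision_a, decision_b)
    conflicts = []
    for record_id in shared:
        pair = (decisions_a[record_id], decisions_b[record_id])
        if not set(pair) <= VALID:
            raise ValueError(f"unexpected decision for {record_id}: {pair}")
        table[pair] += 1
        if pair[0] != pair[1]:
            conflicts.append(record_id)
    n = len(shared)
    p_o = (table["include", "include"] + table["exclude", "exclude"]) / n
    a_inc = (table["include", "include"] + table["include", "exclude"]) / n
    b_inc = (table["include", "include"] + table["exclude", "include"]) / n
    # Chance agreement from each reviewer's own inclusion rate.
    p_e = a_inc * b_inc + (1 - a_inc) * (1 - b_inc)
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else float("nan")
    return {"n": n, "percent_agreement": p_o, "kappa": kappa, "conflicts": conflicts}


# Hypothetical screening decisions for four records.
reviewer_1 = {"r1": "include", "r2": "exclude", "r3": "include", "r4": "exclude"}
reviewer_2 = {"r1": "include", "r2": "exclude", "r3": "exclude", "r4": "exclude"}
print(screening_agreement(reviewer_1, reviewer_2))
```

Measuring agreement before the conflicts are resolved matters here: once labels have been discussed and merged, the two sets of decisions are no longer independent.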

When it does not apply

IAA is uninformative for tasks with a single, indisputable ground truth (e.g., the sum of two numbers), where agreement is trivial. It does not replace construct validity: annotators can agree on a scheme that does not capture the phenomenon of interest. Crowdsourced annotation (Mechanical Turk, Prolific) does not work without careful instructions: inconsistent annotations contaminate the model. Full multi-annotator IAA is often impractical in domains where labeling requires specialized expertise and several specialists are expensive to recruit (radiology, jurisprudence); alternatives include multiple rounds with discussion of disagreements. It does not replace model validation: IAA measures label quality, not the performance of models trained on those labels.

Applications by field

- NLP: canonical datasets (CoNLL for NER, SST for sentiment) report IAA; a new dataset published without IAA is editorially questionable.
- Computer vision: bounding box annotation in ImageNet, COCO; semantic segmentation is expensive but IAA is fundamental.
- Health: kappa between pathologists or radiologists in histological diagnoses; clinical concordance studies.
- Qualitative research: content analysis, open coding with IAA between researchers; Krippendorff’s alpha in media studies.
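For content-analysis settings where not every coder labels every unit, Krippendorff’s alpha for nominal data can be sketched with a coincidence matrix. This is a minimal illustration under stated assumptions: the function name krippendorff_alpha_nominal is invented here, missing values are represented as None, only the nominal distance (labels either match or differ) is handled, and the toy data are made up.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list per unit, holding each coder's label or None if missing.
    """
    coincidences = Counter()
    for values in units:
        present = [v for v in values if v is not None]
        m = len(present)
        if m < 2:
            continue  # a unit with fewer than two labels carries no pairing information
        # Every ordered pair of labels from different coders counts 1/(m-1).
        for c, k in permutations(present, 2):
            coincidences[(c, k)] += 1 / (m - 1)
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more labels")
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # only one category was used: alpha is undefined
    return 1 - observed / expected


# Toy data: three coders, None marks a unit a coder did not label.
units = [
    ["a", "a", None],
    ["b", "b", "b"],
    ["a", "b", None],
    [None, "a", "a"],
]
print(krippendorff_alpha_nominal(units))
```

Ordinal or interval data would need a different distance function in place of the simple match/mismatch test used here.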

Common pitfalls

The first pitfall is relying on simple percent agreement with unbalanced classes: 90% agreement in a dataset with a 90% majority class can be largely due to chance. Chance-corrected measures such as kappa or alpha are more appropriate. The second is not training annotators adequately: ambiguities in the instructions produce systematic disagreements that contaminate the dataset. The third is failing to document the annotation protocol: reproducibility requires specifying the instructions, boundary examples, and tiebreaking rules. The fourth is failing to audit annotator bias: demographically homogeneous annotators can produce labels with systematic representational bias. The fifth is treating annotation as a bottleneck to eliminate: some fields (specialized health, legal annotation) require expertise that no amount of crowdsourcing can substitute.
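The first pitfall can be made concrete with a short calculation. Assume, purely for illustration, two annotators who each assign the majority label to 90% of items and who are observed to agree on 90% of items. With p_o the observed agreement and p_e the agreement expected by chance, Cohen’s kappa gives:

```latex
\begin{align*}
p_e    &= 0.9 \cdot 0.9 + 0.1 \cdot 0.1 = 0.82 \\
\kappa &= \frac{p_o - p_e}{1 - p_e}
        = \frac{0.90 - 0.82}{1 - 0.82}
        = \frac{0.08}{0.18} \approx 0.44
\end{align*}
```

Under these assumptions, a headline figure of 90% agreement corresponds to only moderate chance-corrected agreement.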