CROSS-CUTTING

Sensitive data in research

Data categories requiring extra protection: health, genetic data, sexual orientation, religion, financial status, geolocation. Regulated by LGPD (Brazil), GDPR (EU), HIPAA (US). Anonymization is not a final solution — re-identification is a growing risk.

Extended definition

Sensitive data are specific categories of personal data that, depending on the jurisdiction, receive additional regulatory protection because their exposure can cause discrimination, reputational harm, or other damage: health data, genetic data, biometrics, sexual orientation, gender identity, religion, political opinion, union membership, racial or ethnic origin, financial status, precise geolocation, and data of minors. LGPD (General Data Protection Law, Brazil, 2018) explicitly defines “sensitive personal data” and requires specific consent or another specific legal basis for processing. GDPR (General Data Protection Regulation, EU, 2018) defines a similar class, “special category data”, under Art. 9. HIPAA (US, 1996) regulates Protected Health Information (PHI) and lists 18 identifiers that must be removed for a dataset to be considered de-identified under its Safe Harbor method. Sweeney (2002, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems) proposed k-anonymity, a formal framework in which each record is indistinguishable from at least k-1 others on its quasi-identifiers. Ohm (2010, UCLA Law Review, “Broken Promises of Privacy”) documented re-identification cases in supposedly anonymized datasets (Netflix Prize, AOL search logs, Massachusetts hospital discharge data). Traditional anonymization is therefore not a final solution: combining quasi-identifiers with auxiliary data enables re-identification in many scenarios. More recent frameworks include differential privacy (Dwork, 2006) and refinements of k-anonymity such as l-diversity and t-closeness.
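As a rough illustration of the k-anonymity idea, the sketch below computes the k of a small table: the size of the smallest group of records that share the same quasi-identifier values. The records, the field names (sex, age_band, zip3), and the helper k_anonymity are hypothetical and chosen only for this example; which columns count as quasi-identifiers is an assumption that has to be made per dataset.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values on all quasi-identifier columns (the dataset's k)."""
    if not records:
        raise ValueError("records must not be empty")
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values())


# Hypothetical toy data; "diagnosis" is the sensitive attribute.
records = [
    {"sex": "F", "age_band": "30-39", "zip3": "021", "diagnosis": "asthma"},
    {"sex": "F", "age_band": "30-39", "zip3": "021", "diagnosis": "diabetes"},
    {"sex": "M", "age_band": "40-49", "zip3": "021", "diagnosis": "asthma"},
    {"sex": "M", "age_band": "40-49", "zip3": "021", "diagnosis": "flu"},
    {"sex": "M", "age_band": "40-49", "zip3": "100", "diagnosis": "flu"},
]

# With all three quasi-identifiers, the last record is alone in its group.
k_full = k_anonymity(records, ["sex", "age_band", "zip3"])

# Dropping the ZIP prefix merges it into a larger group.
k_coarse = k_anonymity(records, ["sex", "age_band"])

print(k_full, k_coarse)
```

A result of k = 1 means at least one record is unique on its quasi-identifiers. A higher k does not by itself guarantee safety, since all records in a group may share the same sensitive value, which is the gap l-diversity and t-closeness try to address.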

When it applies

Care with sensitive data applies to any research that collects, stores, processes, or shares personal data, which is effectively all empirical research with humans. It applies to clinical trials with health data; qualitative research in which participants are identified as members of minority groups; social media research (text may reveal political orientation or religion); genomic research with DNA banks; and geolocation research (mobile, urban). It applies to research data sharing under the FAIR principles: sensitive data require restricted access, not full open data. It applies to international collaboration: data collected in Brazil and analyzed in the US must comply with LGPD, may also fall under HIPAA when health data are involved, and need an international data transfer agreement. It applies to ML with clinical, legal, or financial data, where it is often a direct regulatory requirement.

When it does not apply

It does not apply to non-personal public data: aggregate official statistics and published case law (with care where identifiable individuals appear). It does not apply to purely theoretical or computational research without personal data. It does not fully apply to old data whose identifiable subjects have been deceased for over 50 years (this varies by jurisdiction). It does not replace ethical approval: legal compliance and IRB/CEP approval are complementary, not alternatives. The sensitive/non-sensitive classification is also not the only criterion: data not classified as sensitive can still cause harm if mishandled (e.g., detailed academic data can enable contextual identification).

Applications by field

Health: HIPAA in the US, LGPD in Brazil; encrypted storage, role-based access, audit logs.

Social media research: public data can reveal sensitive attributes; IRB ethics review is increasingly rigorous.

Genomics: banks like UK Biobank operate under restricted-use agreements; the Personal Genome Project asks participants for explicit consent to being identifiable.

Indigenous research: frameworks like the CARE Principles for Indigenous Data Governance complement FAIR; data sovereignty.

Common pitfalls

The first pitfall is trusting traditional anonymization as sufficient: the literature shows that combining quasi-identifiers (sex, age, ZIP code) often re-identifies individuals with high accuracy. The second is failing to document a Data Management Plan: funders increasingly require one, specifying storage, access, retention, and destruction. The third is sharing “anonymized” datasets without formal checking: before publishing, researchers should verify k-anonymity or release the data through a differentially private mechanism. The fourth is assuming that the original consent covers secondary uses: many consent terms are specific to the original study, and use for other research may require new consent. The fifth is neglecting international data transfer: LGPD and GDPR have specific rules about transfers to countries without an adequate level of protection, and a transfer mechanism such as standard contractual clauses is typically required.
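The first and third pitfalls can be made concrete with a small check. The sketch below measures what fraction of records is unique on its quasi-identifiers, before and after a simple generalization step. The records, the helper names unique_fraction and generalize, and the generalization rules (10-year age bands, 3-digit ZIP prefixes) are all assumptions made for illustration; they are not a compliance standard under any of the laws mentioned here.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination appears
    exactly once in the dataset (i.e. records that stand alone)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


def generalize(record):
    """Coarsen quasi-identifiers: 10-year age band, 3-digit ZIP prefix.
    These rules are illustrative assumptions only."""
    low = (record["age"] // 10) * 10
    return {
        **record,
        "age": f"{low}-{low + 9}",
        "zip": record["zip"][:3] + "**",
    }


# Hypothetical toy data.
raw = [
    {"sex": "F", "age": 34, "zip": "02139"},
    {"sex": "F", "age": 37, "zip": "02141"},
    {"sex": "M", "age": 45, "zip": "02139"},
    {"sex": "M", "age": 41, "zip": "02138"},
    {"sex": "F", "age": 62, "zip": "10027"},
]
QI = ["sex", "age", "zip"]

before = unique_fraction(raw, QI)
after = unique_fraction([generalize(r) for r in raw], QI)

print(before, after)
```

In this toy table every raw record is unique, and one record remains unique even after generalization. A record like that would need further coarsening or suppression before release, which is why a formal check is worth running rather than assuming that removing names is enough.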

Last updated —