Re-identification Risk

The danger that data presented as anonymous or aggregated can be linked back to a specific individual — either on its own or when combined with other available information.


What is it?

Re-identification is the process of taking data that was supposed to be anonymous and working out who it belongs to. It succeeds more often than most developers expect, because true anonymisation is much harder than it looks.

The core problem: removing obvious identifiers (name, email, phone number) is not enough. Research has repeatedly shown that surprisingly few data points — often just three or four — are enough to uniquely identify a person when combined.[1] A study by Latanya Sweeney demonstrated that 87% of the US population could be uniquely identified by just three attributes: ZIP code, date of birth, and gender.[2]

For developers, this matters because every time you display aggregated statistics, publish analytics, or share “anonymised” datasets, you’re making an implicit claim that individuals can’t be identified. If that claim is wrong, the data is personal data under the GDPR and nDSG — with all the legal obligations that entails.

In plain terms

Re-identification risk is like a jigsaw puzzle. Removing someone’s name from a dataset is like removing one piece. But if enough other pieces remain — location, age, behaviour patterns — someone can still see the full picture. True anonymisation means removing enough pieces that the picture can never be reconstructed.


How does it work?

Why “removing names” is not enough

Identifiers come in two forms:

| Type | Examples | Risk |
| --- | --- | --- |
| Direct identifiers | Name, email, phone, SSN, photo | Obvious — usually removed first |
| Indirect identifiers (quasi-identifiers) | Age, ZIP code, gender, job title, purchase history, timestamp, IP address | Subtle — often left in, but combinations uniquely identify people |

The risk lies in the combination of indirect identifiers. Each one alone is shared by many people. Together, they narrow down to one.

Think of it like...

Knowing someone lives in Switzerland tells you little. Knowing they live in a small village, are 34, and teach violin — that’s probably one person. Each additional attribute narrows the crowd until only one person remains.
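This narrowing effect is easy to demonstrate. A minimal sketch with a made-up four-row dataset (all villages, ages, and occupations here are hypothetical): each quasi-identifier alone is shared by several records, but the full combination singles every record out.

```python
from collections import Counter

# Hypothetical toy dataset: direct identifiers already removed,
# only quasi-identifiers remain.
rows = [
    {"village": "Chardonne", "age": 34, "occupation": "violin teacher"},
    {"village": "Chardonne", "age": 34, "occupation": "nurse"},
    {"village": "Chardonne", "age": 51, "occupation": "violin teacher"},
    {"village": "Lausanne",  "age": 34, "occupation": "violin teacher"},
]

def unique_combinations(rows, keys):
    """Quasi-identifier combinations that match exactly one record."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    return [combo for combo, n in counts.items() if n == 1]

# One attribute alone: most values are shared by several records.
print(len(unique_combinations(rows, ["village"])))                       # 1
# All three combined: every record is unique -- re-identifiable.
print(len(unique_combinations(rows, ["village", "age", "occupation"])))  # 4
```

The same check scales to real datasets: count how many records each quasi-identifier combination matches, and treat any singleton as a re-identification risk.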

The aggregation trap

Developers often assume that showing aggregated counts is safe: “23 people in your area care about this topic.” But small counts + narrow geography + specific topic can identify individuals:[3]

| Aggregation | Risk level |
| --- | --- |
| “450 people in Vaud…” | Low — large population, broad area |
| “23 people in Lausanne…” | Medium — smaller group, city level |
| “3 people in Chardonne…” | High — very small group, specific village |
| “1 person in Chardonne interested in nuclear waste…” | Critical — that’s an identified individual |

Developer rule of thumb

Never display counts below a threshold (commonly 10). Aggregate at a geographic level broad enough that individuals cannot be inferred. If count < threshold, suppress or generalise.
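The threshold rule fits in a small helper. A minimal sketch under assumed policy choices — the `K_THRESHOLD` value and the “fewer than” wording are illustrative, not prescribed by any regulation:

```python
# Assumed policy value: pick the actual threshold in your DPIA.
K_THRESHOLD = 10

def safe_count(count: int, threshold: int = K_THRESHOLD) -> str:
    """Render an aggregate count, suppressing exact values below the threshold."""
    if count <= 0:
        return "no data"
    if count < threshold:
        # Suppress: never reveal the exact small count.
        return f"fewer than {threshold}"
    return str(count)

print(safe_count(450))  # prints "450"
print(safe_count(3))    # prints "fewer than 10"
```

Putting the rule in one shared function makes it much harder for a new dashboard or endpoint to forget the threshold.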

The three levels of de-identification

1. Pseudonymisation

Replace direct identifiers with tokens or codes. The link between token and identity exists but is stored separately.

  • Still personal data under GDPR/nDSG — the link can be reversed
  • Useful for internal processing (analytics, testing)
  • Not sufficient for publishing or sharing
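One common implementation of pseudonymisation is a keyed hash (HMAC): tokens stay stable so joins and analytics still work, but nobody can recompute them without the key. A minimal sketch — the key value and storage advice here are assumptions, not a prescribed design:

```python
import hashlib
import hmac

# Assumption: in production the key lives in a separate, access-controlled
# store (e.g. a KMS) -- never alongside the pseudonymised dataset.
SECRET_KEY = b"load-me-from-a-key-management-service"

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a stable keyed token (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

record = {"email": "anna@example.com", "age": 34, "zip": "1803"}
token_record = {**record, "email": pseudonymise(record["email"])}

# Same input -> same token, so joins still work. But anyone holding the key
# can rebuild the mapping -- which is exactly why this is still personal data.
```

Note that the quasi-identifiers (`age`, `zip`) survive untouched, so this record may still be re-identifiable even without the key.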

2. Anonymisation

Irreversibly remove all direct and indirect identifiers so that the data subject can no longer be identified, even by the data controller.

  • No longer personal data — falls outside GDPR/nDSG scope
  • Extremely difficult to achieve in practice
  • Risk: future datasets may enable re-identification of “anonymous” data

3. Differential privacy

Add calibrated statistical noise to data or query results so that the presence or absence of any individual’s data cannot be detected.[4]

  • Mathematically provable privacy guarantee
  • Used by Apple (device analytics), Google (Chrome), US Census Bureau
  • Trade-off: more noise = more privacy but less data utility
  • Best suited for aggregate statistics, not individual records
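For a counting query, the standard construction is the Laplace mechanism: a count has sensitivity 1 (adding or removing one person changes it by at most 1), so noise with scale 1/ε gives ε-differential privacy. A minimal sketch, sampling Laplace noise as the difference of two exponential draws:

```python
import random

def dp_count(true_count: int, epsilon: float) -> float:
    """Publish a count with Laplace noise of scale 1/epsilon (sensitivity 1)."""
    scale = 1.0 / epsilon
    # Laplace(0, scale) sampled as the difference of two exponential draws.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_count + noise

random.seed(0)  # only so this sketch is reproducible
print(dp_count(23, epsilon=0.5))   # a noisy version of 23
print(dp_count(23, epsilon=0.05))  # smaller epsilon -> much noisier
```

For real deployments, prefer a vetted differential-privacy library over hand-rolled noise: naive floating-point Laplace sampling has known subtleties that can leak information.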

Concept to explore

Differential privacy is a deep technical topic with its own mathematical foundations. See differential-privacy for a dedicated exploration of epsilon-delta guarantees and practical implementation.

Practical mitigation strategies

| Strategy | What it does | When to use |
| --- | --- | --- |
| Suppression | Remove records where count < k | Publishing any statistics |
| Generalisation | Replace specific values with ranges (age 34 → 30–39) | Sharing datasets |
| K-anonymity | Ensure every record matches at least k-1 others | Dataset release |
| L-diversity | Ensure sensitive attributes have l distinct values per group | Datasets with sensitive columns |
| Differential privacy | Add calibrated noise to results | Aggregate queries, analytics |
| Minimum thresholds | Don’t display counts below n (typically 10) | Any user-facing aggregation |
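Two of these strategies compose naturally: generalise quasi-identifiers until every combination occurs at least k times, then verify k-anonymity. A minimal sketch with hypothetical rows — the 10-year age band is an illustrative choice, not a standard:

```python
from collections import Counter

def generalise_age(age: int) -> str:
    """Generalisation: replace an exact age with a 10-year band."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def is_k_anonymous(rows, quasi_ids, k):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return all(n >= k for n in counts.values())

rows = [
    {"zip": "1803", "age": 34},
    {"zip": "1803", "age": 37},
    {"zip": "1803", "age": 31},
]

# Exact ages: three distinct combinations, so not even 2-anonymous.
assert not is_k_anonymous(rows, ["zip", "age"], k=2)

# After banding, all rows share ("1803", "30-39"): 3-anonymous.
banded = [{**r, "age": generalise_age(r["age"])} for r in rows]
assert is_k_anonymous(banded, ["zip", "age"], k=3)
```

In practice you would also widen the geographic field (ZIP prefix instead of full ZIP) when age banding alone does not reach the target k.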

Why do we use it?

Key reasons

1. Legal obligation. If data can identify someone — even indirectly — it’s personal data under GDPR/nDSG. Treating re-identifiable data as anonymous exposes you to regulatory enforcement.

2. Ethical duty. Re-identification can cause real harm: outing someone’s political views, health conditions, or location to people who should not know.

3. Trust preservation. If users discover that your “anonymous” feature actually exposed them, trust is destroyed and may never recover.


When do we use it?

  • When displaying aggregated statistics to users (“X people near you…”)
  • When publishing datasets for research or transparency
  • When designing analytics dashboards with geographic or demographic breakdowns
  • When sharing data with third parties or partners
  • When building recommendation systems that use behavioural data
  • When conducting a DPIA — re-identification risk is a core assessment dimension

Rule of thumb

Ask yourself: “If I were a motivated adversary with access to public records, could I figure out who contributed to this aggregate?” If the answer is even “possibly”, you haven’t anonymised enough.


How can I think about it?

The masked ball analogy

At a masked ball, everyone wears a mask (pseudonymisation). But if you know someone is 1.95m tall, has red hair, and speaks with a Scottish accent — the mask doesn’t help. Re-identification works the same way: removing the “name tag” doesn’t help if other attributes make someone unique.

True anonymisation means everyone at the ball looks, sounds, and acts identically — which is why it’s so hard to achieve while keeping the data useful. Differential privacy takes a different approach: it lets people attend the ball but randomly swaps some of their visible attributes, so you can never be sure who you’re looking at.

The pixelated photo analogy

Blurring a face in a photo is like pseudonymisation — the face is harder to recognise but not impossible, especially if you know what the person looks like. Low pixelation can often be reversed with AI.

True anonymisation would require pixelating so heavily that the photo is unrecognisable — but then it’s useless as a photo. Differential privacy adds random visual noise across the entire image: the overall scene is recognisable, but no individual face can be extracted.

  • Light blur = pseudonymisation (reversible)
  • Heavy pixelation = anonymisation (irreversible, low utility)
  • Noise overlay = differential privacy (balanced)

Concepts to explore next

| Concept | What it covers | Status |
| --- | --- | --- |
| privacy-by-design | Architectural approach that prevents re-identification risks | complete |
| data-protection-impact-assessment | Formal process to assess re-identification risk | complete |
| personal-data-protection | The legal framework that applies when re-identification is possible | complete |

Some cards don't exist yet

A broken link is a placeholder for future learning, not an error.


Where this concept fits

Position in the knowledge graph

```mermaid
graph TD
    A[Data Governance] --> B[Re-identification Risk]
    A --> C[Privacy by Design]
    A --> D[DPIA]
    B --> E[Differential Privacy]
    B --> F[K-anonymity]
    B --> G[Pseudonymisation Techniques]
    style B fill:#4a9ede,color:#fff
```

Footnotes

  1. Testing Branch. (2026). Re-Identification vs Anonymization Strength. Testing Branch.

  2. Sweeney, L. (2000). Simple Demographics Often Identify People Uniquely. Carnegie Mellon University, Data Privacy Working Paper 3.

  3. AI Safety Directory. (2026). Differential Privacy for AI: Protecting Training Data. AI Safety Directory.

  4. Stealth Cloud Intelligence. (2026). Differential Privacy: How Apple and Google Add Noise to Protect You. Stealth Cloud.