Re-identification Risk
The danger that data presented as anonymous or aggregated can be linked back to a specific individual — either on its own or when combined with other available information.
What is it?
Re-identification is the process of taking data that was supposed to be anonymous and figuring out who it belongs to. It happens more often than most developers expect, because true anonymisation is much harder than it looks.
The core problem: removing obvious identifiers (name, email, phone number) is not enough. Research has repeatedly shown that surprisingly few data points — often just three or four — are enough to uniquely identify a person when combined.1 A study by Latanya Sweeney demonstrated that 87% of the US population could be uniquely identified by just three attributes: ZIP code, date of birth, and gender.2
For developers, this matters because every time you display aggregated statistics, publish analytics, or share “anonymised” datasets, you’re making an implicit claim that individuals can’t be identified. If that claim is wrong, the data is personal data under the GDPR and nDSG — with all the legal obligations that entails.
In plain terms
Re-identification risk is like a jigsaw puzzle. Removing someone’s name from a dataset is like removing one piece. But if enough other pieces remain — location, age, behaviour patterns — someone can still see the full picture. True anonymisation means removing enough pieces that the picture can never be reconstructed.
At a glance
The re-identification spectrum (click to expand)
```mermaid
graph LR
    A[Identified Data] -->|Remove direct identifiers| B[Pseudonymised Data]
    B -->|Remove indirect identifiers| C[Anonymised Data]
    C -->|Add noise| D[Differentially Private Data]
    A -.- E["Full name, email, SSN"]
    B -.- F["ID replaced by token, but linkable"]
    C -.- G["Truly unlinkable — no longer personal data"]
    D -.- H["Mathematically guaranteed privacy"]
    style C fill:#4a9ede,color:#fff
    style D fill:#4a9ede,color:#fff
```
Key: Only data at the anonymised or differentially private levels falls outside data protection law. Pseudonymised data is still personal data — it can be re-linked.
How does it work?
Why “removing names” is not enough
Identifiers come in two forms:
| Type | Examples | Risk |
|---|---|---|
| Direct identifiers | Name, email, phone, SSN, photo | Obvious — usually removed first |
| Indirect identifiers (quasi-identifiers) | Age, ZIP code, gender, job title, purchase history, timestamp, IP address | Subtle — often left in, but combinations uniquely identify people |
The risk lies in the combination of indirect identifiers. Each one alone is shared by many people. Together, they narrow down to one.
Think of it like...
Knowing someone lives in Switzerland tells you little. Knowing they live in a small village, are 34, and teach violin — that’s probably one person. Each additional attribute narrows the crowd until only one person remains.
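This narrowing effect is easy to demonstrate. The sketch below uses invented records and attribute names: none of the values is a "name", yet every combination of quasi-identifiers is unique.

```python
from collections import Counter

# Hypothetical quasi-identifier tuples: (age, village, profession)
records = [
    (34, "Chardonne", "violin teacher"),
    (34, "Lausanne", "violin teacher"),
    (51, "Chardonne", "violin teacher"),
    (34, "Chardonne", "software engineer"),
]

# Count how many records share each full combination of attributes
group_sizes = Counter(records)

# A combination held by exactly one record uniquely identifies that person
unique_combos = [combo for combo, n in group_sizes.items() if n == 1]
print(f"{len(unique_combos)} of {len(records)} records are uniquely identifiable")
# -> 4 of 4 records are uniquely identifiable
```

Each attribute on its own is shared by many people; it is the full combination that becomes a fingerprint.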
The aggregation trap
Developers often assume that showing aggregated counts is safe: “23 people in your area care about this topic.” But small counts + narrow geography + specific topic can identify individuals:3
| Aggregation | Risk level |
|---|---|
| “450 people in Vaud…” | Low — large population, broad area |
| “23 people in Lausanne…” | Medium — smaller group, city level |
| “3 people in Chardonne…” | High — very small group, specific village |
| “1 person in Chardonne interested in nuclear waste…” | Critical — that’s an identified individual |
Developer rule of thumb
Never display counts below a threshold (commonly 10). Aggregate at a geographic level broad enough that individuals cannot be inferred. If count < threshold, suppress or generalise.
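A minimal sketch of this rule, assuming a threshold of 10 (the function name and the choice to generalise rather than hide entirely are illustrative — set both per your own risk assessment):

```python
THRESHOLD = 10  # assumed minimum display count

def safe_count(count: int, threshold: int = THRESHOLD) -> str:
    """Return a display string that never reveals a small exact count."""
    if count >= threshold:
        return str(count)
    if count > 0:
        return f"fewer than {threshold}"  # generalise rather than show the exact value
    return "0"

print(safe_count(23))  # -> 23
print(safe_count(3))   # -> fewer than 10
```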
The three levels of de-identification
1. Pseudonymisation
Replace direct identifiers with tokens or codes. The link between token and identity exists but is stored separately.
- Still personal data under GDPR/nDSG — the link can be reversed
- Useful for internal processing (analytics, testing)
- Not sufficient for publishing or sharing
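The defining feature of pseudonymisation is that the link survives, just stored elsewhere. A toy sketch (the in-memory dict stands in for what would be a separate, access-controlled store):

```python
import secrets

# The token table must live in a separate, access-controlled store.
# As long as it exists, the data is linkable -- and therefore personal data.
token_table: dict[str, str] = {}

def pseudonymise(email: str) -> str:
    """Replace a direct identifier with a random token, keeping the
    reverse mapping separately so authorised processes can re-link."""
    token = secrets.token_hex(8)
    token_table[token] = email
    return token

token = pseudonymise("alice@example.org")
# The record itself carries no email...
print(token)
# ...but the link still exists, which is why GDPR/nDSG still applies
print(token_table[token])  # -> alice@example.org
```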
2. Anonymisation
Irreversibly remove all direct and indirect identifiers so that the data subject can no longer be identified, even by the data controller.
- No longer personal data — falls outside GDPR/nDSG scope
- Extremely difficult to achieve in practice
- Risk: future datasets may enable re-identification of “anonymous” data
3. Differential privacy
Add calibrated statistical noise to data or query results so that the presence or absence of any individual’s data cannot be detected.4
- Mathematically provable privacy guarantee
- Used by Apple (device analytics), Google (Chrome), US Census Bureau
- Trade-off: more noise = more privacy but less data utility
- Best suited for aggregate statistics, not individual records
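A counting query has sensitivity 1, so adding Laplace noise of scale 1/ε to the result satisfies ε-differential privacy. A stdlib-only sketch (sampling the Laplace distribution as the difference of two exponentials; real systems would use a vetted DP library):

```python
import random

def dp_count(true_count: int, epsilon: float = 1.0) -> int:
    """Counting query with Laplace(0, 1/epsilon) noise added.
    A count has sensitivity 1, so this satisfies epsilon-DP."""
    # The difference of two Exponential(epsilon) draws is Laplace(0, 1/epsilon)
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return max(0, round(true_count + noise))

# Smaller epsilon -> more noise -> stronger privacy, lower utility
print(dp_count(1000, epsilon=1.0))   # close to 1000
print(dp_count(1000, epsilon=0.01))  # may be far off
```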
For example: a community matching feature
You’re building a feature that shows “X others in your area share this interest” to encourage collective action:
Without re-identification awareness:
- Show exact count per commune per topic
- Result: “1 person in Chardonne cares about nuclear waste”
- That person is now identified
With re-identification awareness:
- Set minimum display threshold: 10
- Aggregate at canton or district level, not commune
- Add noise to counts (differential privacy)
- Never combine geography + topic + time too narrowly
- Conduct a DPIA before launch
Concept to explore
Differential privacy is a deep technical topic with its own mathematical foundations. See differential-privacy for a dedicated exploration of epsilon-delta guarantees and practical implementation.
Practical mitigation strategies
| Strategy | What it does | When to use |
|---|---|---|
| Suppression | Remove records where count < k | Publishing any statistics |
| Generalisation | Replace specific values with ranges (age 34 → 30-39) | Sharing datasets |
| K-anonymity | Ensure every record matches at least k-1 others | Dataset release |
| L-diversity | Ensure sensitive attributes have l distinct values per group | Datasets with sensitive columns |
| Differential privacy | Add calibrated noise to results | Aggregate queries, analytics |
| Minimum thresholds | Don’t display counts below n (typically 10) | Any user-facing aggregation |
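Suppression, generalisation, and k-anonymity fit together: generalise quasi-identifiers until every record blends into a group of at least k, then suppress whatever still stands alone. A minimal check, with invented records:

```python
from collections import Counter

def is_k_anonymous(records: list[tuple], k: int) -> bool:
    """True if every quasi-identifier combination occurs at least k times."""
    return all(n >= k for n in Counter(records).values())

# Raw (age, commune) pairs: each row is unique -> not even 2-anonymous
raw = [(34, "Chardonne"), (35, "Chardonne"), (34, "Lausanne")]
print(is_k_anonymous(raw, k=2))  # -> False

# After generalising age to a band and commune to the canton
generalised = [("30-39", "Vaud"), ("30-39", "Vaud"), ("30-39", "Vaud")]
print(is_k_anonymous(generalised, k=3))  # -> True
```

K-anonymity alone does not protect sensitive attributes within a group — that is what l-diversity adds.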
Why do we use it?
Key reasons
1. Legal obligation. If data can identify someone — even indirectly — it’s personal data under GDPR/nDSG. Treating re-identifiable data as anonymous exposes you to regulatory enforcement.
2. Ethical duty. Re-identification can cause real harm: outing someone’s political views, health conditions, or location to people who should not know.
3. Trust preservation. If users discover that your “anonymous” feature actually exposed them, trust is destroyed and may never recover.
When do we use it?
- When displaying aggregated statistics to users (“X people near you…”)
- When publishing datasets for research or transparency
- When designing analytics dashboards with geographic or demographic breakdowns
- When sharing data with third parties or partners
- When building recommendation systems that use behavioural data
- When conducting a DPIA — re-identification risk is a core assessment dimension
Rule of thumb
Ask yourself: “If I were a motivated adversary with access to public records, could I figure out who contributed to this aggregate?” If the answer is even “possibly”, you haven’t anonymised enough.
How can I think about it?
The masked ball analogy
At a masked ball, everyone wears a mask (pseudonymisation). But if you know someone is 1.95m tall, has red hair, and speaks with a Scottish accent — the mask doesn’t help. Re-identification works the same way: removing the “name tag” doesn’t help if other attributes make someone unique.
True anonymisation means everyone at the ball looks, sounds, and acts identically — which is why it’s so hard to achieve while keeping the data useful. Differential privacy takes a different approach: it lets people attend the ball but randomly swaps some of their visible attributes, so you can never be sure who you’re looking at.
The pixelated photo analogy
Blurring a face in a photo is like pseudonymisation — the face is harder to recognise but not impossible, especially if you know what the person looks like. Low pixelation can often be reversed with AI.
True anonymisation would require pixelating so heavily that the photo is unrecognisable — but then it’s useless as a photo. Differential privacy adds random visual noise across the entire image: the overall scene is recognisable, but no individual face can be extracted.
- Light blur = pseudonymisation (reversible)
- Heavy pixelation = anonymisation (irreversible, low utility)
- Noise overlay = differential privacy (balanced)
Concepts to explore next
| Concept | What it covers | Status |
|---|---|---|
| privacy-by-design | Architectural approach that prevents re-identification risks | complete |
| data-protection-impact-assessment | Formal process to assess re-identification risk | complete |
| personal-data-protection | The legal framework that applies when re-identification is possible | complete |
Some cards don't exist yet
A broken link is a placeholder for future learning, not an error.
Check your understanding
Test yourself (click to expand)
- Explain — Why is pseudonymised data still considered personal data under the GDPR, while truly anonymised data is not?
- Name — What three attributes did Sweeney show could uniquely identify 87% of the US population?
- Distinguish — What is the difference between k-anonymity and differential privacy as re-identification mitigations?
- Interpret — Your feature shows “5 people in Morges are interested in water quality.” Is this safe? What changes would you make?
- Connect — How does re-identification risk influence the architectural decisions you make when designing a database schema for user analytics?
Where this concept fits
Position in the knowledge graph
```mermaid
graph TD
    A[Data Governance] --> B[Re-identification Risk]
    A --> C[Privacy by Design]
    A --> D[DPIA]
    B --> E[Differential Privacy]
    B --> F[K-anonymity]
    B --> G[Pseudonymisation Techniques]
    style B fill:#4a9ede,color:#fff
```
Related concepts:
- privacy-by-design — PbD architecture should address re-identification from the start
- data-protection-impact-assessment — the DPIA formally evaluates re-identification risk
- personal-data-protection — re-identifiable data triggers full data protection obligations
Sources
Further reading
Resources
- Re-Identification vs Anonymization Strength — Interactive exploration of how k-anonymity affects re-identification success
- Differential Privacy for AI: Protecting Training Data — Comprehensive guide to differential privacy in AI contexts
- A Comprehensive Guide to Differential Privacy — Academic overview from theory to user expectations
- Differential Privacy: How Apple and Google Add Noise to Protect You — Accessible explanation of how major companies implement DP
Footnotes
1. Testing Branch. (2026). Re-Identification vs Anonymization Strength. Testing Branch. ↩
2. Sweeney, L. (2000). Simple Demographics Often Identify People Uniquely. Carnegie Mellon University, Data Privacy Working Paper 3. ↩
3. AI Safety Directory. (2026). Differential Privacy for AI: Protecting Training Data. AI Safety Directory. ↩
4. Stealth Cloud Intelligence. (2026). Differential Privacy: How Apple and Google Add Noise to Protect You. Stealth Cloud. ↩