Re-identification Risk
The danger that data presented as anonymous or aggregated can be linked back to a specific individual — either on its own or when combined with other available information.
What is it?
Re-identification is the process of taking data that was supposed to be anonymous and figuring out who it belongs to. It happens more often than most developers expect, because true anonymisation is much harder than it looks.
The core problem: removing obvious identifiers (name, email, phone number) is not enough. Research has repeatedly shown that surprisingly few data points — often just three or four — are enough to uniquely identify a person when combined.1 A study by Latanya Sweeney demonstrated that 87% of the US population could be uniquely identified by just three attributes: ZIP code, date of birth, and gender.2
For developers, this matters because every time you display aggregated statistics, publish analytics, or share “anonymised” datasets, you’re making an implicit claim that individuals can’t be identified. If that claim is wrong, the data is personal data under the GDPR and nDSG — with all the legal obligations that entails.
In plain terms
Re-identification risk is like a jigsaw puzzle. Removing someone’s name from a dataset is like removing one piece. But if enough other pieces remain — location, age, behaviour patterns — someone can still see the full picture. True anonymisation means removing enough pieces that the picture can never be reconstructed.
At a glance
The re-identification spectrum (click to expand)
```mermaid
graph LR
    A[Identified Data] -->|Remove direct identifiers| B[Pseudonymised Data]
    B -->|Remove indirect identifiers| C[Anonymised Data]
    C -->|Add noise| D[Differentially Private Data]
    A -.- E["Full name, email, SSN"]
    B -.- F["ID replaced by token, but linkable"]
    C -.- G["Truly unlinkable — no longer personal data"]
    D -.- H["Mathematically guaranteed privacy"]
    style C fill:#4a9ede,color:#fff
    style D fill:#4a9ede,color:#fff
```
Key: Only data at the anonymised or differentially private levels falls outside data protection law. Pseudonymised data is still personal data — it can be re-linked.
How does it work?
Why “removing names” is not enough
Identifiers come in two forms:
| Type | Examples | Risk |
|---|---|---|
| Direct identifiers | Name, email, phone, SSN, photo | Obvious — usually removed first |
| Indirect identifiers (quasi-identifiers) | Age, ZIP code, gender, job title, purchase history, timestamp, IP address | Subtle — often left in, but combinations uniquely identify people |
The risk lies in the combination of indirect identifiers. Each one alone is shared by many people. Together, they narrow down to one.
Think of it like...
Knowing someone lives in Switzerland tells you little. Knowing they live in a small village, are 34, and teach violin — that’s probably one person. Each additional attribute narrows the crowd until only one person remains.
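This narrowing effect is easy to demonstrate. The sketch below uses invented records and attribute names: none of the values is a "name", yet every combination of quasi-identifiers is unique.

```python
from collections import Counter

# Hypothetical quasi-identifier tuples: (age, village, profession)
records = [
    (34, "Chardonne", "violin teacher"),
    (34, "Lausanne", "violin teacher"),
    (51, "Chardonne", "violin teacher"),
    (34, "Chardonne", "software engineer"),
]

# Count how many records share each full combination of attributes
group_sizes = Counter(records)

# A combination held by exactly one record uniquely identifies that person
unique_combos = [combo for combo, n in group_sizes.items() if n == 1]
print(f"{len(unique_combos)} of {len(records)} records are uniquely identifiable")
# -> 4 of 4 records are uniquely identifiable
```

Each attribute on its own is shared by many people; it is the full combination that becomes a fingerprint.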
The aggregation trap
Developers often assume that showing aggregated counts is safe: “23 people in your area care about this topic.” But small counts + narrow geography + specific topic can identify individuals:3
| Aggregation | Risk level |
|---|---|
| “450 people in Vaud…” | Low — large population, broad area |
| “23 people in Lausanne…” | Medium — smaller group, city level |
| “3 people in Chardonne…” | High — very small group, specific village |
| “1 person in Chardonne interested in nuclear waste…” | Critical — that’s an identified individual |
Developer rule of thumb
Never display counts below a threshold (commonly 10). Aggregate at a geographic level broad enough that individuals cannot be inferred. If count < threshold, suppress or generalise.
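A minimal sketch of this rule, assuming a threshold of 10 (the function name and the choice to generalise rather than hide entirely are illustrative — set both per your own risk assessment):

```python
THRESHOLD = 10  # assumed minimum display count

def safe_count(count: int, threshold: int = THRESHOLD) -> str:
    """Return a display string that never reveals a small exact count."""
    if count >= threshold:
        return str(count)
    if count > 0:
        return f"fewer than {threshold}"  # generalise rather than show the exact value
    return "0"

print(safe_count(23))  # -> 23
print(safe_count(3))   # -> fewer than 10
```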
The three levels of de-identification
1. Pseudonymisation
Replace direct identifiers with tokens or codes. The link between token and identity exists but is stored separately.
- Still personal data under GDPR/nDSG — the link can be reversed
- Useful for internal processing (analytics, testing)
- Not sufficient for publishing or sharing
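The defining feature of pseudonymisation is that the link survives, just stored elsewhere. A toy sketch (the in-memory dict stands in for what would be a separate, access-controlled store):

```python
import secrets

# The token table must live in a separate, access-controlled store.
# As long as it exists, the data is linkable -- and therefore personal data.
token_table: dict[str, str] = {}

def pseudonymise(email: str) -> str:
    """Replace a direct identifier with a random token, keeping the
    reverse mapping separately so authorised processes can re-link."""
    token = secrets.token_hex(8)
    token_table[token] = email
    return token

token = pseudonymise("alice@example.org")
# The record itself carries no email...
print(token)
# ...but the link still exists, which is why GDPR/nDSG still applies
print(token_table[token])  # -> alice@example.org
```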
2. Anonymisation
Irreversibly remove all direct and indirect identifiers so that the data subject can no longer be identified, even by the data controller.
- No longer personal data — falls outside GDPR/nDSG scope
- Extremely difficult to achieve in practice
- Risk: future datasets may enable re-identification of “anonymous” data
3. Differential privacy
Add calibrated statistical noise to data or query results so that the presence or absence of any individual’s data cannot be detected.4
- Mathematically provable privacy guarantee
- Used by Apple (device analytics), Google (Chrome), US Census Bureau
- Trade-off: more noise = more privacy but less data utility
- Best suited for aggregate statistics, not individual records
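A counting query has sensitivity 1, so adding Laplace noise of scale 1/ε to the result satisfies ε-differential privacy. A stdlib-only sketch (sampling the Laplace distribution as the difference of two exponentials; real systems would use a vetted DP library):

```python
import random

def dp_count(true_count: int, epsilon: float = 1.0) -> int:
    """Counting query with Laplace(0, 1/epsilon) noise added.
    A count has sensitivity 1, so this satisfies epsilon-DP."""
    # The difference of two Exponential(epsilon) draws is Laplace(0, 1/epsilon)
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return max(0, round(true_count + noise))

# Smaller epsilon -> more noise -> stronger privacy, lower utility
print(dp_count(1000, epsilon=1.0))   # close to 1000
print(dp_count(1000, epsilon=0.01))  # may be far off
```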
For example: a community matching feature
You’re building a feature that shows “X others in your area share this interest” to encourage collective action:
Without re-identification awareness:
- Show exact count per commune per topic
- Result: “1 person in Chardonne cares about nuclear waste”
- That person is now identified
With re-identification awareness:
- Set minimum display threshold: 10
- Aggregate at canton or district level, not commune
- Add noise to counts (differential privacy)
- Never combine geography + topic + time too narrowly
- Conduct a DPIA before launch
Concept to explore
Differential privacy is a deep technical topic with its own mathematical foundations. See differential-privacy for a dedicated exploration of epsilon-delta guarantees and practical implementation.
Practical mitigation strategies
| Strategy | What it does | When to use |
|---|---|---|
| Suppression | Remove records where count < k | Publishing any statistics |
| Generalisation | Replace specific values with ranges (age 34 → 30-39) | Sharing datasets |
| K-anonymity | Ensure every record matches at least k-1 others | Dataset release |
| L-diversity | Ensure sensitive attributes have l distinct values per group | Datasets with sensitive columns |
| Differential privacy | Add calibrated noise to results | Aggregate queries, analytics |
| Minimum thresholds | Don’t display counts below n (typically 10) | Any user-facing aggregation |
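Suppression, generalisation, and k-anonymity fit together: generalise quasi-identifiers until every record blends into a group of at least k, then suppress whatever still stands alone. A minimal check, with invented records:

```python
from collections import Counter

def is_k_anonymous(records: list[tuple], k: int) -> bool:
    """True if every quasi-identifier combination occurs at least k times."""
    return all(n >= k for n in Counter(records).values())

# Raw (age, commune) pairs: each row is unique -> not even 2-anonymous
raw = [(34, "Chardonne"), (35, "Chardonne"), (34, "Lausanne")]
print(is_k_anonymous(raw, k=2))  # -> False

# After generalising age to a band and commune to the canton
generalised = [("30-39", "Vaud"), ("30-39", "Vaud"), ("30-39", "Vaud")]
print(is_k_anonymous(generalised, k=3))  # -> True
```

K-anonymity alone does not protect sensitive attributes within a group — that is what l-diversity adds.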
Why do we use it?
Key reasons
1. Legal obligation. If data can identify someone — even indirectly — it’s personal data under GDPR/nDSG. Treating re-identifiable data as anonymous exposes you to regulatory enforcement.
2. Ethical duty. Re-identification can cause real harm: outing someone’s political views, health conditions, or location to people who should not know.
3. Trust preservation. If users discover that your “anonymous” feature actually exposed them, trust is destroyed and may never recover.
When do we use it?
- When displaying aggregated statistics to users (“X people near you…”)
- When publishing datasets for research or transparency
- When designing analytics dashboards with geographic or demographic breakdowns
- When sharing data with third parties or partners
- When building recommendation systems that use behavioural data
- When conducting a DPIA — re-identification risk is a core assessment dimension
Rule of thumb
Ask yourself: “If I were a motivated adversary with access to public records, could I figure out who contributed to this aggregate?” If the answer is even “possibly”, you haven’t anonymised enough.
How can I think about it?
The masked ball analogy
At a masked ball, everyone wears a mask (pseudonymisation). But if you know someone is 1.95m tall, has red hair, and speaks with a Scottish accent — the mask doesn’t help. Re-identification works the same way: removing the “name tag” doesn’t help if other attributes make someone unique.
True anonymisation means everyone at the ball looks, sounds, and acts identically — which is why it’s so hard to achieve while keeping the data useful. Differential privacy takes a different approach: it lets people attend the ball but randomly swaps some of their visible attributes, so you can never be sure who you’re looking at.
The pixelated photo analogy
Blurring a face in a photo is like pseudonymisation — the face is harder to recognise but not impossible, especially if you know what the person looks like. Low pixelation can often be reversed with AI.
True anonymisation would require pixelating so heavily that the photo is unrecognisable — but then it’s useless as a photo. Differential privacy adds random visual noise across the entire image: the overall scene is recognisable, but no individual face can be extracted.
- Light blur = pseudonymisation (reversible)
- Heavy pixelation = anonymisation (irreversible, low utility)
- Noise overlay = differential privacy (balanced)
Concepts to explore next
| Concept | What it covers | Status |
|---|---|---|
| privacy-by-design | Architectural approach that prevents re-identification risks | complete |
| data-protection-impact-assessment | Formal process to assess re-identification risk | complete |
| personal-data-protection | The legal framework that applies when re-identification is possible | complete |
Some cards don't exist yet
A broken link is a placeholder for future learning, not an error.
Check your understanding
Test yourself (click to expand)
- Explain — Why is pseudonymised data still considered personal data under the GDPR, while truly anonymised data is not?
- Name — What three attributes did Sweeney show could uniquely identify 87% of the US population?
- Distinguish — What is the difference between k-anonymity and differential privacy as re-identification mitigations?
- Interpret — Your feature shows “5 people in Morges are interested in water quality.” Is this safe? What changes would you make?
- Connect — How does re-identification risk influence the architectural decisions you make when designing a database schema for user analytics?
Where this concept fits
Position in the knowledge graph
```mermaid
graph TD
    A[Data Governance] --> B[Re-identification Risk]
    A --> C[Privacy by Design]
    A --> D[DPIA]
    B --> E[Differential Privacy]
    B --> F[K-anonymity]
    B --> G[Pseudonymisation Techniques]
    style B fill:#4a9ede,color:#fff
```
Related concepts:
- privacy-by-design — PbD architecture should address re-identification from the start
- data-protection-impact-assessment — the DPIA formally evaluates re-identification risk
- personal-data-protection — re-identifiable data triggers full data protection obligations
Sources
Further reading
Resources
- Re-Identification vs Anonymization Strength — Interactive exploration of how k-anonymity affects re-identification success
- Differential Privacy for AI: Protecting Training Data — Comprehensive guide to differential privacy in AI contexts
- A Comprehensive Guide to Differential Privacy — Academic overview from theory to user expectations
- Differential Privacy: How Apple and Google Add Noise to Protect You — Accessible explanation of how major companies implement DP
Footnotes
1. Testing Branch. (2026). Re-Identification vs Anonymization Strength. Testing Branch. ↩
2. Sweeney, L. (2000). Simple Demographics Often Identify People Uniquely. Carnegie Mellon University, Data Privacy Working Paper 3. ↩
3. AI Safety Directory. (2026). Differential Privacy for AI: Protecting Training Data. AI Safety Directory. ↩
4. Stealth Cloud Intelligence. (2026). Differential Privacy: How Apple and Google Add Noise to Protect You. Stealth Cloud. ↩