Data Provenance

The practice of tracking where data comes from, what rights and conditions are attached to it, and what you are legally and ethically permitted to do with it — from its origin through every transformation to its final use.

What is it?

Data provenance answers three deceptively simple questions about every piece of data in your system: Where did it come from? What are the conditions of use? And can I prove both?

In a world where applications routinely aggregate data from APIs, public datasets, web scraping, user input, and AI-generated content, provenance is the practice of maintaining a clear chain of custody. Without it, you might republish data you don’t have the right to use, violate licence terms you never read, or find yourself unable to explain where a contested piece of information came from.¹

Provenance is distinct from (but related to) data lineage, which focuses on how data transforms as it flows through your system. Provenance is about origin and rights. Lineage is about processing and transformation. A developer needs both: provenance tells you what you’re allowed to do, lineage tells you what you actually did.²

For developers building applications that consume third-party data — government open data, public APIs, scraped content, AI training sets — provenance is not optional. It’s the difference between a defensible application and a legal liability.

In plain terms

Data provenance is like the label on food packaging. It tells you where the ingredients came from (origin), whether they’re organic or contain allergens (conditions), and who is responsible for the product (accountability). Without the label, you’re serving food of unknown origin to your users.

At a glance

The provenance chain
graph LR
    A[Source] -->|Terms & Licence| B[Acquisition]
    B -->|Record keeping| C[Storage]
    C -->|Transformation log| D[Processing]
    D -->|Attribution & rights| E[Publication]
    style A fill:#4a9ede,color:#fff
    style E fill:#4a9ede,color:#fff
Key: At each stage, provenance metadata must be preserved: where the data came from, what terms apply, and what transformations occurred. The chain must be unbroken from source to publication.

How does it work?

The four dimensions of provenance

For every dataset or data source in your application, you need to know and document four things:

1. Origin — where does it come from?

Source type	Example	Provenance question
Official government data	parliament.ch, admin.ch	Is this published under OGD? What are the terms?
Public API	REST endpoint	What does the API’s Terms of Service say about reuse?
User-generated content	Form submissions	Which specific uses did the user consent to?
Scraped/crawled data	Web pages	robots.txt + ToS — does the site permit scraping?
AI-generated content	LLM output	Who owns the output? Is the model’s training data licensed?
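
For scraped sources, the robots.txt half of the question in the table can be checked mechanically. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `example-bot` user agent, and the `example.org` URLs are made up for illustration. A permissive robots.txt does not replace reading the site's Terms of Service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice you would fetch the real file
# from the site you intend to crawl.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed_public = may_fetch(
    EXAMPLE_ROBOTS_TXT, "example-bot", "https://example.org/members"
)
allowed_private = may_fetch(
    EXAMPLE_ROBOTS_TXT, "example-bot", "https://example.org/private/list"
)
```

Recording the result of this check, together with the date it was made, belongs in the provenance record for the scraped source.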

Think of it like...

A journalist attributing their sources. “A government official told Reuters” has provenance. “Someone said” does not. Your data needs the same rigour.

2. Rights — what am I allowed to do with it?

Not all data that is accessible is reusable. Public availability does not equal open licence.³

Access level	Reuse rights	Example
Open data (OGD, CC0, CC-BY)	Free to reuse, may require attribution	Swiss OGD datasets on opendata.swiss
Public but restricted	Accessible but reuse may be limited by ToS	parliament.ch contact data
API with terms	Reuse governed by API licence/ToS	Google Maps API
Copyrighted	No reuse without licence	News articles, proprietary databases
Personal data	Governed by data protection law	User profiles, contact details

Developer rule of thumb

Before importing any external data source, find and read the Terms of Use or licence. If there are no terms, assume you cannot reuse it. “It’s on the internet” is not a licence.
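
The rule of thumb can be expressed as a default-deny check at import time. This is only a sketch: the `AccessLevel` names mirror the table above, and the `NEXT_STEP` wording and the `reuse_next_step` function are hypothetical, not a legal determination.

```python
from enum import Enum
from typing import Optional


class AccessLevel(Enum):
    """Access levels from the table above (names are illustrative)."""
    OPEN = "open"
    PUBLIC_RESTRICTED = "public_restricted"
    API_TERMS = "api_terms"
    COPYRIGHTED = "copyrighted"
    PERSONAL = "personal"


# Hypothetical next step for each access level.
NEXT_STEP = {
    AccessLevel.OPEN: "reuse, adding attribution if the licence requires it",
    AccessLevel.PUBLIC_RESTRICTED: "read the ToS before any reuse",
    AccessLevel.API_TERMS: "follow the API licence or ToS",
    AccessLevel.COPYRIGHTED: "do not reuse without a licence",
    AccessLevel.PERSONAL: "check data protection law before processing",
}


def reuse_next_step(level: Optional[AccessLevel]) -> str:
    """Default-deny: an unknown access level (no terms found) blocks reuse."""
    if level is None:
        return "do not reuse: no terms found"
    return NEXT_STEP[level]
```

Passing `None` models the case where no terms could be found, which the rule of thumb treats as "cannot reuse".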

3. Metadata — what do I record?

For each data source, maintain a provenance record:

Source:      parliament.ch/members
Type:        Official government data
Licence:     OGD / CC-BY (opendata.swiss terms)
Accessed:    2026-04-04
Update freq: Daily refresh via API
Rights:      Reuse permitted with attribution
Restrictions: Personal emails excluded per nDSG
Contact:     opendata@bk.admin.ch
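
One way to keep such a record machine-readable is a small immutable data class. The sketch below assumes the field names shown (they simply mirror the record above and are not a standard schema) and serialises the record to JSON so it can be stored alongside the data.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class ProvenanceRecord:
    """One record per data source; field names are illustrative."""
    source: str
    source_type: str
    licence: str
    accessed: date
    update_frequency: str
    rights: str
    restrictions: str
    contact: str

    def to_json(self) -> str:
        """Serialise to JSON, writing the access date in ISO 8601 form."""
        payload = asdict(self)
        payload["accessed"] = self.accessed.isoformat()
        return json.dumps(payload, ensure_ascii=False, sort_keys=True)


record = ProvenanceRecord(
    source="parliament.ch/members",
    source_type="Official government data",
    licence="OGD / CC-BY (opendata.swiss terms)",
    accessed=date(2026, 4, 4),
    update_frequency="Daily refresh via API",
    rights="Reuse permitted with attribution",
    restrictions="Personal emails excluded per nDSG",
    contact="opendata@bk.admin.ch",
)
record_json = record.to_json()
```

Making the class frozen means a record cannot be edited in place; a changed licence or a new access date produces a new record, which keeps the history intact.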

Concept to explore

Data provenance becomes critical when your application uses AI to generate content based on external data. See ai-content-liability for how provenance intersects with AI-generated content attribution.

4. Chain of custody — can I trace it back?

If someone challenges the accuracy or legality of data in your application, you must be able to trace it back to its source. This requires:

Immutable source records — when and where you acquired the data
Transformation logs — what processing you applied
Version history — what the data looked like when you acquired it vs now
Attribution display — showing users where the data comes from
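
The first two requirements can be approximated with an append-only log in which each entry stores a hash of the previous one. This is a sketch under stated assumptions: the entry layout and the `append_event` and `verify_chain` helpers are hypothetical, and the result is tamper-evident rather than truly immutable, because a later edit is detected but not prevented.

```python
import hashlib
import json
from typing import Dict, List


def append_event(log: List[Dict[str, str]], event: Dict[str, str]) -> None:
    """Append an event, linking it to the previous entry by its hash."""
    prev_hash = log[-1]["hash"] if log else ""
    body = json.dumps(event, sort_keys=True)
    digest = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
    log.append({"prev": prev_hash, "event": body, "hash": digest})


def verify_chain(log: List[Dict[str, str]]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = ""
    for entry in log:
        expected = hashlib.sha256(
            (prev_hash + entry["event"]).encode("utf-8")
        ).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


custody_log: List[Dict[str, str]] = []
append_event(custody_log, {
    "step": "acquired",
    "source": "parliament.ch/members",
    "at": "2026-04-04",
})
append_event(custody_log, {
    "step": "transformed",
    "detail": "personal emails removed",
    "at": "2026-04-04",
})
chain_ok = verify_chain(custody_log)
```

Storing a copy of the log, or just its latest hash, outside the application database makes it harder for a single change to go unnoticed.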

Open data and government data (OGD)

Switzerland’s Open Government Data strategy and the Federal Act on Freedom of Information in the Administration (BGÖ) govern how official data can be accessed and reused.³

Key principles:

Data published under OGD on opendata.swiss is generally freely reusable
The BGÖ grants access to official documents but not automatic reuse rights: access ≠ licence
Each data source may have its own terms — check individually
If a source forbids reuse, link to it rather than republish it
Attribution is almost always required, even for open data

For example: building a contact directory from government data

You want to aggregate elected officials’ contact details from multiple government sources:

Check each source’s terms — parliament.ch, admin.ch, cantonal directories each have their own ToS

Prefer OGD-published datasets — these have the clearest reuse rights

Record provenance per record — “Councillor X’s email comes from source Y, accessed on date Z, under licence W”

Display provenance to users — “Source: parliament.ch, last verified 2026-04-04”

If no reuse right exists — link to the original source page instead of copying the data
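
The last three steps can be sketched as per-field provenance with a link fallback. The `SourcedField` class, the `render` function, and the `example.org` addresses below are hypothetical; a `licence` of `None` stands for "no reuse right was found".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourcedField:
    """A single value plus the provenance needed to display or withhold it."""
    value: Optional[str]
    source: str
    source_url: str
    accessed: str            # ISO date of the last verification
    licence: Optional[str]   # None means no reuse right was found


def render(field: SourcedField) -> str:
    """Show the value with its source, or link out when reuse is not permitted."""
    if field.licence is None or field.value is None:
        return f"See original source: {field.source_url}"
    return f"{field.value} (Source: {field.source}, last verified {field.accessed})"


reusable = SourcedField(
    value="councillor.x@example.org",
    source="parliament.ch",
    source_url="https://example.org/members/x",
    accessed="2026-04-04",
    licence="CC-BY",
)
link_only = SourcedField(
    value=None,
    source="cantonal directory",
    source_url="https://example.org/directory/x",
    accessed="2026-04-04",
    licence=None,
)
shown = render(reusable)
linked = render(link_only)
```

Keeping provenance at field level rather than per dataset matters here because one official's record may combine an email from one source with a phone number from another, each under different terms.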

Why do we use it?

Key reasons

1. Legal defensibility. If your data sources are challenged — by a data subject, a rights holder, or a regulator — provenance records are your evidence. Without them, you cannot prove you had the right to use the data.

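2. Data quality. Knowing where data comes from lets you assess its reliability. As a rough default, an official government source is more reliable than a personal blog, which in turn is more reliable than unverified AI-generated content. Provenance is a useful proxy for trustworthiness, not a guarantee of it.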

3. Compliance at scale. As your application grows and consumes more data sources, provenance metadata lets you manage rights across hundreds of sources without losing track.
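
Once provenance is stored as structured metadata, managing rights across many sources becomes a query rather than a manual review. The sketch below assumes each source is a dictionary with `name`, `licence`, and `accessed` keys and uses an arbitrary 90-day re-verification window; both are illustrative choices, not requirements from any standard.

```python
from datetime import date, timedelta
from typing import Dict, List


def audit_sources(
    sources: List[Dict[str, str]],
    today: date,
    max_age_days: int = 90,
) -> Dict[str, List[str]]:
    """Return, per source name, the provenance problems that need attention."""
    problems: Dict[str, List[str]] = {}
    for src in sources:
        issues: List[str] = []
        if not src.get("licence"):
            issues.append("missing licence")
        accessed = src.get("accessed")
        if not accessed:
            issues.append("missing access date")
        elif today - date.fromisoformat(accessed) > timedelta(days=max_age_days):
            issues.append("stale: re-verify terms")
        if issues:
            problems[src["name"]] = issues
    return problems


# Hypothetical inventory of two sources.
inventory = [
    {"name": "parliament.ch/members", "licence": "CC-BY", "accessed": "2026-04-01"},
    {"name": "cantonal-directory", "accessed": "2025-01-01"},
]
report = audit_sources(inventory, today=date(2026, 4, 4))
```

Running a check like this on a schedule surfaces sources whose terms were never recorded or have not been re-checked recently.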

When do we use it?

When importing data from any external source (API, dataset, web)
When aggregating data from multiple sources into a single view
When republishing or displaying third-party data to users
When training or fine-tuning AI models on external data
When a data subject or rights holder challenges your use of their data
When building features that combine user data with external data

Rule of thumb

If you didn’t create the data yourself, document where it came from and what rights you have. This usually takes only a few minutes per source at import time and can save months of legal disputes later.

How can I think about it?

The wine label analogy

A bottle of wine carries a label: region, vineyard, grape variety, vintage year, alcohol content, producer. This is provenance. A buyer uses it to judge quality, authenticity, and suitability.

Region = data source (parliament.ch, user input, API)

Vintage = date accessed or last verified

Producer = the entity responsible for the data

Grape variety = data type (official record, user-generated, AI-generated)

Appellation rules = licence/terms of use

A wine without a label is suspect. Data without provenance is the same — you can’t vouch for its quality, and you can’t prove you acquired it legitimately.

The evidence chain analogy

In criminal law, physical evidence must maintain a “chain of custody”: every person who handled it, every location where it was stored, and every transfer must be documented. If the chain is broken, the evidence may be ruled inadmissible.

Data provenance is the chain of custody for your application’s data. If you can’t trace a piece of data from your database back to its source through an unbroken chain of records, that data is — from a governance perspective — inadmissible.

Concepts to explore next

Concept	What it covers	Status
personal-data-protection	What happens when your data sources contain personal data	complete
ai-content-liability	Provenance challenges when AI generates content	complete
algorithmic-transparency	Explaining what data feeds your algorithms	complete

Check your understanding

Test yourself

Explain — Why does “publicly available data” not automatically mean “data you can freely reuse”?

Name — What are the four dimensions of provenance you should know and document for every external data source?

Distinguish — What is the difference between data provenance (origin and rights) and data lineage (processing and transformation)?

Interpret — You find a useful dataset on a government website, but there is no licence or terms of use listed. What should you do?

Connect — How does data provenance support the “accountability” principle of data protection?

Where this concept fits

Position in the knowledge graph
graph TD
    A[Data Governance] --> B[Data Provenance]
    A --> C[Personal Data Protection]
    A --> D[AI Content Liability]
    B --> E[Data Lineage]
    B --> F[Open Data Licensing]
    B --> G[Terms of Use Analysis]
    style B fill:#4a9ede,color:#fff
Related concepts:

personal-data-protection — provenance records support data protection accountability

ai-content-liability — AI-generated content creates novel provenance challenges

apis — APIs are a primary channel through which third-party data enters your system
