Data Provenance
The practice of tracking where data comes from, what rights and conditions are attached to it, and what you are legally and ethically permitted to do with it — from its origin through every transformation to its final use.
What is it?
Data provenance answers three deceptively simple questions about every piece of data in your system: Where did it come from? What are the conditions of use? And can I prove both?
In a world where applications routinely aggregate data from APIs, public datasets, web scraping, user input, and AI-generated content, provenance is the practice of maintaining a clear chain of custody. Without it, you might republish data you don’t have the right to use, violate licence terms you never read, or find yourself unable to explain where a contested piece of information came from.[^1]
Provenance is distinct from (but related to) data lineage, which focuses on how data transforms as it flows through your system. Provenance is about origin and rights. Lineage is about processing and transformation. A developer needs both: provenance tells you what you’re allowed to do; lineage tells you what you actually did.[^2]
For developers building applications that consume third-party data — government open data, public APIs, scraped content, AI training sets — provenance is not optional. It’s the difference between a defensible application and a legal liability.
In plain terms
Data provenance is like the label on food packaging. It tells you where the ingredients came from (origin), whether they’re organic or contain allergens (conditions), and who is responsible for the product (accountability). Without the label, you’re serving food of unknown origin to your users.
At a glance
The provenance chain
graph LR
    A[Source] -->|Terms & Licence| B[Acquisition]
    B -->|Record keeping| C[Storage]
    C -->|Transformation log| D[Processing]
    D -->|Attribution & rights| E[Publication]
    style A fill:#4a9ede,color:#fff
    style E fill:#4a9ede,color:#fff

Key: At each stage, provenance metadata must be preserved: where the data came from, what terms apply, and what transformations occurred. The chain must be unbroken from source to publication.
How does it work?
The four dimensions of provenance
For every dataset or data source in your application, you need to know and document four things:
1. Origin — where does it come from?
| Source type | Example | Provenance question |
|---|---|---|
| Official government data | parliament.ch, admin.ch | Is this published under OGD? What are the terms? |
| Public API | REST endpoint | What does the API’s Terms of Service say about reuse? |
| User-generated content | Form submissions | User consented to specific uses — which ones? |
| Scraped/crawled data | Web pages | robots.txt + ToS — does the site permit scraping? |
| AI-generated content | LLM output | Who owns the output? Is the model’s training data licensed? |
Think of it like...
A journalist attributing their sources. “A government official told Reuters” has provenance. “Someone said” does not. Your data needs the same rigour.
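For the scraped-data row in the table above, the robots.txt part of the question can be checked mechanically. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `example-bot` user agent, and the example.org URLs are hypothetical. It only covers robots.txt, so the site's Terms of Service still have to be read separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice it is fetched from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def scraping_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt does not disallow this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = scraping_allowed(
    ROBOTS_TXT, "example-bot", "https://example.org/members"
)
private_ok = scraping_allowed(
    ROBOTS_TXT, "example-bot", "https://example.org/private/list"
)
```

A positive result here is a necessary condition at best: robots.txt expresses crawling preferences, not a reuse licence.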
2. Rights — what am I allowed to do with it?
Not all data that is accessible is reusable. Public availability does not equal open licence.[^3]
| Access level | Reuse rights | Example |
|---|---|---|
| Open data (OGD, CC0, CC-BY) | Free to reuse, may require attribution | Swiss OGD datasets on opendata.swiss |
| Public but restricted | Accessible but reuse may be limited by ToS | parliament.ch contact data |
| API with terms | Reuse governed by API licence/ToS | Google Maps API |
| Copyrighted | No reuse without licence | News articles, proprietary databases |
| Personal data | Governed by data protection law | User profiles, contact details |
Developer rule of thumb
Before importing any external data source, find and read the Terms of Use or licence. If there are no terms, assume you cannot reuse it. “It’s on the internet” is not a licence.
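The rule of thumb above amounts to a default-deny check. Here is a minimal sketch of that idea; the `KNOWN_LICENCES` table and the `reuse_decision` helper are illustrative assumptions, not a legal classification, and a real project would fill the table from the terms it has actually read.

```python
from typing import Optional

# Illustrative licence table (an assumption for this sketch, not legal advice).
KNOWN_LICENCES = {
    "CC0": {"reuse": True, "attribution": False},
    "CC-BY": {"reuse": True, "attribution": True},
}


def reuse_decision(licence: Optional[str]) -> dict:
    """Default-deny: missing or unrecognised terms mean no reuse."""
    rule = KNOWN_LICENCES.get(licence) if licence else None
    if rule is None:
        return {
            "reuse": False,
            "attribution": False,
            "reason": "no recognised licence",
        }
    return {**rule, "reason": f"licence {licence}"}


no_terms = reuse_decision(None)
cc_by = reuse_decision("CC-BY")
```

The important design choice is that the fallback branch denies reuse, so a source nobody has reviewed can never slip through as permitted.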
3. Metadata — what do I record?
For each data source, maintain a provenance record:
Source: parliament.ch/members
Type: Official government data
Licence: OGD / CC-BY (opendata.swiss terms)
Accessed: 2026-04-04
Update freq: Daily refresh via API
Rights: Reuse permitted with attribution
Restrictions: Personal emails excluded per nDSG
Contact: opendata@bk.admin.ch
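The record above could be kept as structured data rather than free text. The sketch below is one possible shape, assuming a Python codebase; the class name `ProvenanceRecord` and its field names are hypothetical and simply mirror the labels in the record.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a record is not edited after it is written
class ProvenanceRecord:
    source: str
    source_type: str
    licence: str
    accessed: date
    update_frequency: str
    rights: str
    restrictions: str
    contact: str


record = ProvenanceRecord(
    source="parliament.ch/members",
    source_type="Official government data",
    licence="OGD / CC-BY (opendata.swiss terms)",
    accessed=date(2026, 4, 4),
    update_frequency="Daily refresh via API",
    rights="Reuse permitted with attribution",
    restrictions="Personal emails excluded per nDSG",
    contact="opendata@bk.admin.ch",
)

# A plain dict is convenient for storing the record as JSON or in a database.
record_dict = asdict(record)
```

Freezing the dataclass means a changed licence or a new access date produces a new record instead of overwriting the old one, which keeps the history intact.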
Concept to explore
Data provenance becomes critical when your application uses AI to generate content based on external data. See ai-content-liability for how provenance intersects with AI-generated content attribution.
4. Chain of custody — can I trace it back?
If someone challenges the accuracy or legality of data in your application, you must be able to trace it back to its source. This requires:
- Immutable source records — when and where you acquired the data
- Transformation logs — what processing you applied
- Version history — what the data looked like when you acquired it vs now
- Attribution display — showing users where the data comes from
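One way to make source records and transformation logs tamper-evident is a hash-chained, append-only log. This technique is not prescribed by the text above; it is an assumption chosen for illustration, and the function names and event fields are hypothetical.

```python
import hashlib
import json


def entry_hash(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the entry body."""
    payload = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_event(log: list, event: str, detail: str, timestamp: str) -> list:
    """Return a new log with one more entry linked to the previous hash."""
    prev = log[-1]["hash"] if log else ""
    body = {"event": event, "detail": detail,
            "timestamp": timestamp, "prev": prev}
    return log + [{**body, "hash": entry_hash(body)}]


def verify(log: list) -> bool:
    """Check that every entry matches its hash and links to its predecessor."""
    prev = ""
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev"] != prev or entry_hash(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


log = append_event([], "acquired", "parliament.ch/members",
                   "2026-04-04T09:00:00Z")
log = append_event(log, "transformed", "removed personal emails",
                   "2026-04-04T09:05:00Z")
```

Editing any earlier entry changes its hash and breaks the link from the next entry, so `verify` detects the change. The log is tamper-evident rather than truly immutable; storing it somewhere write-once is a separate concern.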
Open data and government data (OGD)
Switzerland’s Open Government Data strategy and the Federal Act on Freedom of Information in the Administration (BGÖ) govern how official data can be reused.[^3]
Key principles:
- Data published under OGD on opendata.swiss is generally freely reusable
- The BGÖ grants access to official documents but not automatic reuse rights: access ≠ licence
- Each data source may have its own terms — check individually
- If a source forbids reuse, link to it rather than republish it
- Attribution is almost always required, even for open data
For example: building a contact directory from government data
You want to aggregate elected officials’ contact details from multiple government sources:
- Check each source’s terms — parliament.ch, admin.ch, cantonal directories each have their own ToS
- Prefer OGD-published datasets — these have the clearest reuse rights
- Record provenance per record — “Councillor X’s email comes from source Y, accessed on date Z, under licence W”
- Display provenance to users — “Source: parliament.ch, last verified 2026-04-04”
- If no reuse right exists — link to the original source page instead of copying the data
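The per-record steps in this example can be sketched as follows. The `ContactField` class, the `render` helper, and the sample values (including the example.org address and URL) are hypothetical; the point is that each stored value carries its own source, access date, and licence, and that a field without a reuse right is rendered as a link only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContactField:
    value: Optional[str]     # None when the data itself must not be copied
    source: str              # human-readable source name
    source_url: str          # page to link to
    accessed: str            # ISO date of last verification
    licence: Optional[str]   # None means no reuse right was found


def render(field: ContactField) -> str:
    """Show the value with its provenance, or fall back to a link."""
    if field.licence is None or field.value is None:
        return f"See original source: {field.source_url}"
    return (f"{field.value} "
            f"(Source: {field.source}, last verified {field.accessed})")


open_field = ContactField(
    value="councillor.x@example.org",
    source="parliament.ch",
    source_url="https://www.parliament.ch/",
    accessed="2026-04-04",
    licence="OGD",
)
restricted_field = ContactField(
    value=None,
    source="cantonal directory",
    source_url="https://example.org/directory",
    accessed="2026-04-04",
    licence=None,
)
```

Here `render(open_field)` yields the address followed by its source line, while `render(restricted_field)` yields only the link.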
Why do we use it?
Key reasons
1. Legal defensibility. If your data sources are challenged — by a data subject, a rights holder, or a regulator — provenance records are your evidence. Without them, you cannot prove you had the right to use the data.
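2. Data quality. Knowing where data comes from lets you assess its reliability: an official government source generally warrants more trust than a personal blog, which in turn warrants more than unverified AI-generated text. Provenance is a proxy for trustworthiness, not a guarantee of accuracy.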
3. Compliance at scale. As your application grows and consumes more data sources, provenance metadata lets you manage rights across hundreds of sources without losing track.
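Managing rights across many sources usually means auditing the provenance records themselves. Below is a minimal sketch, assuming each source is stored as a dict with `source`, `licence`, and `accessed` keys; the `audit` function, the key names, and the 90-day threshold are all assumptions.

```python
from datetime import date


def audit(sources: list, today: date, max_age_days: int = 90) -> list:
    """List (source, problem) pairs for records that need attention."""
    findings = []
    for s in sources:
        if not s.get("licence"):
            findings.append((s["source"], "missing licence"))
        if (today - s["accessed"]).days > max_age_days:
            findings.append((s["source"], "not verified recently"))
    return findings


sources = [
    {"source": "parliament.ch/members", "licence": "OGD / CC-BY",
     "accessed": date(2026, 4, 4)},
    {"source": "example.org/dataset", "licence": None,
     "accessed": date(2025, 1, 15)},
]

findings = audit(sources, today=date(2026, 4, 10))
```

With the sample data, only the second source is flagged, once for the missing licence and once for the stale access date.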
When do we use it?
- When importing data from any external source (API, dataset, web)
- When aggregating data from multiple sources into a single view
- When republishing or displaying third-party data to users
- When training or fine-tuning AI models on external data
- When a data subject or rights holder challenges your use of their data
- When building features that combine user data with external data
Rule of thumb
If you didn’t create the data yourself, document where it came from and what rights you have. This takes five minutes per source at import time and can save months of legal disputes later.
How can I think about it?
The wine label analogy
A bottle of wine carries a label: region, vineyard, grape variety, vintage year, alcohol content, producer. This is provenance. A buyer uses it to judge quality, authenticity, and suitability.
- Region = data source (parliament.ch, user input, API)
- Vintage = date accessed or last verified
- Producer = the entity responsible for the data
- Grape variety = data type (official record, user-generated, AI-generated)
- Appellation rules = licence/terms of use
A wine without a label is suspect. Data without provenance is the same — you can’t vouch for its quality, and you can’t prove you acquired it legitimately.
The evidence chain analogy
In criminal law, physical evidence must maintain a “chain of custody”: every person who handled it, every location where it was stored, and every transfer is documented. If the chain is broken, the evidence may be ruled inadmissible or given little weight.
Data provenance is the chain of custody for your application’s data. If you can’t trace a piece of data from your database back to its source through an unbroken chain of records, that data is — from a governance perspective — inadmissible.
Concepts to explore next
| Concept | What it covers | Status |
|---|---|---|
| personal-data-protection | What happens when your data sources contain personal data | complete |
| ai-content-liability | Provenance challenges when AI generates content | complete |
| algorithmic-transparency | Explaining what data feeds your algorithms | complete |
Check your understanding
Test yourself
- Explain — Why does “publicly available data” not automatically mean “data you can freely reuse”?
- Name — What four dimensions of provenance should you document for every external data source?
- Distinguish — What is the difference between data provenance (origin and rights) and data lineage (processing and transformation)?
- Interpret — You find a useful dataset on a government website, but there is no licence or terms of use listed. What should you do?
- Connect — How does data provenance support the “accountability” principle of data protection?
Where this concept fits
Position in the knowledge graph
graph TD
    A[Data Governance] --> B[Data Provenance]
    A --> C[Personal Data Protection]
    A --> D[AI Content Liability]
    B --> E[Data Lineage]
    B --> F[Open Data Licensing]
    B --> G[Terms of Use Analysis]
    style B fill:#4a9ede,color:#fff

Related concepts:
- personal-data-protection — provenance records support data protection accountability
- ai-content-liability — AI-generated content creates novel provenance challenges
- apis — APIs are a primary channel through which third-party data enters your system
Sources
Further reading
Resources
- Data Provenance vs. Data Lineage: Key Differences and AI Use Cases — Clear comparison of provenance and lineage with practical AI use cases
- Data Provenance: Origin Tracking, Lineage and Authenticity — Comprehensive guide to provenance as a baseline requirement
- Data Lineage in AI Pipelines: Provenance & Compliance — How provenance applies specifically to AI/ML data pipelines
- Implement Dataset Provenance and Licensing for AI Training — Technical tutorial on cryptographic provenance for datasets
Footnotes
[^1]: Snowflake. (2026). Data Provenance vs. Data Lineage: Key Differences and AI Use Cases. Snowflake.
[^2]: Truescreen. (2026). Data Provenance: Origin Tracking, Lineage and Authenticity. Truescreen.
[^3]: Swiss Federal Archives, as referenced in the legal compliance analysis for pol.yiuno.org (2026). BGO / OGD Strategy.