Data Provenance

The practice of tracking where data comes from, what rights and conditions are attached to it, and what you are legally and ethically permitted to do with it — from its origin through every transformation to its final use.


What is it?

Data provenance answers three deceptively simple questions about every piece of data in your system: Where did it come from? What are the conditions of use? And can I prove both?

In a world where applications routinely aggregate data from APIs, public datasets, web scraping, user input, and AI-generated content, provenance is the practice of maintaining a clear chain of custody. Without it, you might republish data you don’t have the right to use, violate licence terms you never read, or find yourself unable to explain where a contested piece of information came from. [1]

Provenance is distinct from (but related to) data lineage, which focuses on how data transforms as it flows through your system. Provenance is about origin and rights. Lineage is about processing and transformation. A developer needs both: provenance tells you what you’re allowed to do; lineage tells you what you actually did. [2]

For developers building applications that consume third-party data — government open data, public APIs, scraped content, AI training sets — provenance is not optional. It’s the difference between a defensible application and a legal liability.

In plain terms

Data provenance is like the label on food packaging. It tells you where the ingredients came from (origin), whether they’re organic or contain allergens (conditions), and who is responsible for the product (accountability). Without the label, you’re serving food of unknown origin to your users.



How does it work?

The four dimensions of provenance

For every dataset or data source in your application, you need to know and document four things:

1. Origin — where does it come from?

Source type | Example | Provenance question
Official government data | parliament.ch, admin.ch | Is this published under OGD? What are the terms?
Public API | REST endpoint | What does the API’s Terms of Service say about reuse?
User-generated content | Form submissions | User consented to specific uses: which ones?
Scraped/crawled data | Web pages | robots.txt + ToS: does the site permit scraping?
AI-generated content | LLM output | Who owns the output? Is the model’s training data licensed?
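For the scraped-data row, the robots.txt part of the question can be checked mechanically. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `example-bot` user agent, and the `example.org` URLs are made up for illustration, and the file is inlined so the example runs offline. A permissive robots.txt is only one signal: it says nothing about the site's Terms of Service, which still have to be read.

```python
from urllib.robotparser import RobotFileParser


def scraping_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content (an assumption, not taken from a real site).
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(scraping_allowed(EXAMPLE_ROBOTS, "example-bot", "https://example.org/members"))       # True
print(scraping_allowed(EXAMPLE_ROBOTS, "example-bot", "https://example.org/private/list"))  # False
```

In a real crawler the robots.txt text would be fetched from the site and the result stored alongside the provenance record, together with the date of the check.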

Think of it like...

A journalist attributing their sources. “A government official told Reuters” has provenance. “Someone said” does not. Your data needs the same rigour.

2. Rights — what am I allowed to do with it?

Not all data that is accessible is reusable. Public availability does not equal open licence. [3]

Access level | Reuse rights | Example
Open data (OGD, CC0, CC-BY) | Free to reuse, may require attribution | Swiss OGD datasets on opendata.swiss
Public but restricted | Accessible but reuse may be limited by ToS | parliament.ch contact data
API with terms | Reuse governed by API licence/ToS | Google Maps API
Copyrighted | No reuse without licence | News articles, proprietary databases
Personal data | Governed by data protection law | User profiles, contact details

Developer rule of thumb

Before importing any external data source, find and read the Terms of Use or licence. If there are no terms, assume you cannot reuse it. “It’s on the internet” is not a licence.
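One way to make this rule of thumb hard to skip is a default-deny check at import time. The following is a minimal sketch, not a legal test: the `AccessLevel` enum mirrors the access levels in the table above, while the function name `reuse_decision` and the wording of the returned messages are assumptions made for this example.

```python
from enum import Enum
from typing import Optional


class AccessLevel(Enum):
    OPEN = "open data"
    PUBLIC_RESTRICTED = "public but restricted"
    API_TERMS = "API with terms"
    COPYRIGHTED = "copyrighted"
    PERSONAL = "personal data"


def reuse_decision(level: Optional[AccessLevel], terms_reviewed: bool) -> str:
    """Map an access level to a coarse decision; unknown terms always deny."""
    if level is None or not terms_reviewed:
        return "deny: no known terms, assume reuse is not permitted"
    if level is AccessLevel.OPEN:
        return "allow: check whether attribution is required"
    if level is AccessLevel.COPYRIGHTED:
        return "deny: obtain a licence first"
    if level is AccessLevel.PERSONAL:
        return "review: data protection law applies"
    # PUBLIC_RESTRICTED and API_TERMS both depend on the source's own terms.
    return "review: reuse is governed by the source's terms"


print(reuse_decision(None, terms_reviewed=False))
print(reuse_decision(AccessLevel.OPEN, terms_reviewed=True))
```

The important design choice is the first branch: a source whose terms nobody has read is treated the same as a source with no licence at all.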

3. Metadata — what do I record?

For each data source, maintain a provenance record:

Source:       parliament.ch/members
Type:         Official government data
Licence:      OGD / CC-BY (opendata.swiss terms)
Accessed:     2026-04-04
Update freq:  Daily refresh via API
Rights:       Reuse permitted with attribution
Restrictions: Personal emails excluded per nDSG
Contact:      opendata@bk.admin.ch
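The record above could be kept as a typed structure so that every source carries the same fields and can be stored next to the data. This is a sketch under assumptions: the class name `ProvenanceRecord` and the field names are illustrative choices, and the values are simply the ones from the example record.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a record is not edited after it is written
class ProvenanceRecord:
    source: str
    source_type: str
    licence: str
    accessed: date
    update_frequency: str
    rights: str
    restrictions: str
    contact: str

    def to_json(self) -> str:
        """Serialise the record, writing the access date in ISO 8601 form."""
        data = asdict(self)
        data["accessed"] = self.accessed.isoformat()
        return json.dumps(data, indent=2, ensure_ascii=False)


record = ProvenanceRecord(
    source="parliament.ch/members",
    source_type="Official government data",
    licence="OGD / CC-BY (opendata.swiss terms)",
    accessed=date(2026, 4, 4),
    update_frequency="Daily refresh via API",
    rights="Reuse permitted with attribution",
    restrictions="Personal emails excluded per nDSG",
    contact="opendata@bk.admin.ch",
)
print(record.to_json())
```

Making every field mandatory means a source cannot be registered without someone at least deciding what its licence and restrictions are.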

Concept to explore

Data provenance becomes critical when your application uses AI to generate content based on external data. See ai-content-liability for how provenance intersects with AI-generated content attribution.

4. Chain of custody — can I trace it back?

If someone challenges the accuracy or legality of data in your application, you must be able to trace it back to its source. This requires:

  • Immutable source records — when and where you acquired the data
  • Transformation logs — what processing you applied
  • Version history — what the data looked like when you acquired it vs now
  • Attribution display — showing users where the data comes from
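The first three requirements can be approximated with an append-only log in which each entry stores a hash of the data as acquired and a hash of the previous entry. The sketch below is one possible design, not a prescribed one: `CustodyLog`, its field names, and the sample payloads are hypothetical. Hash chaining makes later edits detectable; it does not prevent them, so a real system would also need write-protected storage.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(payload: dict) -> str:
    """Stable SHA-256 over a JSON-serialisable dict."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode("utf-8")).hexdigest()


class CustodyLog:
    def __init__(self) -> None:
        self._entries: list = []

    def append(self, event: str, source: str, content: bytes) -> dict:
        """Record an acquisition or transformation, linked to the previous entry."""
        body = {
            "event": event,
            "source": source,
            "content_sha256": hashlib.sha256(content).hexdigest(),
            "recorded_at": datetime.now(timezone.utc).isoformat(),
            "prev_hash": self._entries[-1]["entry_hash"] if self._entries else "",
        }
        entry = {**body, "entry_hash": _digest(body)}
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Return True only if no entry was altered, removed, or reordered."""
        prev_hash = ""
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            if entry["prev_hash"] != prev_hash or entry["entry_hash"] != _digest(body):
                return False
            prev_hash = entry["entry_hash"]
        return True


log = CustodyLog()
# Sample payloads are invented for illustration.
log.append("acquired", "parliament.ch/members", b'{"name": "Example Member", "email": "m@example.org"}')
log.append("transformed: removed email field", "parliament.ch/members", b'{"name": "Example Member"}')
print(log.verify())  # True
```

Storing `content_sha256` at acquisition time is what lets you later show what the data looked like when you received it, even if the upstream source has since changed.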

Open data and government data (OGD)

Switzerland’s Open Government Data strategy and the Federal Act on Freedom of Information (BGÖ) govern how official data can be reused. [3]

Key principles:

  • Data published under OGD on opendata.swiss is generally freely reusable
  • The BGÖ grants access to official documents but not automatic reuse rights: access ≠ licence
  • Each data source may have its own terms — check individually
  • If a source forbids reuse, link to it rather than republish it
  • Attribution is usually required, even for open data (CC-BY requires it; CC0 does not)
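The "link rather than republish" and attribution principles translate into a small display rule. The sketch below assumes the terms of each source have already been reviewed by a person and reduced to two flags; `SourceTerms`, `render`, and the example URLs are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class SourceTerms:
    name: str
    url: str
    reuse_permitted: bool
    attribution_required: bool


def render(terms: SourceTerms, content: str) -> str:
    """Decide what the user sees: the content, the content plus credit, or a link."""
    if not terms.reuse_permitted:
        # Reuse is forbidden: point to the source instead of copying it.
        return f"See the original at {terms.url}"
    if terms.attribution_required:
        return f"{content}\nSource: {terms.name} ({terms.url})"
    return content


open_source = SourceTerms("opendata.swiss", "https://opendata.swiss", True, True)
closed_source = SourceTerms("Example News", "https://news.example.org/article", False, False)

print(render(open_source, "Population by canton, 2025"))
print(render(closed_source, "Full article text"))
```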

Why do we use it?

Key reasons

1. Legal defensibility. If your data sources are challenged — by a data subject, a rights holder, or a regulator — provenance records are your evidence. Without them, you cannot prove you had the right to use the data.

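2. Data quality. Knowing where data comes from lets you assess its reliability. As a rough heuristic, an official government source is usually more reliable than a personal blog, which in turn is usually more reliable than unverified AI-generated text. Provenance is a proxy for trustworthiness, not a guarantee of it.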

3. Compliance at scale. As your application grows and consumes more data sources, provenance metadata lets you manage rights across hundreds of sources without losing track.
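With many sources, a periodic audit over the provenance metadata can surface gaps before a regulator or rights holder does. The sketch below is illustrative only: the dictionary keys follow the example provenance record, while `audit_sources`, the 90-day threshold, and the sample entries are assumptions.

```python
from datetime import date, timedelta


def audit_sources(sources: list, today: date, max_age_days: int = 90) -> list:
    """Return human-readable problems found in a list of provenance dicts."""
    problems = []
    for src in sources:
        name = src.get("source", "<unnamed>")
        if not src.get("licence"):
            problems.append(f"{name}: no licence or terms recorded")
        accessed = src.get("accessed")
        if accessed is None:
            problems.append(f"{name}: no access date recorded")
        elif today - accessed > timedelta(days=max_age_days):
            problems.append(f"{name}: terms not re-verified in over {max_age_days} days")
    return problems


# Sample registry entries, invented for illustration.
registry = [
    {"source": "parliament.ch/members", "licence": "OGD / CC-BY", "accessed": date(2026, 4, 4)},
    {"source": "legacy-scrape", "licence": "", "accessed": date(2025, 1, 1)},
]

for problem in audit_sources(registry, today=date(2026, 4, 4)):
    print(problem)
```

Run on a schedule, a check like this turns provenance from a one-time import task into something that is re-verified as licences and terms change.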


When do we use it?

  • When importing data from any external source (API, dataset, web)
  • When aggregating data from multiple sources into a single view
  • When republishing or displaying third-party data to users
  • When training or fine-tuning AI models on external data
  • When a data subject or rights holder challenges your use of their data
  • When building features that combine user data with external data

Rule of thumb

If you didn’t create the data yourself, document where it came from and what rights you have. This takes five minutes per source at import time and can save months of legal disputes later.


How can I think about it?

The wine label analogy

A bottle of wine carries a label: region, vineyard, grape variety, vintage year, alcohol content, producer. This is provenance. A buyer uses it to judge quality, authenticity, and suitability.

  • Region = data source (parliament.ch, user input, API)
  • Vintage = date accessed or last verified
  • Producer = the entity responsible for the data
  • Grape variety = data type (official record, user-generated, AI-generated)
  • Appellation rules = licence/terms of use

A wine without a label is suspect. Data without provenance is the same — you can’t vouch for its quality, and you can’t prove you acquired it legitimately.

The evidence chain analogy

In criminal law, physical evidence must maintain a “chain of custody”: every person who handled it, every location where it was stored, and every transfer must be documented. If the chain is broken, the evidence may be ruled inadmissible or lose much of its weight.

Data provenance is the chain of custody for your application’s data. If you can’t trace a piece of data from your database back to its source through an unbroken chain of records, that data is — from a governance perspective — inadmissible.


Concepts to explore next

Concept | What it covers | Status
personal-data-protection | What happens when your data sources contain personal data | complete
ai-content-liability | Provenance challenges when AI generates content | complete
algorithmic-transparency | Explaining what data feeds your algorithms | complete




Where this concept fits

Position in the knowledge graph

graph TD
    A[Data Governance] --> B[Data Provenance]
    A --> C[Personal Data Protection]
    A --> D[AI Content Liability]
    B --> E[Data Lineage]
    B --> F[Open Data Licensing]
    B --> G[Terms of Use Analysis]
    style B fill:#4a9ede,color:#fff

Related concepts:

  • personal-data-protection — provenance records support data protection accountability
  • ai-content-liability — AI-generated content creates novel provenance challenges
  • apis — APIs are a primary channel through which third-party data enters your system

Sources



Footnotes

  1. Snowflake. (2026). Data Provenance vs. Data Lineage: Key Differences and AI Use Cases. Snowflake.

  2. Truescreen. (2026). Data Provenance: Origin Tracking, Lineage and Authenticity. Truescreen.

  3. Swiss Federal Archives, as referenced in the legal compliance analysis for pol.yiuno.org (2026). BGO / OGD Strategy.