Pre-Training

The first and most expensive stage of building a language model: feeding it billions of pages of text and training it to predict the next token, so that it absorbs the statistical patterns of language, knowledge, and reasoning.


What is it?

When a language model is created, its weights — the billions of numerical parameters that determine its behaviour — start as random numbers. The model knows nothing. It cannot distinguish “the” from “xyzzy.” It cannot predict what follows “The cat sat on the.”

Pre-training is the process that changes this. The model reads enormous amounts of text — books, websites, code repositories, scientific papers, conversations — and for each sequence of tokens, it tries to predict the next token. Every wrong prediction triggers a small adjustment to its weights. Over trillions of predictions, the model builds a statistical map of language: which words follow which, what patterns exist, how ideas connect, and what the world described by that text looks like.1

The result is a base model — a system that is extremely good at continuing text but has no concept of being helpful, safe, or conversational. Ask it a question and it might respond with another question, because that is what text on the internet often does. Turning a base model into something useful requires additional stages: fine-tuning and alignment.2

In plain terms

Pre-training is like a child reading every book in a library. The child does not memorise the books word for word, but after reading enough, they develop an intuition for how language works, what facts are commonly discussed, and what kinds of sentences make sense. The child has not been taught to answer questions yet — they have absorbed patterns.


How does it work?

1. The training data — what the model reads

Pre-training datasets are massive. GPT-3 was trained on approximately 300 billion tokens. Llama 3 was trained on over 15 trillion tokens. These datasets are assembled from:1

| Source | Why it matters |
| --- | --- |
| Web crawls (Common Crawl) | Broad coverage of topics and writing styles |
| Books | Long-form reasoning, narrative structure |
| Wikipedia | Factual knowledge, structured writing |
| Code repositories (GitHub) | Programming patterns, logical structure |
| Scientific papers | Technical vocabulary, formal reasoning |
| Conversations and forums | Informal language, question-answer patterns |

The data is cleaned to remove duplicates, filter low-quality pages, and balance domains. Data quality directly affects model quality — a model trained on noise produces noise.3
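The deduplication and quality-filtering step can be sketched in a few lines. This is a toy illustration, not a production pipeline: real systems use fuzzy deduplication (e.g. MinHash) and learned quality classifiers, and the `min_words` threshold here is purely illustrative.

```python
import hashlib

def clean_corpus(documents, min_words=50):
    """Exact-deduplicate and quality-filter a list of text documents.

    Toy sketch: hash each document to drop exact duplicates, then
    drop very short pages as a crude quality proxy.
    """
    seen = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        if len(doc.split()) < min_words:
            continue  # too short: likely a low-quality page
        kept.append(doc)
    return kept
```

Real pipelines layer many such filters; the principle is the same — every document that survives shapes what the model learns.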

2. The objective — predict the next token

The training objective is deceptively simple: given a sequence of tokens, predict the next one. This is called causal language modelling (or autoregressive pre-training).2

For the sentence “The cat sat on the mat”:

| Input (context) | Target (next token) |
| --- | --- |
| The | cat |
| The cat | sat |
| The cat sat | on |
| The cat sat on | the |
| The cat sat on the | mat |
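These training pairs can be generated mechanically: every position in a sequence yields one example. The sketch below uses word-level "tokens" for readability; real models operate on subword tokens.

```python
def make_examples(tokens):
    """Turn one token sequence into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Word-level tokens for readability; real tokenizers emit subword pieces.
for context, target in make_examples(["The", "cat", "sat", "on", "the", "mat"]):
    print(" ".join(context), "->", target)
```

One sentence of six tokens produces five training examples, which is part of why even "only" billions of documents yield trillions of predictions.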

The model processes the input and produces a probability distribution over its vocabulary. The training algorithm then computes the loss: a measure of how little probability the model assigned to the correct next token. A confident, correct prediction yields a low loss; a prediction that gave the true token little probability yields a high one.2
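The standard loss for this objective is cross-entropy: the negative log of the probability assigned to the true token. A minimal sketch, with an invented toy distribution standing in for the model's output:

```python
import math

def cross_entropy(probs, target):
    """Loss = -log(probability the model gave the correct token)."""
    return -math.log(probs[target])

# Toy distribution a model might output after "The cat sat on the"
probs = {"mat": 0.7, "dog": 0.2, "banana": 0.1}
print(round(cross_entropy(probs, "mat"), 3))     # confident and right: low loss
print(round(cross_entropy(probs, "banana"), 3))  # little mass on the truth: high loss
```

If "mat" is the true next token, the loss is small (≈0.357); if the rare "banana" had been correct, the loss would be large (≈2.303), triggering a bigger weight adjustment.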

3. Backpropagation — learning from mistakes

When the model predicts incorrectly, the loss signal flows backward through the network — a process called backpropagation. Each weight in the model receives a tiny nudge in the direction that would have made the prediction more accurate.1

No single adjustment is meaningful. But over trillions of examples, these nudges accumulate into the patterns we recognise as “knowledge.” The model learns:

  • Syntax: “The cat sat on the” is more likely than “Cat the on sat the”
  • Semantics: After “The doctor prescribed,” medical terms are more probable than cooking terms
  • Facts: “The capital of France is” strongly predicts “Paris”
  • Reasoning patterns: “If X is larger than Y, and Y is larger than Z, then X is larger than” predicts “Z”
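The nudging process can be illustrated with a deliberately tiny model: one weight, one prediction, repeated gradient steps. This is an assumption-laden toy (a single sigmoid "network" with a hand-derived gradient), not real backpropagation, which chains the same computation through billions of weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One "tuning peg": a single weight whose job is to assign high
# probability to the correct token.
w, lr = 0.0, 1.0
losses = []
for _ in range(5):
    p = sigmoid(w)               # predicted probability of the correct token
    losses.append(-math.log(p))  # cross-entropy loss for this prediction
    grad = p - 1.0               # dL/dw for this toy model
    w -= lr * grad               # one tiny nudge toward a better prediction
print([round(l, 3) for l in losses])  # loss shrinks with every nudge
```

Each step improves the prediction only slightly; it is the accumulation over many steps — trillions of them, across billions of weights, in real training — that produces competence.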

Think of it like...

Tuning a musical instrument by ear. You play a note, hear that it is slightly flat, and turn the tuning peg a fraction. Then you play again. After thousands of micro-adjustments, the instrument is in tune. Each individual turn was tiny, but together they produced something accurate. Pre-training adjusts billions of “tuning pegs” (weights) across trillions of “notes” (predictions).

4. Scale — why pre-training is expensive

Pre-training is the most computationally expensive part of building a language model. Training Llama 3 405B required approximately 30 million GPU-hours. At current cloud prices, this represents tens of millions of dollars in compute.3

| Resource | Scale |
| --- | --- |
| Training data | 1–15 trillion tokens |
| Parameters | 7B to 1T+ |
| Hardware | Thousands of GPUs running in parallel |
| Duration | Weeks to months |
| Cost | Millions to hundreds of millions of dollars |
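The cost claim above can be sanity-checked with back-of-envelope arithmetic. The GPU-hour count comes from the Llama 3 figure cited earlier; the per-hour price is an assumed cloud rate and varies widely in practice.

```python
gpu_hours = 30_000_000      # approximate figure cited for Llama 3 405B
usd_per_gpu_hour = 2.00     # assumed cloud rate; real contracts differ
cost = gpu_hours * usd_per_gpu_hour
print(f"${cost:,.0f}")      # on the order of tens of millions of dollars
```

Even at a generously discounted rate, the total lands in the tens of millions before counting data acquisition, staff, and failed runs.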

This cost is why only a handful of organisations can train frontier models from scratch. Most practitioners work with models that have already been pre-trained and apply fine-tuning to adapt them to specific tasks.4

5. What the base model can and cannot do

After pre-training, the model is a powerful text completion engine. Given any text prefix, it can continue it in a way that is syntactically correct, semantically coherent, and factually plausible (though not guaranteed to be accurate).2

What it can do:

  • Complete sentences and paragraphs naturally
  • Write in many languages and styles
  • Generate code that compiles
  • Reproduce patterns of reasoning found in training data

What it cannot do well:

  • Follow instructions (“Write me a poem about…”)
  • Answer questions directly (it might continue the text with another question)
  • Refuse harmful requests
  • Stay focused on a task

These limitations are addressed by fine-tuning and alignment (RLHF), which are the next stages of training.4


Why do we use it?

Key reasons

1. Knowledge acquisition. Pre-training is the stage where the model absorbs its knowledge of language, facts, and reasoning patterns. Without it, the model is random noise.1

2. Transfer learning. A single pre-trained model can be adapted to thousands of different tasks through fine-tuning. The broad knowledge acquired during pre-training provides a foundation that transfers to specific domains and tasks, without repeating the enormous cost.4

3. Emergent capabilities. At sufficient scale, pre-trained models develop abilities that were never explicitly taught: arithmetic, translation, code generation, abstract reasoning. These emerge from the patterns in language itself.2


When do we use it?

  • When understanding why models know what they know — the training data determines the model’s knowledge, biases, and blind spots
  • When choosing a model — the pre-training data composition (more code vs. more prose, more English vs. multilingual) shapes what the model is good at
  • When understanding the knowledge cutoff — the model only knows about events that occurred before its training data was collected
  • When evaluating model claims — the model does not recall facts; it reproduces patterns. If a “fact” appeared frequently in training data, the model will produce it confidently, whether it is true or not

Rule of thumb

If a model confidently states something you suspect is wrong, remember: pre-training taught it to produce probable text, not true text. Its “knowledge” is statistical, not verified.


How can I think about it?

The apprentice chef

Imagine an apprentice who spends years in a kitchen watching every chef cook, but is never given a specific recipe or instruction.

  • The kitchen = the training data (billions of examples of cooking)
  • Watching and imitating = predicting what the chef does next, adjusting when wrong
  • After years of observation, the apprentice can cook anything they saw — but if you ask them to “make me a birthday cake,” they might start making soup instead, because nobody told them to follow instructions
  • Fine-tuning = giving the apprentice specific recipes and instruction-following practice
  • RLHF = having tasters rank the apprentice’s dishes and adjusting based on preferences

The map from reading

Imagine building a map of a city by reading every guidebook, blog post, and review ever written about it — without ever visiting.

  • Your map = the model’s weights (a compressed representation of everything read)
  • Popular landmarks = well-documented facts that appear many times in training data, represented with high confidence
  • Obscure streets = rare information that appeared infrequently, represented with low confidence or incorrectly
  • The map is not the territory = the model’s “knowledge” is a statistical reconstruction, not direct experience
  • Outdated information = the map reflects what was written at training time; it does not update automatically

Concepts to explore next

| Concept | What it covers | Status |
| --- | --- | --- |
| fine-tuning | Adapting a pre-trained model to follow instructions and perform specific tasks | complete |
| next-token-prediction | The core objective that drives pre-training | complete |
| transformer-architecture | The neural network architecture that makes large-scale pre-training feasible | complete |

Where this concept fits

Position in the knowledge graph

graph TD
    AI[AI and Machine Learning] --> PT[Pre-Training]
    AI --> FT[Fine-Tuning]
    AI --> NTP[Next-Token Prediction]
    AI --> TRA[Transformer Architecture]
    NTP -.->|objective| PT
    TRA -.->|architecture| PT
    PT -->|produces base model| FT
    style PT fill:#4a9ede,color:#fff

Related concepts:

  • fine-tuning — the next training stage after pre-training, which teaches the model to follow instructions and behave helpfully
  • next-token-prediction — the training objective used during pre-training: predict the next token, adjust weights when wrong
  • embeddings — the embedding layer is one of the first things the model learns during pre-training, mapping tokens to meaningful vectors

Footnotes

  1. Starmorph. (2026). How Large Language Models Work: The Complete Technical Guide. Starmorph Blog.

  2. DataSci Ocean. (2025). The 3 Stages of LLM Training: A Deep Dive into RLHF. DataSci Ocean.

  3. Sierra Cloud. (2025). How Large Language Models Work: From tokenization to output. Sierra Cloud.

  4. Turing. (2026). What is Fine-Tuning LLM? Methods and Step-by-Step Guide. Turing.