Pre-Training

The first and most expensive stage of building a language model: feeding it billions of pages of text and training it to predict the next token, so that it absorbs the statistical patterns of language, knowledge, and reasoning.


What is it?

When a language model is created, its weights — the billions of numerical parameters that determine its behaviour — start as random numbers. The model knows nothing. It cannot distinguish “the” from “xyzzy.” It cannot predict what follows “The cat sat on the.”

Pre-training is the process that changes this. The model reads enormous amounts of text — books, websites, code repositories, scientific papers, conversations — and for each sequence of tokens, it tries to predict the next token. Every wrong prediction triggers a small adjustment to its weights. Over trillions of predictions, the model builds a statistical map of language: which words follow which, what patterns exist, how ideas connect, and what the world described by that text looks like.1

The result is a base model — a system that is extremely good at continuing text but has no concept of being helpful, safe, or conversational. Ask it a question and it might respond with another question, because that is what text on the internet often does. Turning a base model into something useful requires additional stages: fine-tuning and alignment.2

In plain terms

Pre-training is like a child reading every book in a library. The child does not memorise the books word for word, but after reading enough, they develop an intuition for how language works, what facts are commonly discussed, and what kinds of sentences make sense. The child has not been taught to answer questions yet — they have absorbed patterns.


How does it work?

1. The training data — what the model reads

Pre-training datasets are massive. GPT-3 was trained on approximately 300 billion tokens. Llama 3 was trained on over 15 trillion tokens. These datasets are assembled from:1

| Source | Why it matters |
| --- | --- |
| Web crawls (Common Crawl) | Broad coverage of topics and writing styles |
| Books | Long-form reasoning, narrative structure |
| Wikipedia | Factual knowledge, structured writing |
| Code repositories (GitHub) | Programming patterns, logical structure |
| Scientific papers | Technical vocabulary, formal reasoning |
| Conversations and forums | Informal language, question-answer patterns |

The data is cleaned to remove duplicates, filter low-quality pages, and balance domains. Data quality directly affects model quality — a model trained on noise produces noise.3
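The deduplication and quality-filtering step can be sketched in a few lines. This is a toy illustration, not a production pipeline: real systems use fuzzy deduplication (e.g. MinHash) and learned quality classifiers, and the `min_words` threshold here is purely illustrative.

```python
import hashlib

def clean_corpus(documents, min_words=50):
    """Exact-deduplicate and quality-filter a list of text documents.

    Toy sketch: hash each document to drop exact duplicates, then
    drop very short pages as a crude quality proxy.
    """
    seen = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        if len(doc.split()) < min_words:
            continue  # too short: likely a low-quality page
        kept.append(doc)
    return kept
```

Real pipelines layer many such filters; the principle is the same — every document that survives shapes what the model learns.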

2. The objective — predict the next token

The training objective is deceptively simple: given a sequence of tokens, predict the next one. This is called causal language modelling (or autoregressive pre-training).2

For the sentence “The cat sat on the mat”:

| Input (context) | Target (next token) |
| --- | --- |
| The | cat |
| The cat | sat |
| The cat sat | on |
| The cat sat on | the |
| The cat sat on the | mat |
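These training pairs can be generated mechanically: every position in a sequence yields one example. The sketch below uses word-level "tokens" for readability; real models operate on subword tokens.

```python
def make_examples(tokens):
    """Turn one token sequence into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Word-level tokens for readability; real tokenizers emit subword pieces.
for context, target in make_examples(["The", "cat", "sat", "on", "the", "mat"]):
    print(" ".join(context), "->", target)
```

One sentence of six tokens produces five training examples, which is part of why even "only" billions of documents yield trillions of predictions.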

The model processes the input and produces a probability distribution over its vocabulary. The training algorithm then computes the loss: a measure of how little probability the model assigned to the correct next token. A confident, correct prediction yields a low loss; a prediction that gave the true token little probability yields a high one.2
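The standard loss for this objective is cross-entropy: the negative log of the probability assigned to the true token. A minimal sketch, with an invented toy distribution standing in for the model's output:

```python
import math

def cross_entropy(probs, target):
    """Loss = -log(probability the model gave the correct token)."""
    return -math.log(probs[target])

# Toy distribution a model might output after "The cat sat on the"
probs = {"mat": 0.7, "dog": 0.2, "banana": 0.1}
print(round(cross_entropy(probs, "mat"), 3))     # confident and right: low loss
print(round(cross_entropy(probs, "banana"), 3))  # little mass on the truth: high loss
```

If "mat" is the true next token, the loss is small (≈0.357); if the rare "banana" had been correct, the loss would be large (≈2.303), triggering a bigger weight adjustment.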

3. Backpropagation — learning from mistakes

When the model predicts incorrectly, the loss signal flows backward through the network — a process called backpropagation. Each weight in the model receives a tiny nudge in the direction that would have made the prediction more accurate.1

No single adjustment is meaningful. But over trillions of examples, these nudges accumulate into the patterns we recognise as “knowledge.” The model learns:

  • Syntax: “The cat sat on the” is more likely than “Cat the on sat the”
  • Semantics: After “The doctor prescribed,” medical terms are more probable than cooking terms
  • Facts: “The capital of France is” strongly predicts “Paris”
  • Reasoning patterns: “If X is larger than Y, and Y is larger than Z, then X is larger than” predicts “Z”
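The nudging process can be illustrated with a deliberately tiny model: one weight, one prediction, repeated gradient steps. This is an assumption-laden toy (a single sigmoid "network" with a hand-derived gradient), not real backpropagation, which chains the same computation through billions of weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One "tuning peg": a single weight whose job is to assign high
# probability to the correct token.
w, lr = 0.0, 1.0
losses = []
for _ in range(5):
    p = sigmoid(w)               # predicted probability of the correct token
    losses.append(-math.log(p))  # cross-entropy loss for this prediction
    grad = p - 1.0               # dL/dw for this toy model
    w -= lr * grad               # one tiny nudge toward a better prediction
print([round(l, 3) for l in losses])  # loss shrinks with every nudge
```

Each step improves the prediction only slightly; it is the accumulation over many steps — trillions of them, across billions of weights, in real training — that produces competence.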

Think of it like...

Tuning a musical instrument by ear. You play a note, hear that it is slightly flat, and turn the tuning peg a fraction. Then you play again. After thousands of micro-adjustments, the instrument is in tune. Each individual turn was tiny, but together they produced something accurate. Pre-training adjusts billions of “tuning pegs” (weights) across trillions of “notes” (predictions).

4. Scale — why pre-training is expensive

Pre-training is the most computationally expensive part of building a language model. Training Llama 3 405B required approximately 30 million GPU-hours. At current cloud prices, this represents tens of millions of dollars in compute.3

| Resource | Scale |
| --- | --- |
| Training data | 1–15 trillion tokens |
| Parameters | 7B to 1T+ |
| Hardware | Thousands of GPUs running in parallel |
| Duration | Weeks to months |
| Cost | Millions to hundreds of millions of dollars |
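The cost claim above can be sanity-checked with back-of-envelope arithmetic. The GPU-hour count comes from the Llama 3 figure cited earlier; the per-hour price is an assumed cloud rate and varies widely in practice.

```python
gpu_hours = 30_000_000      # approximate figure cited for Llama 3 405B
usd_per_gpu_hour = 2.00     # assumed cloud rate; real contracts differ
cost = gpu_hours * usd_per_gpu_hour
print(f"${cost:,.0f}")      # on the order of tens of millions of dollars
```

Even at a generously discounted rate, the total lands in the tens of millions before counting data acquisition, staff, and failed runs.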

This cost is why only a handful of organisations can train frontier models from scratch. Most practitioners work with models that have already been pre-trained and apply fine-tuning to adapt them to specific tasks.4

5. What the base model can and cannot do

After pre-training, the model is a powerful text completion engine. Given any text prefix, it can continue it in a way that is syntactically correct, semantically coherent, and factually plausible (though not guaranteed to be accurate).2

What it can do:

  • Complete sentences and paragraphs naturally
  • Write in many languages and styles
  • Generate code that compiles
  • Reproduce patterns of reasoning found in training data

What it cannot do well:

  • Follow instructions (“Write me a poem about…”)
  • Answer questions directly (it might continue the text with another question)
  • Refuse harmful requests
  • Stay focused on a task

These limitations are addressed by fine-tuning and alignment (RLHF), which are the next stages of training.4


Why do we use it?

Key reasons

1. Knowledge acquisition. Pre-training is the stage where the model absorbs its knowledge of language, facts, and reasoning patterns. Without it, the model is random noise.1

2. Transfer learning. A single pre-trained model can be adapted to thousands of different tasks through fine-tuning. The broad knowledge acquired during pre-training provides a foundation that transfers to specific domains and tasks, without repeating the enormous cost.4

3. Emergent capabilities. At sufficient scale, pre-trained models develop abilities that were never explicitly taught: arithmetic, translation, code generation, abstract reasoning. These emerge from the patterns in language itself.2


When do we use it?

  • When understanding why models know what they know — the training data determines the model’s knowledge, biases, and blind spots
  • When choosing a model — the pre-training data composition (more code vs. more prose, more English vs. multilingual) shapes what the model is good at
  • When understanding the knowledge cutoff — the model only knows about events that occurred before its training data was collected
  • When evaluating model claims — the model does not recall facts; it reproduces patterns. If a “fact” appeared frequently in training data, the model will produce it confidently, whether it is true or not

Rule of thumb

If a model confidently states something you suspect is wrong, remember: pre-training taught it to produce probable text, not true text. Its “knowledge” is statistical, not verified.


How can I think about it?

The apprentice chef

Imagine an apprentice who spends years in a kitchen watching every chef cook, but is never given a specific recipe or instruction.

  • The kitchen = the training data (billions of examples of cooking)
  • Watching and imitating = predicting what the chef does next, adjusting when wrong
  • After years of observation, the apprentice can cook anything they saw — but if you ask them to “make me a birthday cake,” they might start making soup instead, because nobody told them to follow instructions
  • Fine-tuning = giving the apprentice specific recipes and instruction-following practice
  • RLHF = having tasters rank the apprentice’s dishes and adjusting based on preferences

The map from reading

Imagine building a map of a city by reading every guidebook, blog post, and review ever written about it — without ever visiting.

  • Your map = the model’s weights (a compressed representation of everything read)
  • Popular landmarks = well-documented facts that appear many times in training data, represented with high confidence
  • Obscure streets = rare information that appeared infrequently, represented with low confidence or incorrectly
  • The map is not the territory = the model’s “knowledge” is a statistical reconstruction, not direct experience
  • Outdated information = the map reflects what was written at training time; it does not update automatically

Concepts to explore next

| Concept | What it covers | Status |
| --- | --- | --- |
| fine-tuning | Adapting a pre-trained model to follow instructions and perform specific tasks | complete |
| next-token-prediction | The core objective that drives pre-training | complete |
| transformer-architecture | The neural network architecture that makes large-scale pre-training feasible | complete |

Where this concept fits

Position in the knowledge graph

graph TD
    AI[AI and Machine Learning] --> PT[Pre-Training]
    AI --> FT[Fine-Tuning]
    AI --> NTP[Next-Token Prediction]
    AI --> TRA[Transformer Architecture]
    NTP -.->|objective| PT
    TRA -.->|architecture| PT
    PT -->|produces base model| FT
    style PT fill:#4a9ede,color:#fff

Related concepts:

  • fine-tuning — the next training stage after pre-training, which teaches the model to follow instructions and behave helpfully
  • next-token-prediction — the training objective used during pre-training: predict the next token, adjust weights when wrong
  • embeddings — the embedding layer is one of the first things the model learns during pre-training, mapping tokens to meaningful vectors

Footnotes

  1. Starmorph. (2026). How Large Language Models Work: The Complete Technical Guide. Starmorph Blog.

  2. DataSci Ocean. (2025). The 3 Stages of LLM Training: A Deep Dive into RLHF. DataSci Ocean.

  3. Sierra Cloud. (2025). How Large Language Models Work: From tokenization to output. Sierra Cloud.

  4. Turing. (2026). What is Fine-Tuning LLM? Methods and Step-by-Step Guide. Turing.