Pre-Training
The first and most expensive stage of building a language model: feeding it billions of pages of text and training it to predict the next token, so that it absorbs the statistical patterns of language, knowledge, and reasoning.
What is it?
When a language model is created, its weights — the billions of numerical parameters that determine its behaviour — start as random numbers. The model knows nothing. It cannot distinguish “the” from “xyzzy.” It cannot predict what follows “The cat sat on the.”
Pre-training is the process that changes this. The model reads enormous amounts of text — books, websites, code repositories, scientific papers, conversations — and for each sequence of tokens, it tries to predict the next token. Every wrong prediction triggers a small adjustment to its weights. Over trillions of predictions, the model builds a statistical map of language: which words follow which, what patterns exist, how ideas connect, and what the world described by that text looks like.1
The result is a base model — a system that is extremely good at continuing text but has no concept of being helpful, safe, or conversational. Ask it a question and it might respond with another question, because that is what text on the internet often does. Turning a base model into something useful requires additional stages: fine-tuning and alignment.2
In plain terms
Pre-training is like a child reading every book in a library. The child does not memorise the books word for word, but after reading enough, they develop an intuition for how language works, what facts are commonly discussed, and what kinds of sentences make sense. The child has not been taught to answer questions yet — they have absorbed patterns.
At a glance
The pre-training loop
```mermaid
graph LR
    A[Training data] --> B[Take a sequence]
    B --> C[Model predicts next token]
    C --> D{Correct?}
    D -->|No| E[Compute loss]
    E --> F[Adjust weights slightly]
    F --> B
    D -->|Yes| B
    style E fill:#e74c3c,color:#fff
    style F fill:#4a9ede,color:#fff
```

Key: The model reads text sequences, predicts the next token, and adjusts its weights when wrong. This loop runs trillions of times across the entire training dataset.
How does it work?
1. The training data — what the model reads
Pre-training datasets are massive. GPT-3 was trained on approximately 300 billion tokens. Llama 3 was trained on over 15 trillion tokens. These datasets are assembled from:1
| Source | Why it matters |
|---|---|
| Web crawls (Common Crawl) | Broad coverage of topics and writing styles |
| Books | Long-form reasoning, narrative structure |
| Wikipedia | Factual knowledge, structured writing |
| Code repositories (GitHub) | Programming patterns, logical structure |
| Scientific papers | Technical vocabulary, formal reasoning |
| Conversations and forums | Informal language, question-answer patterns |
The data is cleaned to remove duplicates, filter low-quality pages, and balance domains. Data quality directly affects model quality — a model trained on noise produces noise.3
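To make the cleaning step concrete, here is a toy sketch of two common operations: exact deduplication and a crude quality heuristic. Real pipelines use fuzzy deduplication (e.g. MinHash) and learned quality classifiers; the function names and thresholds here are illustrative, not from any production system.

```python
# Toy sketch of two cleaning steps: exact deduplication by hashing,
# and a crude heuristic quality filter. Illustrative only.
import hashlib

def dedupe(documents):
    """Drop exact duplicates by hashing normalised text."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def looks_low_quality(doc, min_words=5):
    """Heuristic: too short, or mostly non-alphabetic noise."""
    words = doc.split()
    alpha_ratio = sum(w.isalpha() for w in words) / max(len(words), 1)
    return len(words) < min_words or alpha_ratio < 0.5

corpus = ["The cat sat on the mat.", "The cat sat on the mat.", "@@## $$ %%"]
cleaned = [d for d in dedupe(corpus) if not looks_low_quality(d)]
print(cleaned)  # ['The cat sat on the mat.']
```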
2. The objective — predict the next token
The training objective is deceptively simple: given a sequence of tokens, predict the next one. This is called causal language modelling (or autoregressive pre-training).2
For the sentence “The cat sat on the mat”:
| Input (context) | Target (next token) |
|---|---|
| The | cat |
| The cat | sat |
| The cat sat | on |
| The cat sat on | the |
| The cat sat on the | mat |
The model processes the input, produces a probability distribution over its vocabulary, and the training algorithm measures how much probability it assigned to the correct next token. The difference between the model’s prediction and the actual token is the loss.2
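A minimal sketch of this objective, assuming a toy six-token vocabulary and a hand-crafted probability distribution (real models output a distribution over vocabularies of 50,000+ tokens):

```python
# Build (context, target) pairs from a token sequence, then score one
# hand-crafted prediction. Vocabulary and probabilities are illustrative.
import math

vocab = ["The", "cat", "sat", "on", "the", "mat"]
tokens = [0, 1, 2, 3, 4, 5]  # "The cat sat on the mat" as token ids

# Every prefix is a training example: the input is tokens[:i],
# and the target is the token that actually comes next, tokens[i].
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# -> [([0], 1), ([0, 1], 2), ([0, 1, 2], 3), ([0, 1, 2, 3], 4), ...]

def cross_entropy(probs, target):
    """The loss is -log(probability assigned to the correct token)."""
    return -math.log(probs[target])

# Suppose the model assigns 70% probability to the correct token "mat":
probs = [0.05, 0.05, 0.05, 0.05, 0.10, 0.70]
print(round(cross_entropy(probs, target=5), 3))  # 0.357: low loss, good guess
```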
3. Backpropagation — learning from mistakes
When the model predicts incorrectly, the loss signal flows backward through the network — a process called backpropagation. Each weight in the model receives a tiny nudge in the direction that would have made the prediction more accurate.1
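Below is a minimal sketch of one such update step, written in PyTorch. The "model" here is just an embedding layer plus a linear layer standing in for a real transformer, but the loop structure is the same at any scale: forward pass, loss, backpropagation, weight update.

```python
# Minimal pre-training step in PyTorch. Embedding + Linear is a
# stand-in for a transformer; the update loop is what matters.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 6, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

# "The cat sat on the mat": each token's job is to predict the next one.
tokens = torch.tensor([0, 1, 2, 3, 4, 5])
inputs, targets = tokens[:-1], tokens[1:]

for step in range(200):
    logits = model(inputs)                   # scores over the vocabulary
    loss = F.cross_entropy(logits, targets)  # how wrong were the predictions?
    optimizer.zero_grad()
    loss.backward()                          # backpropagation: compute the nudges
    optimizer.step()                         # apply the tiny weight adjustments

print(f"final loss: {loss.item():.4f}")  # falls toward 0 as the toy model memorises
```

At real scale the same loop runs over trillions of tokens on thousands of GPUs; only the model and the data change.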
No single adjustment is meaningful. But over trillions of examples, these nudges accumulate into the patterns we recognise as “knowledge.” The model learns:
- Syntax: “The cat sat on the” is more likely than “Cat the on sat the”
- Semantics: After “The doctor prescribed,” medical terms are more probable than cooking terms
- Facts: “The capital of France is” strongly predicts “Paris”
- Reasoning patterns: “If X is larger than Y, and Y is larger than Z, then X is larger than” predicts “Z”
Think of it like...
Tuning a musical instrument by ear. You play a note, hear that it is slightly flat, and turn the tuning peg a fraction. Then you play again. After thousands of micro-adjustments, the instrument is in tune. Each individual turn was tiny, but together they produced something accurate. Pre-training adjusts billions of “tuning pegs” (weights) across trillions of “notes” (predictions).
4. Scale — why pre-training is expensive
Pre-training is the most computationally expensive part of building a language model. Training Llama 3 405B required approximately 30 million GPU-hours. At current cloud prices, this represents tens of millions of dollars in compute.3
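A back-of-envelope version of that arithmetic, assuming an illustrative rate of $2 per GPU-hour (actual cloud rates vary widely by provider and hardware; this is not a quoted price):

```python
# Rough cost estimate: GPU-hours x assumed price per GPU-hour.
gpu_hours = 30_000_000      # approximate figure reported for Llama 3 405B
usd_per_gpu_hour = 2.00     # assumed illustrative rate, not a quoted price
print(f"${gpu_hours * usd_per_gpu_hour:,.0f}")  # $60,000,000
```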
| Resource | Scale |
|---|---|
| Training data | 1-15 trillion tokens |
| Parameters | 7B to 1T+ |
| Hardware | Thousands of GPUs running in parallel |
| Duration | Weeks to months |
| Cost | Millions to hundreds of millions of dollars |
This cost is why only a handful of organisations can train frontier models from scratch. Most practitioners work with models that have already been pre-trained and apply fine-tuning to adapt them to specific tasks.4
5. What the base model can and cannot do
After pre-training, the model is a powerful text completion engine. Given any text prefix, it can continue it in a way that is syntactically correct, semantically coherent, and factually plausible (though not guaranteed to be accurate).2
What it can do:
- Complete sentences and paragraphs naturally
- Write in many languages and styles
- Generate code that compiles
- Reproduce patterns of reasoning found in training data
What it cannot do well:
- Follow instructions (“Write me a poem about…”)
- Answer questions directly (it might continue the text with another question)
- Refuse harmful requests
- Stay focused on a task
These limitations are addressed by fine-tuning and alignment (RLHF), which are the next stages of training.4
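You can observe this gap directly with a small base model. The sketch below uses the Hugging Face Transformers library and GPT-2, which was pre-trained but never instruction-tuned; the exact output varies between runs, but it typically continues the text rather than obeying the instruction.

```python
# Prompt a base model with an instruction and watch it continue the
# text instead of following it. GPT-2 is small and was never fine-tuned
# to be an assistant, so this illustrates raw base-model behaviour.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Write me a poem about the sea."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Typical result: more prose about the request or the sea, not a poem.
# This is the gap that fine-tuning and alignment close.
```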
Why do we use it?
Key reasons
1. Knowledge acquisition. Pre-training is the stage where the model absorbs its knowledge of language, facts, and reasoning patterns. Without it, the model is random noise.1
2. Transfer learning. A single pre-trained model can be adapted to thousands of different tasks through fine-tuning. The broad knowledge acquired during pre-training provides a foundation that transfers to specific domains and tasks, without repeating the enormous cost.4
3. Emergent capabilities. At sufficient scale, pre-trained models develop abilities that were never explicitly taught: arithmetic, translation, code generation, abstract reasoning. These emerge from the patterns in language itself.2
When do we use it?
- When understanding why models know what they know — the training data determines the model’s knowledge, biases, and blind spots
- When choosing a model — the pre-training data composition (more code vs. more prose, more English vs. multilingual) shapes what the model is good at
- When understanding the knowledge cutoff — the model only knows about events that occurred before its training data was collected
- When evaluating model claims — the model does not recall facts; it reproduces patterns. If a “fact” appeared frequently in training data, the model will produce it confidently, whether it is true or not
Rule of thumb
If a model confidently states something you suspect is wrong, remember: pre-training taught it to produce probable text, not true text. Its “knowledge” is statistical, not verified.
How can I think about it?
The apprentice chef
Imagine an apprentice who spends years in a kitchen watching every chef cook, but is never given a specific recipe or instruction.
- The kitchen = the training data (billions of examples of cooking)
- Watching and imitating = predicting what the chef does next, adjusting when wrong
- After years of observation, the apprentice can cook anything they saw — but if you ask them to “make me a birthday cake,” they might start making soup instead, because nobody told them to follow instructions
- Fine-tuning = giving the apprentice specific recipes and instruction-following practice
- RLHF = having tasters rank the apprentice’s dishes and adjusting based on preferences
The map from reading
Imagine building a map of a city by reading every guidebook, blog post, and review ever written about it — without ever visiting.
- Your map = the model’s weights (a compressed representation of everything read)
- Popular landmarks = well-documented facts that appear many times in training data, represented with high confidence
- Obscure streets = rare information that appeared infrequently, represented with low confidence or incorrectly
- The map is not the territory = the model’s “knowledge” is a statistical reconstruction, not direct experience
- Outdated information = the map reflects what was written at training time; it does not update automatically
Concepts to explore next
| Concept | What it covers | Status |
|---|---|---|
| fine-tuning | Adapting a pre-trained model to follow instructions and perform specific tasks | complete |
| next-token-prediction | The core objective that drives pre-training | complete |
| transformer-architecture | The neural network architecture that makes large-scale pre-training feasible | complete |
Check your understanding
Test yourself
- Explain what a base model can do after pre-training and what it cannot do. Why does it need further training?
- Describe the pre-training loop in four steps: data in, prediction, loss, weight update.
- Distinguish between the knowledge a model acquires during pre-training and the behaviour it acquires during fine-tuning. Give a concrete example of each.
- Interpret this scenario: a model confidently states that a historical event happened in 1987, but it actually happened in 1989. What does this tell you about how pre-training encodes facts?
- Connect pre-training data composition to model capabilities. Why would a model trained on more code perform better at programming tasks than one trained primarily on prose?
Where this concept fits
Position in the knowledge graph
```mermaid
graph TD
    AI[AI and Machine Learning] --> PT[Pre-Training]
    AI --> FT[Fine-Tuning]
    AI --> NTP[Next-Token Prediction]
    AI --> TRA[Transformer Architecture]
    NTP -.->|objective| PT
    TRA -.->|architecture| PT
    PT -->|produces base model| FT
    style PT fill:#4a9ede,color:#fff
```

Related concepts:
- fine-tuning — the next training stage after pre-training, which teaches the model to follow instructions and behave helpfully
- next-token-prediction — the training objective used during pre-training: predict the next token, adjust weights when wrong
- embeddings — the embedding layer is one of the first things the model learns during pre-training, mapping tokens to meaningful vectors
Sources
Further reading
- How Large Language Models Work: The Complete Technical Guide (Starmorph) — End-to-end reference covering pre-training, fine-tuning, and inference
- The 3 Stages of LLM Training (DataSci Ocean) — Clear overview of the pre-training → fine-tuning → RLHF pipeline
- The Illustrated Transformer (Jay Alammar) — Visual guide to the architecture that makes pre-training at scale possible
- Fine-Tuning LLMs: Complete Guide (SolGuruz) — Covers what happens after pre-training and how fine-tuning builds on the base model
Footnotes
1. Starmorph. (2026). How Large Language Models Work: The Complete Technical Guide. Starmorph Blog.
2. DataSci Ocean. (2025). The 3 Stages of LLM Training: A Deep Dive into RLHF. DataSci Ocean.
3. Sierra Cloud. (2025). How Large Language Models Work: From tokenization to output. Sierra Cloud.
4. Turing. (2026). What is Fine-Tuning LLM? Methods and Step-by-Step Guide. Turing.
