Inside the Machine: How Large Language Models Work

You type a sentence. The model completes it. Between those two events, text becomes numbers, numbers become probabilities, and a prediction emerges from billions of learned patterns. This path shows you what happens inside.


Who this is for

You use ChatGPT, Claude, or Copilot regularly. You get useful results, but the mechanism is opaque — you do not know why the model produces what it produces, what “tokens” actually are, or what it means when someone says a model has “70 billion parameters.”

This path is for you if:

  • You want accurate mental models of what happens between your prompt and the model’s response
  • You want to understand terms like tokenization, embeddings, attention, and temperature without needing a machine learning degree
  • You want to make better decisions about when to trust (and when to doubt) model output

What this article is NOT

This is not a machine learning course. You will not train a model, and the short code sketches along the way are optional illustrations, not exercises. This is a thinking framework for understanding the machinery — so you can reason about it, not just use it.


Part 1 — Text is not what the model sees

You type “The cat sat on the mat.” You see six words. The model sees something different: a sequence of numbers.

Language models cannot process letters or words directly. They work exclusively with numbers. So the first thing that happens to your text is tokenization — the process of breaking it into smaller units called tokens and assigning each one a numerical ID.1

A token is not always a word. Common words like “the” or “cat” are usually one token. Less common words get split into pieces. “Tokenization” itself might become [“Token”, “ization”] — two tokens. Rare or technical terms get broken into even smaller fragments.2
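
You can watch this happen. A minimal sketch in Python, assuming the open-source tiktoken library is installed (this uses the GPT-2 vocabulary; other models split the same text differently):

# Tokenizing with the tiktoken library (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")          # GPT-2's BPE vocabulary

ids = enc.encode("The cat sat on the mat.")
print(ids)                                   # [464, 3797, 3332, 319, 262, 2603, 13]
print([enc.decode([i]) for i in ids])        # ['The', ' cat', ' sat', ' on', ' the', ' mat', '.']

Notice that the spaces belong to the tokens: “ cat” with a leading space is a different token from “cat”.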

graph LR
    A[Your text] --> B[Tokenizer]
    B --> C[Token IDs]
    C --> D[Model]

    style B fill:#4a9ede,color:#fff

How the tokenizer learns to split

The dominant algorithm is byte-pair encoding (BPE). It works like this:3

  1. Start with every individual character as its own token
  2. Count which pairs of adjacent tokens appear most often in the training text
  3. Merge the most frequent pair into a single new token
  4. Repeat thousands of times

After enough merges, common words become single tokens, common subwords become single tokens, and only truly rare character sequences remain split into individual characters.
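
Here is the merge loop in miniature, run on a three-word corpus (a toy sketch of the algorithm, not a production tokenizer, which works on bytes and vastly larger data):

# Toy BPE: repeatedly fuse the most frequent adjacent pair of tokens.
from collections import Counter

tokens = list("low lower lowest")       # step 1: every character is a token

for _ in range(3):                      # step 4: repeat (thousands of times in practice)
    pairs = Counter(zip(tokens, tokens[1:]))    # step 2: count adjacent pairs
    a, b = pairs.most_common(1)[0][0]           # step 3: pick the most frequent pair
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b)        # fuse the pair into one new token
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    tokens = merged

print(tokens)   # ['low', ' low', 'e', 'r', ' low', 'e', 's', 't']
                # note the leading-space tokens, just like real vocabularies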

Why this matters for you

Tokenization explains several things you have noticed:

  • Why models struggle with character-level tasks (“count the r’s in strawberry”) — the model never sees individual letters, only token chunks2
  • Why longer prompts cost more — you pay per token, not per word
  • Why the same text uses different numbers of tokens on different models — each model has its own tokenizer with its own vocabulary

graph TD
    A[Characters] -->|frequent pairs merge| B[Subwords]
    B -->|frequent pairs merge| C[Common words]
    C -->|vocabulary complete| D[Tokenizer ready]

    style D fill:#5cb85c,color:#fff

Think of it like...

A shorthand system. A stenographer starts by writing every letter individually. Over time, she invents abbreviations for common combinations — “th”, “the”, “tion”. After years of practice, entire common phrases become single symbols. BPE does the same thing, but statistically and at scale.


Part 2 — Numbers all the way down

Tokenization gives the model a sequence of IDs — integers like [464, 3797, 3332, 319, 262, 2603]. But an integer is a label, not a meaning. The number 3797 does not tell the model anything about cats.

This is where embeddings come in. Each token ID is mapped to a vector — a long list of numbers (typically 1,000 to 12,000 of them) that represents that token’s meaning in mathematical space.4

graph LR
    A[Token ID: 3797] --> B[Embedding lookup]
    B --> C["Vector: [0.12, -0.84, 0.33, ...]"]

    style B fill:#4a9ede,color:#fff

The key property: tokens with similar meanings have similar vectors. “Cat” and “kitten” end up close together in this high-dimensional space. “Cat” and “mortgage” end up far apart.4
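
“Close” and “far” here are measurable. A minimal sketch using cosine similarity and invented 4-dimensional vectors (real embeddings have thousands of learned dimensions):

# Cosine similarity: compare the angle between two embedding vectors.
# The vectors below are invented for illustration; real ones are learned.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat      = [0.8, 0.1, 0.9, 0.2]     # hypothetical "cat" vector
kitten   = [0.7, 0.2, 0.8, 0.3]     # hypothetical "kitten" vector
mortgage = [-0.5, 0.9, -0.6, 0.8]   # hypothetical "mortgage" vector

print(cosine_similarity(cat, kitten))     # ~0.99: similar meanings, nearby vectors
print(cosine_similarity(cat, mortgage))   # negative: unrelated meanings, far apart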

These vectors are not hand-designed. Nobody decided which of the thousands of dimensions should encode “animal-ness” or “size.” The model learns these representations during training, and the meaning emerges from patterns in data — billions of sentences where “cat” and “kitten” appear in similar contexts.5

Why this matters for you

Embeddings explain why models understand synonyms, analogies, and even cross-lingual meaning. “Automobile” and “car” have nearby vectors despite sharing no characters. The model works in meaning-space, not word-space.

Go deeper

The embeddings concept card covers vector arithmetic, distance measures, and how embeddings power semantic search and RAG.


Part 3 — The prediction machine

Here is the core of how a language model generates text: it predicts the next token.6

Given the sequence “The cat sat on the”, the model does not “think” about what comes next. It computes a probability distribution — a list of every token in its vocabulary with a probability attached.7

Token      Probability
mat        0.18
floor      0.12
bed        0.09
table      0.07
roof       0.04

The model assigns a raw score (called a logit) to each token, then converts all scores into probabilities using a mathematical function called softmax. The probabilities sum to 1.0 — the model is distributing its confidence across the entire vocabulary.7
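
Softmax itself is short. A sketch with invented logits for the five tokens in the table above (a real model computes this over its entire vocabulary, so those five tokens would share only part of the total probability):

# Softmax: turn raw logit scores into probabilities that sum to 1.0.
import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]    # exponentiate: big scores grow fastest
    total = sum(exps)
    return [e / total for e in exps]        # normalise so everything sums to 1.0

logits = [2.0, 1.6, 1.3, 1.05, 0.5]         # invented scores: mat, floor, bed, table, roof
print([round(p, 2) for p in softmax(logits)])
# [0.36, 0.24, 0.18, 0.14, 0.08] over just these five tokens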

graph LR
    A[Input tokens] --> B[Model processes]
    B --> C[Logits - raw scores]
    C --> D[Softmax]
    D --> E[Probability distribution]
    E --> F[Sample next token]
    F -->|append and repeat| A

    style D fill:#4a9ede,color:#fff
    style F fill:#5cb85c,color:#fff

Temperature: controlling randomness

Temperature is a number that reshapes this probability distribution before the model picks a token.8 The sketch after this list shows the effect on a concrete distribution.

  • Low temperature (0.1-0.3): The highest-probability token dominates. Output becomes predictable and focused. “The cat sat on the mat” almost every time.
  • Temperature 1.0: The original distribution. A balanced mix of likely and less likely tokens.
  • High temperature (1.5-2.0): Probabilities flatten out. Less likely tokens get a real chance. Output becomes creative but potentially incoherent. “The cat sat on the chandelier.”
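
A minimal sketch, reusing the softmax and invented logits from the previous sketch:

# Temperature divides the logits before softmax, reshaping the distribution.
import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def with_temperature(logits, temperature):
    return softmax([x / temperature for x in logits])

logits = [2.0, 1.6, 1.3, 1.05, 0.5]    # invented scores: mat, floor, bed, table, roof
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 2) for p in with_temperature(logits, t)])

# 0.2 -> [0.85, 0.12, 0.03, 0.01, 0.0]   low: the top token dominates
# 1.0 -> [0.36, 0.24, 0.18, 0.14, 0.08]  the original distribution
# 2.0 -> [0.28, 0.23, 0.19, 0.17, 0.13]  high: flatter, riskier choices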

The prediction principle

A language model does not “know” anything. It assigns probabilities to every possible next token based on patterns it learned during training. Generation is repeated sampling from these distributions — one token at a time, hundreds of times per response.

Think of it like...

Autocomplete on your phone, but with a vocabulary of 100,000+ tokens and the ability to consider the entire preceding context, not just the last few words. Your phone might suggest three options. The model ranks every token in its vocabulary and picks one.


Part 4 — What parameters actually are

When someone says GPT-4 has “over a trillion parameters” or Llama 3 has “70 billion parameters,” what are they counting?

Parameters are numbers. Specifically, they are the numerical values (called weights) stored in the model’s neural network.9 During training, the model adjusts these weights to get better at predicting the next token. After training, the weights are frozen — they become the model’s learned knowledge.

A model with 70 billion parameters has 70 billion individual numbers, each one a small dial that influences how input tokens get transformed into output probabilities. These numbers are organised into layers of matrix operations — structured grids of values that transform vectors as they pass through the network.10

graph TD
    A[Input embeddings] --> B[Layer 1: transform]
    B --> C[Layer 2: transform]
    C --> D[...]
    D --> E[Layer N: transform]
    E --> F[Output probabilities]

    style A fill:#e8b84b,color:#fff
    style F fill:#5cb85c,color:#fff

Each layer applies its weight matrices to the vectors flowing through, gradually transforming the raw token embeddings into something rich enough to predict what comes next. A model with more parameters has more of these transformations — more capacity to capture patterns.10
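
To make “parameters are numbers in matrices” concrete, here is a back-of-the-envelope count for a single hypothetical feed-forward layer (the shapes are invented for illustration; real layers also carry attention weights, biases, and embeddings):

# Rough parameter count for one hypothetical feed-forward layer.
d_model = 4096                # width of each token vector
d_hidden = 16384              # width of the layer's intermediate representation

w_up = d_model * d_hidden     # matrix projecting vectors up:   4096 x 16384
w_down = d_hidden * d_model   # matrix projecting back down:    16384 x 4096
layer_params = w_up + w_down

n_layers = 80
print(f"{layer_params:,} weights per layer")          # 134,217,728
print(f"{layer_params * n_layers:,} across layers")   # ~10.7 billion, before attention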

Why this matters for you

Model size (parameter count) is a rough proxy for capacity, not intelligence. A 70B-parameter model can capture more nuanced patterns than a 7B model, but it also costs more to run, responds more slowly, and does not guarantee better results for simple tasks. The relationship between parameters and performance is not linear.


Part 5 — How the model learns

Training happens in stages, each one shaping the model for a different purpose.11

Stage 1: Pre-training

Pre-training is the expensive part. The model reads billions of pages of text — books, websites, code, articles — and learns to predict the next token. Every wrong prediction triggers a small adjustment to the weights. Over trillions of predictions, the model builds a statistical map of language: which words follow which, what patterns exist, how ideas connect.9

Pre-training produces a base model — one that is good at completing text but has no concept of being helpful, safe, or conversational. Ask it a question and it might continue with another question, because that is what text on the internet often does.11
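
You cannot run real pre-training on a laptop, but the statistical idea fits in a few lines. A toy sketch that “learns” which word follows which by counting, rather than by adjusting billions of weights:

# Toy "pre-training": build a statistical map of which word follows which.
# Real training adjusts weights by gradient descent; counting is the analogue.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat slept on the bed".split()

follows = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    follows[word][nxt] += 1          # every observed pair adjusts the map

def predict(word):
    counts = follows[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(predict("the"))   # {'cat': 0.5, 'mat': 0.25, 'bed': 0.25}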

graph LR
    A[Billions of text pages] --> B[Predict next token]
    B --> C[Wrong? Adjust weights]
    C --> B
    B --> D[Base model]

    style D fill:#4a9ede,color:#fff

Stage 2: Fine-tuning

Fine-tuning takes the base model and trains it further on a smaller, curated dataset of examples — typically question-answer pairs, instructions and responses, or domain-specific text. This teaches the model how to respond, not just how to continue text.12

After fine-tuning, the model understands that a question expects an answer, that instructions expect completion, and that conversation follows a turn-taking structure.

Stage 3: Alignment (RLHF)

Reinforcement Learning from Human Feedback (RLHF) is the stage that makes models helpful and safe. Human evaluators rank multiple model responses to the same prompt. These rankings train a reward model that scores responses. The language model is then adjusted to produce responses that score higher — responses humans judged as more helpful, more accurate, and less harmful.13
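
A common training objective for the reward model is pairwise: push the score of the response humans preferred above the score of the one they rejected. A minimal sketch of that loss (the scores themselves would come from a large neural network; here they are invented inputs):

# Pairwise preference loss: penalised when the rejected response outscores
# the chosen one. -log(sigmoid(chosen - rejected)) is near 0 when the
# ranking agrees with human judgement, and large when it is wrong.
import math

def preference_loss(score_chosen, score_rejected):
    return -math.log(1 / (1 + math.exp(-(score_chosen - score_rejected))))

print(round(preference_loss(2.0, -1.0), 3))   # 0.049: ranking matches the humans
print(round(preference_loss(-1.0, 2.0), 3))   # 3.049: ranking is wrong, big penalty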

graph TD
    A[Pre-training] -->|base model| B[Fine-tuning]
    B -->|instruction model| C[RLHF]
    C -->|aligned model| D[Deployed model]

    style A fill:#e8b84b,color:#fff
    style B fill:#4a9ede,color:#fff
    style C fill:#5cb85c,color:#fff
    style D fill:#9b59b6,color:#fff

The training principle

Pre-training gives the model knowledge. Fine-tuning gives it behaviour. RLHF gives it judgement. The model you interact with has been through all three — it is not raw prediction, but shaped prediction.


Part 6 — Attention: how the model connects ideas

The architecture that makes all of this work is called the Transformer, introduced in 2017.14 Its key mechanism is self-attention — the ability for every token to look at every other token and decide which ones matter for the current prediction.

Consider the sentence: “The animal didn’t cross the street because it was too tired.”

What does “it” refer to? The animal, or the street? You know instantly — the animal. But for a model processing tokens in sequence, this requires connecting “it” back to “animal” across several intervening tokens.15

Self-attention solves this. For each token, the model computes three vectors:14

  • Query: “What am I looking for?”
  • Key: “What do I contain?”
  • Value: “What information do I offer?”

The model compares each token’s Query against every other token’s Key to produce an attention score — how relevant is that other token? High-scoring tokens contribute more of their Value to the current token’s representation.15
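
Here is the core calculation in miniature, with tiny invented vectors (in a real model, learned projection matrices produce the Query, Key, and Value vectors; here they are hand-written to keep the sketch short):

# Scaled dot-product attention for one token ("it") over three others.
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

query_it = [1.0, 0.5]                        # what "it" is looking for
keys   = {"animal": [1.0, 0.4], "street": [0.1, -0.3], "tired": [0.6, 0.5]}
values = {"animal": [0.9, 0.9], "street": [0.1, 0.2], "tired": [0.5, 0.7]}

d_k = len(query_it)
scores = [sum(q * k for q, k in zip(query_it, keys[w])) / math.sqrt(d_k)
          for w in keys]                     # Query . Key, scaled
weights = softmax(scores)
print(dict(zip(keys, [round(w, 2) for w in weights])))
# {'animal': 0.46, 'street': 0.19, 'tired': 0.36}: "animal" matters most

# The token's new representation: the attention-weighted mix of Values
output = [sum(w * values[word][i] for w, word in zip(weights, keys))
          for i in range(d_k)]
print([round(x, 2) for x in output])         # [0.61, 0.7]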

graph TD
    A[it] -->|high attention| B[animal]
    A -->|low attention| C[street]
    A -->|low attention| D[cross]
    A -->|medium attention| E[tired]

    style A fill:#4a9ede,color:#fff
    style B fill:#5cb85c,color:#fff

Multi-head attention

The model does not run one attention calculation. It runs many in parallel — called heads. Each head can specialise: one might track grammatical relationships (subject-verb), another might track semantic relationships (animal-tired), another might track positional proximity.15

The outputs of all heads are combined, giving the model a rich, multi-perspective view of how every token relates to every other token.
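
In shape terms, each head works on its own slice of the same vectors (hypothetical sizes, for intuition only):

# Multi-head attention in shape terms: split, attend per head, recombine.
d_model, n_heads = 4096, 32         # hypothetical sizes
d_head = d_model // n_heads         # 128 dimensions per head

# Each head runs the Query/Key/Value comparison on its own 128-dimensional
# slice of every token's vector; the 32 head outputs are then concatenated
# back into one 4096-dimensional vector per token.
print(f"{n_heads} heads x {d_head} dims each = {n_heads * d_head} dims total")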

Think of it like...

A spotlight on a stage. When processing the word “it,” the model shines spotlights on every other word in the sentence. “Animal” gets a bright spotlight (high attention). “Street” gets a dim one. Multiple spotlights (heads) can illuminate different relationships simultaneously — one for grammar, one for meaning, one for proximity.


Part 7 — The full picture

From prompt to response, every step is a numerical transformation:

graph TD
    A[Your text] --> B[Tokenizer: text to token IDs]
    B --> C[Embedding: IDs to vectors]
    C --> D[Transformer layers]
    D --> E[Self-attention:<br/>tokens attend to each other]
    D --> F[Feed-forward:<br/>transform representations]
    E --> G[Output layer]
    F --> G
    G --> H[Probability distribution<br/>over vocabulary]
    H --> I[Sample next token]
    I -->|append and repeat| B

    style D fill:#4a9ede,color:#fff
    style H fill:#e8b84b,color:#fff
    style I fill:#5cb85c,color:#fff
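
The loop in the diagram fits in a few lines. A toy sketch that reuses the word-counting “model” from Part 5 to generate text (a real model conditions on the entire context and samples from a full softmax distribution, not just the last word’s counts):

# End-to-end toy generation: predict a distribution, sample, append, repeat.
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat slept on the bed".split()
follows = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    follows[word][nxt] += 1                     # toy stand-in for trained weights

def generate(prompt, n_tokens):
    tokens = prompt.split()                     # stand-in for real tokenization
    for _ in range(n_tokens):
        counts = follows[tokens[-1]]            # real models use the whole context
        if not counts:
            break
        next_token = random.choices(list(counts), weights=counts.values())[0]
        tokens.append(next_token)               # append and repeat
    return " ".join(tokens)

print(generate("the cat", 6))                   # e.g. "the cat sat on the mat the"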

What you now understand

Mental models you have gained

  • Tokenization — text is split into subword tokens by BPE before the model sees it; the model works with numbers, not words
  • Embeddings — each token becomes a high-dimensional vector that encodes meaning; similar meanings sit close together in vector space
  • Next-token prediction — the model assigns a probability to every token in its vocabulary and samples the next one; generation is this process repeated
  • Temperature — a dial that reshapes the probability distribution: low for focused output, high for creative output
  • Parameters — billions of numerical weights that the model learned during training; more parameters means more capacity, not necessarily better results
  • Three-stage training — pre-training (knowledge), fine-tuning (behaviour), RLHF (alignment)
  • Self-attention — the mechanism that lets every token consider every other token, enabling the model to connect ideas across long distances


Where to go next

I want to learn how to use LLMs effectively

Now that you understand the machinery, learn how to work with it: prompting, context engineering, and harness engineering. Read llm-engineering.

Best for: People who want to move from understanding to practice.

I want to understand AI systems architecture

LLMs are one component. If you want to understand how agents, orchestration, routing, and pipelines fit together, read agentic-design.

Best for: People building AI-powered products or workflows.

I want to understand the software underneath

LLMs run on software infrastructure. If terms like API, frontend, backend, and database are still unclear, start with from-zero-to-building.

Best for: People who want the full stack picture.


Footnotes

  1. Trott, S. (2024). Tokenization in large language models, explained. Sean Trott’s Substack.

  2. Sierra Cloud. (2025). How Large Language Models Work: From tokenization to output. Sierra Cloud.

  3. Langformers. (2025). BPE Tokenizer: Training and Tokenization Explained. Langformers Blog.

  4. PractiqAI. (2025). Embeddings in Plain English. PractiqAI.

  5. Allen, C. and Hospedales, T. (2019). King - man + woman = queen: the hidden algebraic structure of words. University of Edinburgh School of Informatics.

  6. Wolfe, C. (2025). Next token prediction thread. X/Twitter.

  7. Cohen, M. (2025). LLM breakdown 2/6: Logits and next-token prediction. Mike X Cohen’s Substack.

  8. IBM. (2025). What is LLM Temperature?. IBM.

  9. Starmorph. (2026). How Large Language Models Work: The Complete Technical Guide. Starmorph Blog.

  10. Infosys. (2025). Unlocking the Language of AI: Why Tokens and Embeddings are the Secret Sauce of LLMs. Infosys Blog.

  11. DataSci Ocean. (2025). The 3 Stages of LLM Training: A Deep Dive into RLHF. DataSci Ocean.

  12. Turing. (2026). What is Fine-Tuning LLM? Methods and Step-by-Step Guide. Turing.

  13. AWS. (2025). Fine-tune large language models with reinforcement learning from human or AI feedback. AWS Machine Learning Blog.

  14. Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017.

  15. Alammar, J. (2018). The Illustrated Transformer. Jay Alammar.