Transformer Architecture
The neural network architecture behind modern language models, built around a mechanism called self-attention that allows every token to consider every other token when computing its meaning.
What is it?
Before 2017, language models processed text sequentially — one word at a time, left to right, like reading a sentence through a keyhole that slides one word forward per step. This made them slow and forgetful. By the time the model reached the end of a long sentence, it had a faded memory of the beginning.1
The Transformer, introduced in the paper “Attention Is All You Need” by Vaswani et al. at Google, replaced this sequential approach with a fundamentally different idea: let every token look at every other token, all at once. Instead of reading through a keyhole, the model sees the entire sentence simultaneously and decides which words are relevant to which.1
This mechanism — called self-attention — is the core innovation. It allows the model to connect distant words (“The animal didn’t cross the street because it was too tired” — what does “it” refer to?), handle long-range dependencies, and process all tokens in parallel during training. Every major language model today — GPT, Claude, Llama, Gemini — is a Transformer.2
In plain terms
A Transformer is a neural network that reads an entire text at once and figures out which words should pay attention to which other words. Instead of processing words one by one like a typewriter, it processes them all simultaneously like a reader whose eyes can jump to any part of the page.
At a glance
Inside a Transformer block
```mermaid
graph TD
    A[Token embeddings] --> B[Self-Attention]
    B --> C[Add and Normalize]
    C --> D[Feed-Forward Network]
    D --> E[Add and Normalize]
    E --> F[Output representations]
    style B fill:#4a9ede,color:#fff
    style D fill:#e8b84b,color:#fff
```

Key: Each Transformer block has two main components: a self-attention layer (where tokens interact with each other) and a feed-forward network (where each token is independently transformed). Modern LLMs stack dozens of these blocks on top of each other.
How does it work?
1. Self-attention — the core mechanism
Self-attention answers one question for every token: “Which other tokens in this sequence are relevant to me right now?”2
For each token, the model computes three vectors from the token’s embedding:1
| Vector | Role | Analogy |
|---|---|---|
| Query (Q) | “What am I looking for?” | A question the token asks |
| Key (K) | “What do I contain?” | A label describing the token |
| Value (V) | “What information do I offer?” | The actual content to pass along |
The model computes an attention score between every pair of tokens by taking the dot product of one token’s Query with another token’s Key (scaled down by the square root of the key dimension to keep the values stable). High scores mean high relevance. The scores are passed through softmax to become weights that sum to 1.0. Each token’s output is then a weighted combination of all tokens’ Values, where the weights reflect relevance.2
```mermaid
graph LR
    A[Token: it] -->|Q vs K| B[Token: animal]
    A -->|Q vs K| C[Token: street]
    A -->|Q vs K| D[Token: tired]
    B -->|high score| E[Strong contribution]
    C -->|low score| F[Weak contribution]
    D -->|medium score| G[Moderate contribution]
    style A fill:#4a9ede,color:#fff
    style E fill:#5cb85c,color:#fff
```
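To make the mechanics concrete, here is a minimal single-head self-attention sketch in NumPy. The matrices and dimensions are illustrative placeholders rather than values from any real model; the structure (project to Q, K, V, compare Queries against Keys, softmax, take a weighted sum of Values) follows the description above.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of token embeddings X.

    X: (seq_len, d_model) token embeddings
    W_q, W_k, W_v: (d_model, d_head) projection matrices (learned in a real model)
    """
    Q = X @ W_q                           # "What am I looking for?"
    K = X @ W_k                           # "What do I contain?"
    V = X @ W_v                           # "What information do I offer?"

    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)    # every token's Query vs every token's Key

    # softmax turns each row of scores into weights that sum to 1.0
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ V                    # each output is a relevance-weighted mix of Values

# Toy example: 4 tokens, 8-dimensional embeddings, 8-dimensional head
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (4, 8): one refreshed representation per token
```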
Think of it like...
A spotlight on a dark stage. When the model processes the word “it,” it shines a spotlight on every other word in the sentence. “Animal” gets a bright beam (high attention). “Street” gets a dim flicker (low attention). The brightness is computed, not predetermined — the model learned which words to attend to during training.
2. Multi-head attention — multiple perspectives
The model does not run just one attention calculation. It runs many in parallel — typically 32 to 128 per layer in large models — called attention heads. Each head has its own set of Q, K, V weight matrices and can learn to focus on different kinds of relationships:3
- One head might track grammatical roles (subject, verb, object)
- Another might track semantic similarity (animal and tired are conceptually linked)
- Another might track positional proximity (nearby words influence each other)
The outputs of all heads are concatenated and combined through a linear transformation. This gives the model a rich, multi-angle understanding of how every token relates to every other token.3
Example: different heads attend to different things
For the sentence “The cat sat on the mat because it was comfortable”:
| Head | What it attends to | Connection |
|---|---|---|
| Head 1 | “it” → “cat” | Pronoun resolution |
| Head 3 | “sat” → “on” → “mat” | Spatial relationship |
| Head 7 | “comfortable” → “mat” | Property attribution |
| Head 12 | “cat” → “sat” | Subject-verb agreement |

No single head captures the full meaning. Together, they build a comprehensive representation.
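Continuing the NumPy sketch from the self-attention section (it reuses the self_attention function defined there), this is roughly how the heads are wired together: each head attends independently, the results are concatenated, and a learned linear transformation mixes them. The head count and dimensions are arbitrary toy values.

```python
import numpy as np

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per attention head.
    W_o: (num_heads * d_head, d_model) output projection."""
    # Each head runs its own attention calculation in parallel
    head_outputs = [self_attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    # Concatenate the heads, then mix them with a learned linear transformation
    return np.concatenate(head_outputs, axis=-1) @ W_o

rng = np.random.default_rng(1)
d_model, d_head, num_heads, seq_len = 8, 4, 2, 4
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(num_heads)]
W_o = rng.normal(size=(num_heads * d_head, d_model))
print(multi_head_attention(X, heads, W_o).shape)   # (4, 8)
```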
3. Feed-forward networks — processing each token
After self-attention mixes information between tokens, a feed-forward network (FFN) processes each token independently. This is where much of the model’s factual knowledge is stored — the FFN layers act as a compressed database of patterns learned during training.4
The FFN is a simple structure: two matrix multiplications with a non-linear activation function in between. But repeated across dozens of layers, these transformations gradually refine each token’s representation from a shallow embedding into something rich enough to predict the next token accurately.4
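As a sketch, the position-wise FFN amounts to a handful of lines. The dimensions below are toy values, and the ReLU stands in for whatever activation a given model actually uses (GELU, SwiGLU, and similar variants are common in modern LLMs).

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network, applied to each token independently.
    x: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)."""
    hidden = np.maximum(0, x @ W1 + b1)   # expand and apply a non-linearity (ReLU here)
    return hidden @ W2 + b2               # project back down to d_model

rng = np.random.default_rng(2)
d_model, d_ff, seq_len = 8, 32, 4         # d_ff is typically several times d_model
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)   # (4, 8): same shape, refined content
```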
4. Stacking layers — depth creates understanding
A modern LLM is not one Transformer block. It is dozens stacked on top of each other:2
| Model | Layers | Parameters |
|---|---|---|
| GPT-2 | 12 | 117M |
| GPT-3 | 96 | 175B |
| Llama 3 70B | 80 | 70B |
Each layer refines the token representations further. Early layers capture surface patterns (syntax, common phrases). Middle layers capture semantic relationships (meaning, context). Late layers capture task-specific patterns (answering questions, following instructions).4
```mermaid
graph TD
    A[Token embeddings] --> B[Layer 1: surface patterns]
    B --> C[Layer 20: syntax]
    C --> D[Layer 40: semantics]
    D --> E[Layer 60: reasoning]
    E --> F[Layer 80: task-specific]
    F --> G[Output probabilities]
    style A fill:#e8b84b,color:#fff
    style G fill:#5cb85c,color:#fff
```
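Putting the pieces together, here is a hedged sketch of one block and a stack of them, reusing the functions from the earlier snippets. Each sub-layer is wrapped in the “Add and Normalize” step from the diagram at the top of the page; details such as causal masking, dropout, and whether normalisation comes before or after the sub-layer (pre-norm vs post-norm) vary between models and are glossed over here.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def transformer_block(x, attn_params, ffn_params):
    """One block: self-attention, then a feed-forward network, each with a residual connection."""
    heads, W_o = attn_params
    x = layer_norm(x + multi_head_attention(x, heads, W_o))   # "Add and Normalize"
    x = layer_norm(x + feed_forward(x, *ffn_params))          # "Add and Normalize"
    return x

def transformer(x, blocks):
    """Stack many identical blocks; each one refines the token representations further."""
    for attn_params, ffn_params in blocks:
        x = transformer_block(x, attn_params, ffn_params)
    return x
```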
5. Positional encoding — where tokens are
Self-attention has no built-in notion of order. If you shuffled all the tokens, the attention scores would be the same (the mechanism only compares Q and K vectors, not positions). To preserve word order, Transformers add positional encodings — vectors that encode each token’s position in the sequence — to the input embeddings.1
Many modern models use rotary positional embeddings (RoPE), which encode relative distances between tokens rather than absolute positions. This can help models generalise to sequence lengths longer than those they were trained on.5
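For a concrete instance, this is the sinusoidal encoding from the original paper: each position gets a fixed pattern of sine and cosine values that is added to the token embeddings before the first block. RoPE works differently (it rotates the Q and K vectors by position-dependent angles) and is not shown here.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Positional encodings from "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

# The encodings are simply added to the token embeddings before the first block:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```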
Why do we use it?
Key reasons
1. Parallel processing. Unlike sequential models, Transformers process all tokens simultaneously during training. This makes them dramatically faster to train on modern GPU hardware.1
2. Long-range connections. Self-attention lets any token attend to any other token, regardless of distance. The word “it” at position 50 can directly access the word “animal” at position 3, with no information loss from intervening tokens.2
3. Scalability. Transformers scale predictably: more layers, more heads, more parameters consistently produce better results. This scaling property is why the AI field has converged on this architecture.4
When do we use it?
- When building or choosing a language model — every major LLM uses this architecture
- When understanding model capabilities — knowing about attention helps explain why models handle context well but can still miss connections in very long texts
- When interpreting model behaviour — attention patterns can be visualised to see what the model is “looking at” when generating each token
- When understanding compute costs — self-attention scales quadratically with sequence length, which is why longer context windows are more expensive (a rough illustration follows below)
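A back-of-envelope sketch of that last point: the attention score matrix has one entry per pair of tokens, so doubling the context length quadruples the number of scores. The model shape and byte counts below are illustrative assumptions, and optimised kernels such as FlashAttention avoid storing the full matrix, but the underlying quadratic amount of work remains.

```python
# One attention score per pair of tokens: seq_len x seq_len entries per head, per layer.
for seq_len in (1_000, 8_000, 128_000):
    scores = seq_len ** 2
    # Illustrative assumption: 32 layers x 32 heads, 2 bytes per fp16 score
    gib = scores * 32 * 32 * 2 / 1024 ** 3
    print(f"{seq_len:>7} tokens -> {scores:>14,} scores per head, ~{gib:,.1f} GiB if all were stored")
```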
Rule of thumb
If you are interacting with a modern LLM, you are using a Transformer. Understanding the architecture helps you reason about what the model can and cannot do — especially around long contexts, word order, and ambiguity resolution.
How can I think about it?
The conference room discussion
Imagine a meeting where every participant can hear every other participant simultaneously (not taking turns).
- Each person (token) has something to contribute (their Value)
- Each person asks a question (their Query): “Who has information relevant to me?”
- Each person holds up a card (their Key) describing what they know
- Attention scores = how closely each person’s card matches each question
- The discussion = everyone absorbs relevant information from everyone else, in parallel
- Multiple discussion groups (heads) happen simultaneously, each focusing on a different topic
- After the meeting, each person has a richer understanding than they started with — this is the output representation
The web of string
Picture a sentence written on a wall, with each word on its own card. Now stretch coloured strings between the cards.
- Thick strings connect words that are highly relevant to each other (“cat” ↔ “sat”, “it” ↔ “animal”)
- Thin strings connect words with weak relevance (“the” ↔ “tired”)
- Different colours represent different attention heads — red strings for grammar, blue for meaning, green for proximity
- The pattern of strings is not fixed; it is recomputed at every layer, becoming more refined as the model goes deeper
- The final pattern encodes everything the model understands about the sentence
Concepts to explore next
| Concept | What it covers | Status |
|---|---|---|
| next-token-prediction | What the Transformer’s output layer produces: probability distributions over the vocabulary | complete |
| pre-training | How the Transformer learns its weights by predicting the next token over billions of text pages | complete |
| embeddings | The vector representations that self-attention operates on | complete |
Check your understanding
Test yourself
- Explain the self-attention mechanism in your own words. What problem does it solve that sequential processing cannot?
- Describe the roles of Query, Key, and Value vectors. How do they work together to compute attention scores?
- Distinguish between single-head and multi-head attention. Why does the model benefit from running multiple attention calculations in parallel?
- Interpret this observation: a Transformer with 80 layers performs better than the same architecture with 12 layers, even on simple tasks. What does this tell you about what deeper layers contribute?
- Connect the Transformer architecture to the concept of context windows. Why does self-attention become computationally expensive as the input sequence gets longer?
Where this concept fits
Position in the knowledge graph
```mermaid
graph TD
    AI[AI and Machine Learning] --> TRA[Transformer Architecture]
    AI --> TOK[Tokenization]
    AI --> EMB[Embeddings]
    AI --> NTP[Next-Token Prediction]
    AI --> PT[Pre-Training]
    TOK -.->|feeds| TRA
    EMB -.->|feeds| TRA
    TRA -.->|computes| NTP
    style TRA fill:#4a9ede,color:#fff
```

Related concepts:
- next-token-prediction — the Transformer’s output feeds into the next-token prediction process; the architecture computes the probability distribution
- pre-training — the Transformer’s weights are learned during pre-training through billions of next-token predictions
- llm-pipelines — in production systems, the Transformer is one component within a larger pipeline of processing stages
Sources
Further reading
Resources
- The Illustrated Transformer (Jay Alammar) — The most widely cited visual explanation of the Transformer, with step-by-step diagrams of self-attention
- Transformer Explainer (Georgia Tech / Polo Club) — Interactive browser-based visualisation that lets you step through a Transformer processing text
- Attention in Transformers, Visually Explained (3Blue1Brown) — Grant Sanderson’s video walkthrough with strong geometric intuitions
- Attention Is All You Need (Vaswani et al., 2017) — The original paper that introduced the Transformer architecture
- Transformer Architecture Explained (Codecademy) — Beginner-friendly reference with clear diagrams
Footnotes
1. Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017.
2. Alammar, J. (2018). The Illustrated Transformer. Jay Alammar.
3. Codecademy. (2025). Transformer Architecture Explained With Self-Attention Mechanism. Codecademy.
4. Starmorph. (2026). How Large Language Models Work: The Complete Technical Guide. Starmorph Blog.
5. Infante, R. (2025). An Intuitive Overview of the Transformer Architecture. Medium.
