Transformer Architecture
The neural network architecture behind modern language models, built around a mechanism called self-attention that allows every token to consider every other token when computing its meaning.
What is it?
Before 2017, language models processed text sequentially — one word at a time, left to right, like reading a sentence through a keyhole that slides one word forward per step. This made them slow and forgetful. By the time the model reached the end of a long sentence, it had a faded memory of the beginning.1
The Transformer, introduced in the paper “Attention Is All You Need” by Vaswani et al. at Google, replaced this sequential approach with a fundamentally different idea: let every token look at every other token, all at once. Instead of reading through a keyhole, the model sees the entire sentence simultaneously and decides which words are relevant to which.1
This mechanism — called self-attention — is the core innovation. It allows the model to connect distant words (“The animal didn’t cross the street because it was too tired” — what does “it” refer to?), handle long-range dependencies, and process all tokens in parallel during training. Every major language model today — GPT, Claude, Llama, Gemini — is a Transformer.2
In plain terms
A Transformer is a neural network that reads an entire text at once and figures out which words should pay attention to which other words. Instead of processing words one by one like a typewriter, it processes them all simultaneously like a reader whose eyes can jump to any part of the page.
At a glance
Inside a Transformer block
```mermaid
graph TD
    A[Token embeddings] --> B[Self-Attention]
    B --> C[Add and Normalize]
    C --> D[Feed-Forward Network]
    D --> E[Add and Normalize]
    E --> F[Output representations]
    style B fill:#4a9ede,color:#fff
    style D fill:#e8b84b,color:#fff
```

Key: Each Transformer block has two main components: a self-attention layer (where tokens interact with each other) and a feed-forward network (where each token is independently transformed). Modern LLMs stack dozens of these blocks on top of each other.
How does it work?
1. Self-attention — the core mechanism
Self-attention answers one question for every token: “Which other tokens in this sequence are relevant to me right now?”2
For each token, the model computes three vectors from the token’s embedding:1
| Vector | Role | Analogy |
|---|---|---|
| Query (Q) | “What am I looking for?” | A question the token asks |
| Key (K) | “What do I contain?” | A label describing the token |
| Value (V) | “What information do I offer?” | The actual content to pass along |
The model computes an attention score between every pair of tokens by taking the dot product of one token’s Query with another token’s Key (scaled down by the square root of the key dimension to keep the values stable). High scores mean high relevance. The scores are passed through softmax to become weights that sum to 1.0. Each token’s output is then a weighted combination of all tokens’ Values, where the weights reflect relevance.2
```mermaid
graph LR
    A[Token: it] -->|Q vs K| B[Token: animal]
    A -->|Q vs K| C[Token: street]
    A -->|Q vs K| D[Token: tired]
    B -->|high score| E[Strong contribution]
    C -->|low score| F[Weak contribution]
    D -->|medium score| G[Moderate contribution]
    style A fill:#4a9ede,color:#fff
    style E fill:#5cb85c,color:#fff
```
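To make the mechanics concrete, here is a minimal single-head self-attention sketch in NumPy. The matrices and dimensions are illustrative placeholders rather than values from any real model; the structure (project to Q, K, V, compare Queries against Keys, softmax, take a weighted sum of Values) follows the description above.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of token embeddings X.

    X: (seq_len, d_model) token embeddings
    W_q, W_k, W_v: (d_model, d_head) projection matrices (learned in a real model)
    """
    Q = X @ W_q                           # "What am I looking for?"
    K = X @ W_k                           # "What do I contain?"
    V = X @ W_v                           # "What information do I offer?"

    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)    # every token's Query vs every token's Key

    # softmax turns each row of scores into weights that sum to 1.0
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ V                    # each output is a relevance-weighted mix of Values

# Toy example: 4 tokens, 8-dimensional embeddings, 8-dimensional head
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (4, 8): one refreshed representation per token
```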
Think of it like...
A spotlight on a dark stage. When the model processes the word “it,” it shines a spotlight on every other word in the sentence. “Animal” gets a bright beam (high attention). “Street” gets a dim flicker (low attention). The brightness is computed, not predetermined — the model learned which words to attend to during training.
2. Multi-head attention — multiple perspectives
The model does not run just one attention calculation. It runs many in parallel — typically 32 to 128 per layer in large models — called attention heads. Each head has its own set of Q, K, V weight matrices and can learn to focus on different kinds of relationships:3
- One head might track grammatical roles (subject, verb, object)
- Another might track semantic similarity (animal and tired are conceptually linked)
- Another might track positional proximity (nearby words influence each other)
The outputs of all heads are concatenated and combined through a linear transformation. This gives the model a rich, multi-angle understanding of how every token relates to every other token.3
Example: different heads attend to different things
For the sentence “The cat sat on the mat because it was comfortable”:
| Head | What it attends to | Connection |
|---|---|---|
| Head 1 | “it” → “cat” | Pronoun resolution |
| Head 3 | “sat” → “on” → “mat” | Spatial relationship |
| Head 7 | “comfortable” → “mat” | Property attribution |
| Head 12 | “cat” → “sat” | Subject-verb agreement |

No single head captures the full meaning. Together, they build a comprehensive representation.
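Continuing the NumPy sketch from the self-attention section (it reuses the self_attention function defined there), this is roughly how the heads are wired together: each head attends independently, the results are concatenated, and a learned linear transformation mixes them. The head count and dimensions are arbitrary toy values.

```python
import numpy as np

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per attention head.
    W_o: (num_heads * d_head, d_model) output projection."""
    # Each head runs its own attention calculation in parallel
    head_outputs = [self_attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    # Concatenate the heads, then mix them with a learned linear transformation
    return np.concatenate(head_outputs, axis=-1) @ W_o

rng = np.random.default_rng(1)
d_model, d_head, num_heads, seq_len = 8, 4, 2, 4
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(num_heads)]
W_o = rng.normal(size=(num_heads * d_head, d_model))
print(multi_head_attention(X, heads, W_o).shape)   # (4, 8)
```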
3. Feed-forward networks — processing each token
After self-attention mixes information between tokens, a feed-forward network (FFN) processes each token independently. This is where much of the model’s factual knowledge is stored — the FFN layers act as a compressed database of patterns learned during training.4
The FFN is a simple structure: two matrix multiplications with a non-linear activation function in between. But repeated across dozens of layers, these transformations gradually refine each token’s representation from a shallow embedding into something rich enough to predict the next token accurately.4
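As a sketch, the position-wise FFN amounts to a handful of lines. The dimensions below are toy values, and the ReLU stands in for whatever activation a given model actually uses (GELU, SwiGLU, and similar variants are common in modern LLMs).

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network, applied to each token independently.
    x: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)."""
    hidden = np.maximum(0, x @ W1 + b1)   # expand and apply a non-linearity (ReLU here)
    return hidden @ W2 + b2               # project back down to d_model

rng = np.random.default_rng(2)
d_model, d_ff, seq_len = 8, 32, 4         # d_ff is typically several times d_model
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)   # (4, 8): same shape, refined content
```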
4. Stacking layers — depth creates understanding
A modern LLM is not one Transformer block. It is dozens stacked on top of each other:2
| Model | Layers | Parameters |
|---|---|---|
| GPT-2 | 12 | 117M |
| GPT-3 | 96 | 175B |
| Llama 3 70B | 80 | 70B |
Each layer refines the token representations further. Early layers capture surface patterns (syntax, common phrases). Middle layers capture semantic relationships (meaning, context). Late layers capture task-specific patterns (answering questions, following instructions).4
```mermaid
graph TD
    A[Token embeddings] --> B[Layer 1: surface patterns]
    B --> C[Layer 20: syntax]
    C --> D[Layer 40: semantics]
    D --> E[Layer 60: reasoning]
    E --> F[Layer 80: task-specific]
    F --> G[Output probabilities]
    style A fill:#e8b84b,color:#fff
    style G fill:#5cb85c,color:#fff
```
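Putting the pieces together, here is a hedged sketch of one block and a stack of them, reusing the functions from the earlier snippets. Each sub-layer is wrapped in the “Add and Normalize” step from the diagram at the top of the page; details such as causal masking, dropout, and whether normalisation comes before or after the sub-layer (pre-norm vs post-norm) vary between models and are glossed over here.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def transformer_block(x, attn_params, ffn_params):
    """One block: self-attention, then a feed-forward network, each with a residual connection."""
    heads, W_o = attn_params
    x = layer_norm(x + multi_head_attention(x, heads, W_o))   # "Add and Normalize"
    x = layer_norm(x + feed_forward(x, *ffn_params))          # "Add and Normalize"
    return x

def transformer(x, blocks):
    """Stack many identical blocks; each one refines the token representations further."""
    for attn_params, ffn_params in blocks:
        x = transformer_block(x, attn_params, ffn_params)
    return x
```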
5. Positional encoding — where tokens are
Self-attention has no built-in notion of order. If you shuffled all the tokens, the attention scores would be the same (the mechanism only compares Q and K vectors, not positions). To preserve word order, Transformers add positional encodings — vectors that encode each token’s position in the sequence — to the input embeddings.1
Many modern models use rotary positional embeddings (RoPE), which encode relative distances between tokens rather than absolute positions. This can help models generalise to sequence lengths longer than those they were trained on.5
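For a concrete instance, this is the sinusoidal encoding from the original paper: each position gets a fixed pattern of sine and cosine values that is added to the token embeddings before the first block. RoPE works differently (it rotates the Q and K vectors by position-dependent angles) and is not shown here.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Positional encodings from "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

# The encodings are simply added to the token embeddings before the first block:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```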
Why do we use it?
Key reasons
1. Parallel processing. Unlike sequential models, Transformers process all tokens simultaneously during training. This makes them dramatically faster to train on modern GPU hardware.1
2. Long-range connections. Self-attention lets any token attend to any other token, regardless of distance. The word “it” at position 50 can directly access the word “animal” at position 3, with no information loss from intervening tokens.2
3. Scalability. Transformers scale predictably: more layers, more heads, more parameters consistently produce better results. This scaling property is why the AI field has converged on this architecture.4
When do we use it?
- When building or choosing a language model — every major LLM uses this architecture
- When understanding model capabilities — knowing about attention helps explain why models handle context well but can still miss connections in very long texts
- When interpreting model behaviour — attention patterns can be visualised to see what the model is “looking at” when generating each token
- When understanding compute costs — self-attention scales quadratically with sequence length, which is why longer context windows are more expensive (a rough illustration follows below)
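A back-of-envelope sketch of that last point: the attention score matrix has one entry per pair of tokens, so doubling the context length quadruples the number of scores. The model shape and byte counts below are illustrative assumptions, and optimised kernels such as FlashAttention avoid storing the full matrix, but the underlying quadratic amount of work remains.

```python
# One attention score per pair of tokens: seq_len x seq_len entries per head, per layer.
for seq_len in (1_000, 8_000, 128_000):
    scores = seq_len ** 2
    # Illustrative assumption: 32 layers x 32 heads, 2 bytes per fp16 score
    gib = scores * 32 * 32 * 2 / 1024 ** 3
    print(f"{seq_len:>7} tokens -> {scores:>14,} scores per head, ~{gib:,.1f} GiB if all were stored")
```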
Rule of thumb
If you are interacting with a modern LLM, you are using a Transformer. Understanding the architecture helps you reason about what the model can and cannot do — especially around long contexts, word order, and ambiguity resolution.
How can I think about it?
The conference room discussion
Imagine a meeting where every participant can hear every other participant simultaneously (not taking turns).
- Each person (token) has something to contribute (their Value)
- Each person asks a question (their Query): “Who has information relevant to me?”
- Each person holds up a card (their Key) describing what they know
- Attention scores = how closely each person’s card matches each question
- The discussion = everyone absorbs relevant information from everyone else, in parallel
- Multiple discussion groups (heads) happen simultaneously, each focusing on a different topic
- After the meeting, each person has a richer understanding than they started with — this is the output representation
The web of string
Picture a sentence written on a wall, with each word on its own card. Now stretch coloured strings between the cards.
- Thick strings connect words that are highly relevant to each other (“cat” ↔ “sat”, “it” ↔ “animal”)
- Thin strings connect words with weak relevance (“the” ↔ “tired”)
- Different colours represent different attention heads — red strings for grammar, blue for meaning, green for proximity
- The pattern of strings is not fixed; it is recomputed at every layer, becoming more refined as the model goes deeper
- The final pattern encodes everything the model understands about the sentence
Concepts to explore next
| Concept | What it covers | Status |
|---|---|---|
| next-token-prediction | What the Transformer’s output layer produces: probability distributions over the vocabulary | complete |
| pre-training | How the Transformer learns its weights by predicting the next token over billions of text pages | complete |
| embeddings | The vector representations that self-attention operates on | complete |
Check your understanding
Test yourself
- Explain the self-attention mechanism in your own words. What problem does it solve that sequential processing cannot?
- Describe the roles of Query, Key, and Value vectors. How do they work together to compute attention scores?
- Distinguish between single-head and multi-head attention. Why does the model benefit from running multiple attention calculations in parallel?
- Interpret this observation: a Transformer with 80 layers performs better than the same architecture with 12 layers, even on simple tasks. What does this tell you about what deeper layers contribute?
- Connect the Transformer architecture to the concept of context windows. Why does self-attention become computationally expensive as the input sequence gets longer?
Where this concept fits
Position in the knowledge graph
```mermaid
graph TD
    AI[AI and Machine Learning] --> TRA[Transformer Architecture]
    AI --> TOK[Tokenization]
    AI --> EMB[Embeddings]
    AI --> NTP[Next-Token Prediction]
    AI --> PT[Pre-Training]
    TOK -.->|feeds| TRA
    EMB -.->|feeds| TRA
    TRA -.->|computes| NTP
    style TRA fill:#4a9ede,color:#fff
```

Related concepts:
- next-token-prediction — the Transformer’s output feeds into the next-token prediction process; the architecture computes the probability distribution
- pre-training — the Transformer’s weights are learned during pre-training through billions of next-token predictions
- llm-pipelines — in production systems, the Transformer is one component within a larger pipeline of processing stages
Sources
Further reading
Resources
- The Illustrated Transformer (Jay Alammar) — The most widely cited visual explanation of the Transformer, with step-by-step diagrams of self-attention
- Transformer Explainer (Georgia Tech / Polo Club) — Interactive browser-based visualisation that lets you step through a Transformer processing text
- Attention in Transformers, Visually Explained (3Blue1Brown) — Grant Sanderson’s video walkthrough with strong geometric intuitions
- Attention Is All You Need (Vaswani et al., 2017) — The original paper that introduced the Transformer architecture
- Transformer Architecture Explained (Codecademy) — Beginner-friendly reference with clear diagrams
Footnotes
1. Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017.
2. Alammar, J. (2018). The Illustrated Transformer. Jay Alammar.
3. Codecademy. (2025). Transformer Architecture Explained With Self-Attention Mechanism. Codecademy.
4. Starmorph. (2026). How Large Language Models Work: The Complete Technical Guide. Starmorph Blog.
5. Infante, R. (2025). An Intuitive Overview of the Transformer Architecture. Medium.
