Transformer Architecture

The neural network architecture behind modern language models, built around a mechanism called self-attention that allows every token to consider every other token when computing its meaning.


What is it?

Before 2017, language models processed text sequentially — one word at a time, left to right, like reading a sentence through a keyhole that slides one word forward per step. This made them slow and forgetful. By the time the model reached the end of a long sentence, it had a faded memory of the beginning.1

The Transformer, introduced in the paper “Attention Is All You Need” by Vaswani et al. at Google, replaced this sequential approach with a fundamentally different idea: let every token look at every other token, all at once. Instead of reading through a keyhole, the model sees the entire sentence simultaneously and decides which words are relevant to which.1

This mechanism — called self-attention — is the core innovation. It allows the model to connect distant words (“The animal didn’t cross the street because it was too tired” — what does “it” refer to?), handle long-range dependencies, and process all tokens in parallel during training. Every major language model today — GPT, Claude, Llama, Gemini — is a Transformer.2

In plain terms

A Transformer is a neural network that reads an entire text at once and figures out which words should pay attention to which other words. Instead of processing words one by one like a typewriter, it processes them all simultaneously like a reader whose eyes can jump to any part of the page.


How does it work?

1. Self-attention — the core mechanism

Self-attention answers one question for every token: “Which other tokens in this sequence are relevant to me right now?”2

For each token, the model computes three vectors from the token’s embedding:1

| Vector | Role | Analogy |
| --- | --- | --- |
| Query (Q) | “What am I looking for?” | A question the token asks |
| Key (K) | “What do I contain?” | A label describing the token |
| Value (V) | “What information do I offer?” | The actual content to pass along |

The model computes an attention score between every pair of tokens by comparing one token’s Query against another token’s Key. High scores mean high relevance. The scores are scaled down by the square root of the Key dimension (which keeps the softmax from saturating) and then passed through softmax to become weights that sum to 1.0. Each token’s output is then a weighted combination of all tokens’ Values, where the weights reflect relevance.2

```mermaid
graph LR
    A[Token: it] -->|Q vs K| B[Token: animal]
    A -->|Q vs K| C[Token: street]
    A -->|Q vs K| D[Token: tired]
    B -->|high score| E[Strong contribution]
    C -->|low score| F[Weak contribution]
    D -->|medium score| G[Moderate contribution]

    style A fill:#4a9ede,color:#fff
    style E fill:#5cb85c,color:#fff
```
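
To make the arithmetic concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The function and weight names (`self_attention`, `W_q`, `W_k`, `W_v`) are illustrative rather than from any particular library, and real implementations operate on batches with learned weights rather than random ones.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # Project each token embedding into its three roles.
    Q = X @ W_q   # Query:  "what am I looking for?"
    K = X @ W_k   # Key:    "what do I contain?"
    V = X @ W_v   # Value:  "what information do I offer?"

    # Compare every Query against every Key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # shape: (seq_len, seq_len)

    # Softmax turns each row of scores into weights that sum to 1.0.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each output row is a relevance-weighted blend of all Values.
    return weights @ V

# Toy usage: 4 tokens, embedding size 8, head size 4, random weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (4, 4)
```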

Think of it like...

A spotlight on a dark stage. When the model processes the word “it,” it shines a spotlight on every other word in the sentence. “Animal” gets a bright beam (high attention). “Street” gets a dim flicker (low attention). The brightness is computed, not predetermined — the model learned which words to attend to during training.

2. Multi-head attention — multiple perspectives

The model does not run one attention calculation. It runs many in parallel — typically 32 to 128 in large modern models, versus 8 in the original paper — called attention heads. Each head has its own set of Q, K, V weight matrices and can learn to focus on different kinds of relationships:3

  • One head might track grammatical roles (subject, verb, object)
  • Another might track semantic similarity (animal and tired are conceptually linked)
  • Another might track positional proximity (nearby words influence each other)

The outputs of all heads are concatenated and combined through a linear transformation. This gives the model a rich, multi-angle understanding of how every token relates to every other token.3
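
Continuing the sketch above (it reuses `self_attention`, `rng`, and `X`), multi-head attention is little more than running several independent heads and merging their outputs; `heads` and `W_o` are illustrative names:

```python
def multi_head_attention(X, heads, W_o):
    # Each head has its own Q, K, V projections and runs independently...
    head_outputs = [self_attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    # ...then the outputs are concatenated and mixed by one linear layer.
    return np.concatenate(head_outputs, axis=-1) @ W_o

# Toy usage: 2 heads of size 4, recombined into an 8-dimensional output.
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
W_o = rng.normal(size=(8, 8))
print(multi_head_attention(X, heads, W_o).shape)   # (4, 8)
```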

3. Feed-forward networks — processing each token

After self-attention mixes information between tokens, a feed-forward network (FFN) processes each token independently. This is where much of the model’s factual knowledge is stored — the FFN layers act as a compressed database of patterns learned during training.4

The FFN is a simple structure: two matrix multiplications with a non-linear activation function in between. But repeated across dozens of layers, these transformations gradually refine each token’s representation from a shallow embedding into something rich enough to predict the next token accurately.4
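
In the same sketch style, the FFN fits in a few lines; the ReLU activation and the 4x hidden expansion below are common choices, not universal ones:

```python
def feed_forward(x, W1, b1, W2, b2):
    # Expand each token to a wider hidden dimension, apply a
    # non-linearity, then project back down. Applied per token.
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU; modern models often use GELU or SwiGLU
    return hidden @ W2 + b2

# Toy usage: model width 8, hidden width 32 (a 4x expansion).
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
print(feed_forward(X, W1, b1, W2, b2).shape)   # (4, 8)
```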

4. Stacking layers — depth creates understanding

A modern LLM is not one Transformer block. It is dozens stacked on top of each other:2

| Model | Layers | Parameters |
| --- | --- | --- |
| GPT-2 | 12 | 117M |
| GPT-3 | 96 | 175B |
| Llama 3 70B | 80 | 70B |

Each layer refines the token representations further. Early layers capture surface patterns (syntax, common phrases). Middle layers capture semantic relationships (meaning, context). Late layers capture task-specific patterns (answering questions, following instructions).4

```mermaid
graph TD
    A[Token embeddings] --> B[Layer 1: surface patterns]
    B --> C[Layer 20: syntax]
    C --> D[Layer 40: semantics]
    D --> E[Layer 60: reasoning]
    E --> F[Layer 80: task-specific]
    F --> G[Output probabilities]

    style A fill:#e8b84b,color:#fff
    style G fill:#5cb85c,color:#fff
```
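
Tying the sketches together, a stack is just the attention and FFN functions applied repeatedly with residual connections. Layer normalisation, which real blocks also apply, is omitted here for brevity, and all parameters are random placeholders:

```python
def transformer_block(x, attn_params, ffn_params):
    heads, W_o = attn_params
    x = x + multi_head_attention(x, heads, W_o)   # tokens exchange information
    W1, b1, W2, b2 = ffn_params
    x = x + feed_forward(x, W1, b1, W2, b2)       # each token refined independently
    return x

def transformer_stack(x, blocks):
    # A modern LLM applies dozens of these blocks in sequence; the residual
    # connections let each block refine, rather than replace, its input.
    for attn_params, ffn_params in blocks:
        x = transformer_block(x, attn_params, ffn_params)
    return x

# Toy usage: an 8-layer stack built from random parameters.
def random_block():
    heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
    attn = (heads, rng.normal(size=(8, 8)))
    ffn = (rng.normal(size=(8, 32)), np.zeros(32), rng.normal(size=(32, 8)), np.zeros(8))
    return attn, ffn

print(transformer_stack(X, [random_block() for _ in range(8)]).shape)   # (4, 8)
```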

5. Positional encoding — where tokens are

Self-attention has no built-in notion of order. If you shuffled all the tokens, the attention scores would be the same (the mechanism only compares Q and K vectors, not positions). To preserve word order, Transformers add positional encodings — vectors that encode each token’s position in the sequence — to the input embeddings.1
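
The original paper’s sinusoidal scheme is compact enough to sketch here. Each pair of dimensions oscillates at a different wavelength, so every position gets a unique, smoothly varying signature; the function name is illustrative and the sketch assumes an even `d_model`:

```python
def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]    # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]   # even dimension indices
    angles = positions / (10000 ** (dims / d_model))
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)         # even dims: sine
    encoding[:, 1::2] = np.cos(angles)         # odd dims: cosine
    return encoding

# Added element-wise to the token embeddings before the first layer.
X_positioned = X + sinusoidal_positional_encoding(4, 8)
```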

Modern models use rotary positional embeddings (RoPE), which encode relative distances between tokens rather than absolute positions. This allows models to generalise to sequence lengths longer than they were trained on.5


Why do we use it?

Key reasons

1. Parallel processing. Unlike sequential models, Transformers process all tokens simultaneously during training. This makes them dramatically faster to train on modern GPU hardware.1

2. Long-range connections. Self-attention lets any token attend to any other token, regardless of distance. The word “it” at position 50 can directly access the word “animal” at position 3, with no information loss from intervening tokens.2

3. Scalability. Transformers scale predictably: more layers, more heads, more parameters consistently produce better results. This scaling property is why the AI field has converged on this architecture.4


When do we use it?

  • When building or choosing a language model — every major LLM uses this architecture
  • When understanding model capabilities — knowing about attention helps explain why models handle context well but can still miss connections in very long texts
  • When interpreting model behaviour — attention patterns can be visualised to see what the model is “looking at” when generating each token
  • When understanding compute costs — self-attention scales quadratically with sequence length, which is why longer context windows are more expensive
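
To put that quadratic cost in numbers: a 4,000-token context produces an attention matrix with 4,000 × 4,000 = 16 million score entries per head per layer, while a 128,000-token context produces 128,000 × 128,000 ≈ 16.4 billion, roughly a 1,000× increase for a 32× longer input.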

Rule of thumb

If you are interacting with a modern LLM, you are using a Transformer. Understanding the architecture helps you reason about what the model can and cannot do — especially around long contexts, word order, and ambiguity resolution.


How can I think about it?

The conference room discussion

Imagine a meeting where every participant can hear every other participant simultaneously (not taking turns).

  • Each person (token) has something to contribute (their Value)
  • Each person asks a question (their Query): “Who has information relevant to me?”
  • Each person holds up a card (their Key) describing what they know
  • Attention scores = how closely each person’s card matches each question
  • The discussion = everyone absorbs relevant information from everyone else, in parallel
  • Multiple discussion groups (heads) happen simultaneously, each focusing on a different topic
  • After the meeting, each person has a richer understanding than they started with — this is the output representation

The web of string

Picture a sentence written on a wall, with each word on its own card. Now stretch coloured strings between the cards.

  • Thick strings connect words that are highly relevant to each other (“cat” ↔ “sat”, “it” ↔ “animal”)
  • Thin strings connect words with weak relevance (“the” ↔ “tired”)
  • Different colours represent different attention heads — red strings for grammar, blue for meaning, green for proximity
  • The pattern of strings is not fixed; it is recomputed at every layer, becoming more refined as the model goes deeper
  • The final pattern encodes everything the model understands about the sentence

Concepts to explore next

| Concept | What it covers | Status |
| --- | --- | --- |
| next-token-prediction | What the Transformer’s output layer produces: probability distributions over the vocabulary | complete |
| pre-training | How the Transformer learns its weights by predicting the next token over billions of text pages | complete |
| embeddings | The vector representations that self-attention operates on | complete |


Where this concept fits

Position in the knowledge graph

```mermaid
graph TD
    AI[AI and Machine Learning] --> TRA[Transformer Architecture]
    AI --> TOK[Tokenization]
    AI --> EMB[Embeddings]
    AI --> NTP[Next-Token Prediction]
    AI --> PT[Pre-Training]
    TOK -.->|feeds| TRA
    EMB -.->|feeds| TRA
    TRA -.->|computes| NTP
    style TRA fill:#4a9ede,color:#fff
```

Related concepts:

  • next-token-prediction — the Transformer’s output feeds into the next-token prediction process; the architecture computes the probability distribution
  • pre-training — the Transformer’s weights are learned during pre-training through billions of next-token predictions
  • llm-pipelines — in production systems, the Transformer is one component within a larger pipeline of processing stages

Sources


Footnotes

  1. Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017.

  2. Alammar, J. (2018). The Illustrated Transformer. Jay Alammar.

  3. Codecademy. (2025). Transformer Architecture Explained With Self-Attention Mechanism. Codecademy.

  4. Starmorph. (2026). How Large Language Models Work: The Complete Technical Guide. Starmorph Blog.

  5. Infante, R. (2025). An Intuitive Overview of the Transformer Architecture. Medium.