LLM Pipelines

A multi-step workflow where a language model receives input, processes it through a sequence of stages, and produces output — turning a single interaction into a structured chain of transformations.


What is it?

When most people first use a large language model, they type a single prompt and get a single response. That works for simple tasks — summarising a paragraph, answering a factual question, drafting a short email. But as soon as the task involves multiple phases (research, then analysis, then writing), multiple concerns (accuracy, tone, format), or multiple tools (search, database, code execution), a single prompt-response exchange starts to break down.1

An LLM pipeline is the alternative. Instead of asking the model to do everything in one shot, you decompose the task into discrete stages, where each stage has a focused job and passes its output to the next stage as input. The model runs multiple times — once per stage — with each run receiving only the context it needs for that particular step.2

This is not a new idea. Software engineering has used pipelines for decades (Unix pipes, ETL processes, CI/CD pipelines). The LLM version applies the same principle: break a complex process into small, composable, independently testable steps. What makes LLM pipelines distinctive is that the “processing” at each stage is done by a language model — reasoning, generating, evaluating, transforming — rather than by deterministic code alone.3

The parent concept, agentic-systems, frames LLM pipelines as one of two main building blocks (alongside orchestration). Where orchestration concerns how agents are coordinated and managed, pipelines concern the internal structure of how work flows through a system — the sequence of transformations that turns a request into a result.

In plain terms

A single prompt is like asking someone one question and expecting a perfect answer. A pipeline is like giving someone a checklist: first gather the facts, then analyse them, then write up the findings, then check for errors. Each step is simpler, and you can verify the result before moving on.



How does it work?

LLM pipelines are built from a small set of patterns that can be combined. Anthropic’s research on building effective agents identifies these as the foundational workflow patterns that underpin most production systems.2

1. Prompt chaining — the sequential pattern

The simplest pipeline is a straight line: the output of one LLM call becomes the input of the next. Each step focuses on a single sub-task, and you can insert programmatic checks (gates) between steps to validate that the process stays on track.2

For example, a content creation pipeline might look like:

| Step | Task | Output |
| --- | --- | --- |
| 1 | Research and extract key facts from source material | Structured fact list |
| 2 | Generate an outline based on the facts | Document outline |
| 3 | Gate: check that the outline covers all required topics | Pass/fail decision |
| 4 | Write the full draft from the outline | Draft document |
| 5 | Review and edit for tone, accuracy, and style | Final document |

Each step is easier for the model than doing everything at once. And critically, each intermediate output is an inspectable artifact — if step 4 produces a bad draft, you can examine the outline from step 2 to see whether the problem originated there.4
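The five-step chain above can be sketched as plain functions, with the gate as ordinary code between model calls. This is a minimal, runnable sketch: `call_llm` is a stub standing in for a real model API, and the required topics are illustrative.

```python
# Prompt-chaining sketch. `call_llm` is a placeholder for a real model
# API call; here it echoes its prompt so the control flow is runnable.

REQUIRED_TOPICS = {"pricing", "features"}  # illustrative gate criteria

def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call a model API here.
    return f"[stubbed output for: {prompt}]"

def extract_facts(source: str) -> str:
    return call_llm(f"Extract the key facts from:\n{source}")

def make_outline(facts: str) -> str:
    return call_llm(f"Outline a document covering pricing and features, based on:\n{facts}")

def outline_gate(outline: str) -> bool:
    # Programmatic gate: a cheap, deterministic check between LLM calls.
    return all(topic in outline.lower() for topic in REQUIRED_TOPICS)

def run_pipeline(source: str) -> str:
    facts = extract_facts(source)
    outline = make_outline(facts)
    if not outline_gate(outline):
        raise ValueError("Gate failed: outline is missing required topics")
    draft = call_llm(f"Write a full draft from this outline:\n{outline}")
    return call_llm(f"Edit for tone and accuracy:\n{draft}")
```

Note that the gate is not an LLM call at all: because each intermediate output is an inspectable artifact, plain code can validate it before the next stage spends tokens on a flawed input.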

Think of it like...

An assembly line in a factory. Each station has one job (cut, weld, paint, inspect). No station tries to build the entire product. The product gains value at each station, and quality checks between stations catch defects early — before more work is wasted on a flawed component.

Concept to explore

See context-cascading for how context is layered from general to specific across pipeline stages, ensuring each stage receives the right information without overloading the model’s context window.


2. Routing — the branching pattern

Not every input should follow the same path. A routing step classifies the input and directs it to a specialised downstream handler. This allows different types of requests to receive different treatment without one handler trying to be good at everything.2

For example: a customer service pipeline might classify incoming messages as billing questions, technical issues, or general inquiries, then route each to a stage with specialised instructions and tools for that category.

Routing can be implemented through rule-based logic (keyword matching, regex), LLM-based classification (asking the model to categorise the input), or a hybrid of both.5
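As a sketch of the rule-based half of that hybrid, the routing step below matches keywords and dispatches to a registered handler, falling through to a general handler when no rule fires (where a hybrid system would invoke an LLM classifier instead). Handler names and keyword lists are illustrative.

```python
# Rule-based routing sketch: classify an incoming message by keyword,
# then dispatch to a specialised handler.
import re

HANDLERS = {}

def handler(category):
    """Register a function as the handler for a routing category."""
    def register(fn):
        HANDLERS[category] = fn
        return fn
    return register

@handler("billing")
def handle_billing(msg):
    return f"billing desk: {msg}"

@handler("technical")
def handle_technical(msg):
    return f"tech support: {msg}"

@handler("general")
def handle_general(msg):
    return f"general inbox: {msg}"

# Ordered rules: first match wins.
RULES = [
    (re.compile(r"invoice|refund|charge", re.I), "billing"),
    (re.compile(r"error|crash|bug", re.I), "technical"),
]

def route(msg: str) -> str:
    for pattern, category in RULES:
        if pattern.search(msg):
            return HANDLERS[category](msg)
    # No rule matched: a hybrid system would ask an LLM to classify here.
    return HANDLERS["general"](msg)
```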

Think of it like...

A hospital triage desk. The triage nurse does not treat patients — they assess the situation and direct each patient to the right department (emergency, outpatient, specialist). The nurse’s job is fast classification; the department’s job is specialised treatment.

Concept to explore

See prompt-routing for a deeper dive into how routing decisions are made and the trade-offs between rule-based and LLM-based classification.


3. Parallelisation — the divide-and-conquer pattern

When a task has independent sub-tasks, you can run them simultaneously rather than sequentially. This comes in two forms: sectioning (splitting the work into different sub-tasks that run in parallel) and voting (running the same task multiple times with different approaches to get diverse outputs).2

For example: when reviewing a document, one parallel branch could check for factual accuracy while another checks for tone and style. Neither depends on the other, so they run concurrently, and their results are aggregated at the end.
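The sectioning form of that example can be sketched with a thread pool: both review branches are submitted at once and their results aggregated when they complete. The branch bodies are stubs standing in for real LLM calls.

```python
# Parallelisation (sectioning) sketch: two independent review branches
# run concurrently, then their results are aggregated.
from concurrent.futures import ThreadPoolExecutor

def check_accuracy(doc: str) -> str:
    # Stub for an LLM call focused solely on factual accuracy.
    return f"accuracy report for {len(doc)} chars"

def check_tone(doc: str) -> str:
    # Stub for an LLM call focused solely on tone and style.
    return f"tone report for {len(doc)} chars"

def review(doc: str) -> dict:
    with ThreadPoolExecutor(max_workers=2) as pool:
        accuracy = pool.submit(check_accuracy, doc)
        tone = pool.submit(check_tone, doc)
        # Aggregate once both independent branches complete.
        return {"accuracy": accuracy.result(), "tone": tone.result()}
```

Because neither branch reads the other's output, the wall-clock time is roughly the slower branch rather than the sum of both.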


4. Evaluation and feedback loops — the refinement pattern

A powerful pipeline pattern uses one LLM call to generate output and another to evaluate it, forming a loop that iterates until the output meets quality criteria.2 This is sometimes called the evaluator-optimiser pattern.5

The evaluator provides specific, actionable feedback. The generator incorporates that feedback and produces a revised version. The cycle repeats until the evaluator passes the output or a maximum iteration count is reached.

This pattern is particularly valuable when quality criteria are clear and measurable — checking code against test cases, verifying translations against style guides, or validating factual claims against source material.

Key distinction

Gates are binary checkpoints between stages (pass/fail). Feedback loops are iterative — the evaluator provides detailed feedback that guides the next revision. Gates catch problems; feedback loops fix them.


5. Tool use — extending the pipeline beyond text

A pipeline stage is not limited to text-in, text-out. Stages can call external tools: APIs, databases, code interpreters, search engines, file systems. Tool use is what gives pipelines real-world impact — the ability to read, write, query, compute, and act, not just generate text.2

The pattern is straightforward: the model decides which tool to call, formats the input, receives the output, and incorporates it into its reasoning. This is the mechanism that connects LLM pipelines to the broader ecosystem of apis and services.
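One common shape for that mechanism has the model emit its decision as structured JSON naming a tool and its input, which the pipeline dispatches against a tool registry. The sketch below uses that shape with two illustrative tools; the decision format is an assumption, not any particular vendor's API.

```python
# Tool-use dispatch sketch: the "model" emits a JSON decision such as
# {"tool": "calculate", "input": "2 + 3"}, and the pipeline runs it.
import json

def search(query: str) -> str:
    # Stub for a real search API call.
    return f"3 results for '{query}'"

def calculate(expression: str) -> str:
    # Deliberately restricted arithmetic evaluator for the sketch.
    allowed = set("0123456789+-*/(). ")
    if not set(expression) <= allowed:
        raise ValueError("unsupported expression")
    return str(eval(expression))

TOOLS = {"search": search, "calculate": calculate}

def run_tool_call(model_decision: str) -> str:
    decision = json.loads(model_decision)
    tool = TOOLS[decision["tool"]]
    result = tool(decision["input"])
    # In a real pipeline, `result` is fed back into the model's context.
    return result
```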

Concept to explore

See rag (Retrieval-Augmented Generation) for the most common tool-use pattern: retrieving relevant documents from a knowledge base before generating a response.


Why pipelines outperform monolithic prompts

Three structural reasons explain why splitting a task across multiple stages produces better results than a single large prompt:4

  1. Reduced cognitive load per step. Each stage asks the model to do one thing well, rather than juggling multiple concerns simultaneously. Research on LLM context use shows that models lose track of information embedded in long prompts — a phenomenon called the “lost-in-the-middle” effect.6

  2. Inspectable intermediate artifacts. If the final output is wrong, you can trace back through the stages to find where the error originated. In a single prompt, you get one output and no visibility into the reasoning path that produced it.4

  3. Independent optimisation. Each stage can be tuned separately: different prompts, different models (a smaller model for classification, a larger one for generation), different temperature settings, different retry policies. You cannot do this with a monolithic prompt.2
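The third point can be made concrete with a per-stage configuration: each stage carries its own model, temperature, and retry policy, so any one of them can be tuned or swapped without touching the rest. The model names below are placeholders.

```python
# Per-stage configuration sketch: each stage is independently tunable.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    model: str          # placeholder model identifier
    temperature: float
    max_retries: int = 1

PIPELINE = [
    Stage("classify", model="small-fast-model", temperature=0.0),
    Stage("generate", model="large-capable-model", temperature=0.7, max_retries=2),
    Stage("review", model="large-capable-model", temperature=0.0),
]
```

A monolithic prompt collapses all of these knobs into one: every concern gets the same model, the same temperature, and the same retry behaviour.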


Why do we use it?

Key reasons

1. Reliability. A pipeline with validation gates catches errors at each stage, preventing mistakes from compounding. A single prompt has no checkpoints — if anything goes wrong, everything goes wrong.4

2. Debuggability. When the output is wrong, you can inspect each intermediate artifact to pinpoint where the problem originated. This turns “the AI gave a bad answer” into “stage 3 produced an incomplete analysis because the input from stage 2 was missing key data.”4

3. Cost and speed optimisation. Different stages can use different models. A lightweight model handles classification (fast, cheap); a powerful model handles generation (slower, more expensive). You pay for capability only where you need it.2

4. Composability. Pipeline stages are reusable building blocks. A “fact extraction” stage written for one pipeline can be reused in another. This is the same principle that makes Unix pipes powerful — small tools that do one thing well, composed into larger workflows.3


When do we use it?

  • When a task involves multiple distinct phases (research, analyse, generate, review)
  • When the task mixes different modes of work (extracting facts, making decisions, writing prose)
  • When correctness matters and you need validation checkpoints between steps
  • When you need traceability — the ability to audit how the output was produced
  • When the input is too large or complex for a single prompt to handle reliably
  • When you want to reuse stages across different workflows

Rule of thumb

If you can describe the task as a single, clear instruction (“summarise this paragraph”), a single prompt is fine. If you find yourself writing a prompt with multiple numbered steps, conditionals, or caveats, you are describing a pipeline — and you should build one.


How can I think about it?

The recipe analogy

A pipeline is like following a recipe with distinct preparation stages.

  • Mise en place (Stage 1 - Input preparation): Gather and prepare all ingredients before cooking. In a pipeline, this is data ingestion and context loading — assembling everything the model will need.
  • Prep work (Stage 2 - Extraction/transformation): Chop vegetables, marinate meat, measure spices. Each ingredient is prepared separately. In a pipeline, this is extracting facts, classifying input, or reformatting data.
  • Taste test (Gate): Check the seasoning before moving on. In a pipeline, this is a validation gate that verifies quality before the next stage.
  • Cooking (Stage 3 - Core generation): Combine prepared ingredients and apply heat. In a pipeline, this is the main generation step where the model produces the primary output.
  • Plating (Stage 4 - Post-processing): Arrange the dish for presentation. In a pipeline, this is formatting, polishing, and final quality checks.

A chef who tries to do all of this simultaneously — chopping while sautéing while plating — produces chaos. The stages exist because each requires different attention and tools, and the order matters.

The editorial desk analogy

A pipeline is like a newspaper’s editorial process.

  • Reporter (Stage 1): Gathers facts from sources and writes a raw draft. Focused on completeness, not polish.
  • Fact-checker (Stage 2): Verifies every claim against sources. Catches errors before they propagate. This is a validation gate.
  • Editor (Stage 3): Restructures, rewrites for clarity, enforces the publication’s style guide. Focused on quality, not gathering.
  • Copy editor (Stage 4): Catches grammar, spelling, formatting issues. Fine-grained polish.
  • Layout (Stage 5): Formats the final piece for publication. The template and output formatting stage.

No single person does all of these jobs simultaneously. Each role has specialised skills and a narrow focus. The newspaper’s quality comes from the pipeline, not from any individual genius — and if the fact-checker catches an error, it is fixed before the editor wastes time polishing a flawed article.


Concepts to explore next

| Concept | What it covers | Status |
| --- | --- | --- |
| prompt-chaining | The simplest pipeline pattern — a strict sequence of LLM calls with validation gates | complete |
| prompt-routing | How systems classify input and direct it to specialised handlers | stub |
| parallelisation | Running independent subtasks concurrently via sectioning or voting | complete |
| evaluator-optimiser | The generate-evaluate-refine loop for iterative quality improvement | complete |
| context-cascading | Layering context from general to specific across pipeline stages | complete |
| rag | Retrieving external knowledge to augment generation | stub |
| structured-output | Constraining LLM responses to specific formats for reliable pipeline handoffs | complete |

Some cards don't exist yet

A broken link is a placeholder for future learning, not an error.



Where this concept fits

Position in the knowledge graph

```mermaid
graph TD
    AS[Agentic Systems] --> LP[LLM Pipelines]
    AS --> ORCH[Orchestration]
    LP --> PCH[Prompt Chaining]
    LP --> PR[Prompt Routing]
    LP --> PAR[Parallelisation]
    LP --> EO[Evaluator-Optimiser]
    LP --> CC[Context Cascading]
    LP --> RAG[RAG]
    LP --> SO[Structured Output]
    style LP fill:#4a9ede,color:#fff
```

Related concepts:

  • orchestration — while pipelines define the flow of work through stages, orchestration manages which agents run, when, and how they coordinate
  • machine-readable-formats — pipeline stages often pass structured data (JSON, YAML) between them, making machine-readable formats essential for reliable handoffs
  • apis — tool-use stages in a pipeline call external services through APIs, connecting the pipeline to the wider software ecosystem


Footnotes

  1. PMDG Technologies. (2025). Building Smarter AI Systems: Multi-Stage LLM Pipelines Explained. PMDG Technologies.

  2. Schluntz, E. and Zhang, B. (2024). Building Effective Agents. Anthropic.

  3. The Thinking Company. (2026). How to Build an AI Agent Pipeline. The Thinking Company.

  4. TheLinuxCode. (2026). Prompt Chaining: Building Reliable Multi-Step LLM Workflows. TheLinuxCode.

  5. Carpintero, D. (2025). Design Patterns for Building Agentic Workflows. Hugging Face.

  6. Liu, N. F., et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12.