The Two-Sigma Machine: How to Build AI Systems That Actually Teach
The same LLM that destroys learning when used as an answer machine can double learning gains when used as a Socratic coach. The difference is not the model. It is the harness. After reading this, you will understand why most AI tutors fail, what the systems that work have in common, and how to design one yourself.
Who this is for
You work at the intersection of AI and learning — or you want to. You may be an instructional designer watching AI rewrite your field, an AI practitioner who suspects that slapping a teaching prompt on a chatbot is not enough, or someone who sees that the space between “edtech” and “actual learning” is wide open.
This path is for you if:
- You have used AI to learn something and wondered whether it actually stuck
- You want to move from using AI to learn to designing AI systems that teach
- You are curious about the emerging role at the intersection of learning science, AI system design, and communication
- You want to understand why a $9.58 billion market has almost no one who can bridge all three
What this path is NOT
This is not a tutorial on fine-tuning LLMs or building chatbots. It is a design philosophy for learning systems, grounded in evidence, that happens to use AI as the delivery mechanism.
Part 1 — The promise nobody kept
In 1984, Benjamin Bloom published a finding that should have changed education forever.1 Students tutored one-to-one with mastery learning performed two standard deviations above their conventionally taught peers. The average tutored student outperformed 98% of the control class.
Bloom called it “the 2 Sigma Problem” — not because the finding was uncertain, but because it was economically impossible. One-to-one tutoring for every student would bankrupt any society. The problem was never “does tutoring work?” It was: how do we deliver this without a human tutor for every learner?
For forty years, nobody solved it. Programmed instruction tried in the 1960s and stalled. Computer-based training tried in the 1980s and felt robotic. Adaptive learning platforms tried in the 2010s and optimised for engagement metrics, not learning.
Then, in 2022, LLMs arrived.
```mermaid
graph LR
    A[Bloom 1984<br/>2-sigma finding] -->|40 years| B[No scalable<br/>delivery mechanism]
    B -->|2022| C[LLMs as potential<br/>tutor substrate]
    C -->|2025| D{Does it<br/>actually work?}
    style A fill:#4a9ede,color:#fff
    style D fill:#f0ad4e,color:#fff
```
A word of honesty: later replications found the original 2-sigma effect was likely overstated — more rigorous studies place the effect closer to 0.5 standard deviations.2 But even half a sigma, delivered to millions instead of one at a time, would be transformative. And in 2025, a Harvard randomised controlled trial showed an AI tutor doubling learning gains versus active-learning classrooms.3
The machine might finally exist. But only if you build it right.
Why this matters for you
If you are designing AI systems that touch learning — tutoring, onboarding, knowledge management, content — you need to understand the 2-sigma promise, why forty years of attempts failed, and what the successful systems have in common.
Part 2 — The five mechanisms that produce learning
Before designing a system that teaches, you need to know what learning actually is. Not the folk version (“I absorbed the information”). The mechanical version — what has to happen in the brain for knowledge to stick.
Five mechanisms have the strongest evidence base. A system that engages at least three produces durable learning; a system that engages all five is rare and powerful.
```mermaid
graph TD
    A[Retrieval<br/>practice] --> F[Durable<br/>learning]
    B[Spaced<br/>repetition] --> F
    C[Desirable<br/>difficulty] --> F
    D[Generation<br/>not reception] --> F
    E[Managed<br/>cognitive load] --> F
    style F fill:#5cb85c,color:#fff
```
Retrieval practice. Testing beats re-reading. Karpicke and Roediger (2008) demonstrated that students who practised retrieving information remembered significantly more than those who re-studied — even when the re-studiers spent more time.4 Learning happens not when information goes in, but when you force it back out.
Spaced repetition. Ebbinghaus showed in 1885 that 67% of learned material is forgotten within 24 hours without review.5 A meta-analysis of 254 studies confirmed: distributing practice across time produces 10–30% better recall across all study types and age groups.6
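The forgetting curve can be operationalised as exponential decay, with each successful review extending the interval to the next. A minimal sketch, assuming a simple decay model and a fixed stability growth factor (both illustrative, not parameters from the cited studies):

```python
import math

def retention(days_elapsed: float, stability: float) -> float:
    """Exponential forgetting curve: R = e^(-t/s).
    Higher stability means slower forgetting."""
    return math.exp(-days_elapsed / stability)

def schedule_reviews(target_retention: float = 0.9,
                     growth: float = 2.5,
                     initial_stability: float = 1.0,
                     n_reviews: int = 5) -> list[float]:
    """Schedule each review just before retention falls below the
    target; assume each successful review multiplies stability."""
    intervals, stability = [], initial_stability
    for _ in range(n_reviews):
        # Solve e^(-t/s) = target for t: the days until the next review
        intervals.append(-stability * math.log(target_retention))
        stability *= growth
    return intervals

print([round(d, 1) for d in schedule_reviews()])  # expanding intervals
```

The expanding intervals are the point: reviews cluster early, then spread out as the memory stabilises.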
Desirable difficulties. Robert Bjork’s counterintuitive finding: making learning harder in the right ways — spacing, interleaving, varying conditions — improves long-term retention at the cost of short-term performance.7 The difficulty is the signal that learning is happening.
The generation effect. Producing an answer is better than receiving one. Slamecka and Graf (1978) showed that generated words are remembered better than read words across recall and recognition, and a later meta-analysis found an average effect size of d = 0.40 across 86 studies.8 This is the most important principle for AI tutor design — and the one most AI tutors violate.
Cognitive load theory. Sweller (1988) identified three types: intrinsic load (the inherent complexity of the material), extraneous load (noise from bad presentation), and germane load (the work of building schemas).9 Good instruction reduces extraneous load so germane load can operate. Bad AI tutors increase extraneous load by dumping unstructured information.
Go deeper
If you want the full evidence base for these mechanisms from a learner’s perspective, read ai-self-learning. This path focuses on how to engineer them into a system.
Part 3 — Why most AI tutors make learners worse
This is the mindset shift. Most people building AI tutoring systems assume a better answer engine produces better learning. The opposite is true.
The key insight
An AI that gives excellent answers is a learning-prevention machine. It satisfies curiosity without producing memory. The learner feels fluent and retains nothing.
The evidence is stark. In 2025, a University of Pennsylvania study published in PNAS randomly assigned roughly 1,000 high school students to three conditions: raw ChatGPT, a guardrailed AI tutor (hints only), and no AI.10 The ChatGPT group solved 48% more practice problems but scored 17% lower on the test. They outsourced thinking and mistook the AI’s fluency for their own.
A separate RCT found students using ChatGPT for study retained 57.5% versus 68.5% for traditional study.11 The mechanism is cognitive-offloading — when the AI carries the cognitive work, the brain doesn’t encode it. Karpicke and Blunt (2011) call the resulting confidence the “illusion of competence”: perceived learning diverges from actual learning.12
```mermaid
graph LR
    A[Learner asks<br/>question] --> B[AI gives<br/>excellent answer]
    B --> C[Learner feels<br/>they understand]
    C --> D[No retrieval<br/>no generation<br/>no encoding]
    D --> E[Knowledge<br/>evaporates]
    style E fill:#e74c3c,color:#fff
```
But the same LLM, constrained to ask rather than answer, produced the opposite result. The Harvard RCT tutor doubled learning gains by refusing to give answers and guiding students through Socratic questioning instead.3
The difference is not the model. It is the harness.
The design question
Every AI learning system faces the same fork: optimise for the learner’s comfort (give them the answer) or for the learner’s cognition (make them produce the answer). The systems that work choose cognition. Every time.
Part 4 — The harness view: a tutor is an agentic system
A learning system is not a chatbot with a teaching prompt. It is a harness — a context-engineered agentic system with pedagogical constraints built into its architecture, not its prompt.
The systems that work — Carnegie Learning’s MATHia (built at CMU, the gold standard for intelligent-tutoring-systems)13, the Harvard RCT tutor, Duolingo’s Birdbrain engine14 — share the same architecture, designed decades apart:
```mermaid
graph TD
    A[Learner model<br/>what do they know?] --> E[Orchestrator]
    B[Domain model<br/>what must they learn?] --> E
    C[Pedagogical strategy<br/>how should we teach?] --> E
    E --> F[Interaction<br/>loop]
    F -->|feedback| A
    style E fill:#4a9ede,color:#fff
```
The learner model tracks what the student knows, what they don’t, and where they are struggling. Duolingo’s Birdbrain does this continuously — when a learner gets an exercise wrong, it adjusts both the learner ability estimate and the exercise difficulty estimate in real time.14 This is context-engineering applied to cognition.
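Birdbrain’s internals are proprietary, but the behaviour described above can be sketched with an Elo-style model in which a single observation moves both estimates at once. Everything here (the logistic form, the learning rate `k`) is an illustrative assumption, not Duolingo’s actual implementation:

```python
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Logistic model: probability the learner answers this item correctly."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def update(ability: float, difficulty: float,
           correct: bool, k: float = 0.1) -> tuple[float, float]:
    """Move both estimates toward the observed outcome. A wrong answer
    lowers the ability estimate AND raises the item's difficulty
    estimate, as the text describes."""
    surprise = (1.0 if correct else 0.0) - p_correct(ability, difficulty)
    return ability + k * surprise, difficulty - k * surprise

ability, difficulty = 0.0, 0.0
ability, difficulty = update(ability, difficulty, correct=False)
print(ability, difficulty)  # ability drops, difficulty rises
```

The design choice worth noting: the learner model and the content model improve together, so every interaction refines the system’s picture of both.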
The domain model maps the knowledge space: what concepts exist, how they relate, what order they should be learned. If you have built knowledge-graphs or taxonomies, you already understand this component.
The pedagogical strategy decides what to do next — not “what to say” (that’s a prompt) but “should we test or teach? give a hint or wait? increase difficulty or review?” This is orchestration — the same pattern used in agentic-systems, applied to a different loop.
The interaction loop is the agentic cycle: observe the learner’s response, update the model, choose the next action, deliver it, observe again. The same observe-plan-act-evaluate loop you find in agentic design, purpose-built for one objective: produce durable learning.
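The four components above can be sketched as a single loop. This is a toy illustration under stated assumptions: the class names, thresholds, and mastery updates are placeholders, not a real product API.

```python
from dataclasses import dataclass, field

@dataclass
class Harness:
    """Minimal sketch of the four-component architecture."""
    domain: list[str]                                  # domain model: ordered concepts
    mastery: dict[str, float] = field(default_factory=dict)  # learner model

    def next_action(self, concept: str) -> str:
        """Pedagogical strategy: decide what to do next, not what to say."""
        p = self.mastery.get(concept, 0.1)
        if p < 0.4:
            return "teach"    # worked example, high support
        if p < 0.85:
            return "test"     # retrieval practice, hints only
        return "advance"      # fade: move on to the next concept

    def observe(self, concept: str, correct: bool) -> None:
        """Interaction loop: update the learner model after each response."""
        p = self.mastery.get(concept, 0.1)
        self.mastery[concept] = min(1.0, p + 0.3) if correct else max(0.0, p - 0.1)

h = Harness(domain=["fractions", "ratios"])
print(h.next_action("fractions"))   # low mastery estimate: teach
h.observe("fractions", correct=True)
h.observe("fractions", correct=True)
print(h.next_action("fractions"))   # partial mastery: switch to testing
```

Note that the orchestrator never chooses “give the answer”: the only actions available are the pedagogically constrained ones, which is what building the constraint into the architecture rather than the prompt means.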
Go deeper
This architecture maps directly to context-engineering, harness-engineering, and orchestration. The difference is the objective function: not task completion, but verified learning.
Part 5 — Evaluation: did they actually learn?
The hardest part of building a learning system is knowing whether it worked. This is the equivalent of evals in agentic design (see evaluator-optimiser) — and it is just as hard, because the thing you are measuring is invisible.
Donald Kirkpatrick proposed four levels of evaluation in 1959, and they remain the standard framework more than sixty years later15:
```mermaid
graph TD
    A[Level 1 Reaction<br/>did they like it?] --> B[Level 2 Learning<br/>did they learn it?]
    B --> C[Level 3 Behaviour<br/>do they apply it?]
    C --> D[Level 4 Results<br/>did it produce outcomes?]
    style A fill:#fde8e8,stroke:#e74c3c
    style B fill:#fff3cd,stroke:#f0ad4e
    style C fill:#d4edda,stroke:#5cb85c
    style D fill:#4a9ede,color:#fff
```
Most AI tutoring products measure Level 1 (engagement, time-on-task, satisfaction). Some measure Level 2 (quiz scores). Almost none measure Level 3 or 4 — whether the learner uses the knowledge in new contexts weeks later.
The Bastani et al. study10 illustrates the trap: the ChatGPT group had excellent practice performance (Level 1–2 by proxy) but terrible test performance (Level 2, properly measured). The metric told the wrong story.
To evaluate a learning system properly:
- Pre/post testing — measure knowledge before and after
- Delayed retention — test again days or weeks later, not immediately
- Transfer tasks — can they apply it in a novel context?
- Bloom’s depth axis — are they remembering, or can they analyse, evaluate, create?16
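One common way to operationalise the first two bullets is Hake’s normalized gain for pre/post testing, paired with a retention ratio for the delayed test. A minimal sketch (function names are illustrative; scores assumed in [0, 1]):

```python
def normalized_gain(pre: float, post: float) -> float:
    """Hake's normalized gain: the fraction of the possible
    improvement actually achieved."""
    if pre >= 1.0:
        return 0.0
    return (post - pre) / (1.0 - pre)

def retention_ratio(immediate: float, delayed: float) -> float:
    """How much of the immediate score survives the delay;
    values near 1.0 suggest durable learning rather than cramming."""
    return delayed / immediate if immediate else 0.0

# The trap described above: a group that looks great immediately but forgets.
print(round(normalized_gain(0.40, 0.80), 3))   # strong immediate gain
print(round(retention_ratio(0.80, 0.55), 3))   # much of it evaporated
```

A system evaluated only on the first number can look excellent while failing the second, which is exactly the pattern in the Bastani et al. data.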
The revised Bloom’s taxonomy (Anderson & Krathwohl, 2001) gives you the evaluation verbs: Remember, Understand, Apply, Analyse, Evaluate, Create.16 A system that only produces “remember” is a flashcard app. A system that produces “create” is a tutor.
The eval question for your system
If you can’t test for transfer — knowledge applied in a context the learner hasn’t seen — you don’t know if learning happened. You only know if memorisation happened.
Part 6 — Scaffolding and fade: the autonomy gradient
The last architectural principle: a good tutor makes itself unnecessary.
Wood, Bruner, and Ross introduced scaffolding in 1976 — not Vygotsky, though it operationalises his Zone of Proximal Development.17 The idea: provide temporary support calibrated to the learner’s current ability, then systematically withdraw it as competence grows. The scaffold fades.
```mermaid
graph LR
    A[High support<br/>guided steps] --> B[Reduced support<br/>hints only]
    B --> C[Minimal support<br/>verification only]
    C --> D[No support<br/>independent]
    style A fill:#4a9ede,color:#fff
    style D fill:#5cb85c,color:#fff
```
This is progressive autonomy applied to learning — the same pattern found in agentic-loops, but the autonomy belongs to the learner, not the agent. The system starts by doing most of the cognitive work (worked examples, guided steps, frequent feedback). As the learner demonstrates mastery, it withdraws: fewer hints, harder problems, longer intervals between feedback.
Carnegie Learning’s MATHia adjusts difficulty and support in real time based on knowledge tracing.13 Duolingo’s Birdbrain calibrates each review session to the learner’s forgetting curve.14 The Harvard RCT tutor adjusted its Socratic questioning based on student responses within a single interaction.3
The opposite pattern — a system that always gives the same level of support regardless of the learner’s growth — is the equivalent of an agent that never updates its context. It feels helpful. It prevents learning.
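Knowledge tracing of the kind MATHia popularised is commonly formalised as Bayesian Knowledge Tracing (BKT); coupling the mastery estimate to a support schedule gives you the fade. A sketch with illustrative parameter values (the slip, guess, and transit probabilities and the fade thresholds are assumptions, not MATHia’s actual settings):

```python
def bkt_update(p_mastery: float, correct: bool,
               slip: float = 0.1, guess: float = 0.2,
               transit: float = 0.15) -> float:
    """Bayesian Knowledge Tracing: Bayes-update the mastery estimate
    on one observation, then apply one step of learning."""
    if correct:
        evidence = p_mastery * (1 - slip)
        posterior = evidence / (evidence + (1 - p_mastery) * guess)
    else:
        evidence = p_mastery * slip
        posterior = evidence / (evidence + (1 - p_mastery) * (1 - guess))
    return posterior + (1 - posterior) * transit

def support_level(p_mastery: float) -> str:
    """The fade: support withdraws as the mastery estimate grows."""
    if p_mastery < 0.5:
        return "guided steps"
    if p_mastery < 0.8:
        return "hints only"
    if p_mastery < 0.95:
        return "verification only"
    return "independent"

p = 0.2
for answer in [True, True, True]:
    p = bkt_update(p, answer)
    print(round(p, 2), support_level(p))  # support fades as mastery grows
```

Each fade trigger is evidence-based: support drops only after the learner has demonstrated mastery, and a run of wrong answers pushes the estimate, and therefore the support, back up.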
The design principle
If your learning system gives the same quality of help on day 100 as it did on day 1, it has failed. The measure of a good tutor is that the learner stops needing it.
Part 7 — The translator’s job
You now have the six components of ai-native-learning system design: the 2-sigma promise, the five learning mechanisms, the naive-tutor failure mode, the harness architecture, the evaluation ladder, and the scaffold-fade cycle.
The role that connects them all does not have a canonical name yet. IEEE ICICLE calls the adjacent discipline “learning engineering.”18 Philippa Hardman describes the evolution as instructional designers becoming “learning ecosystem architects” and proposes the ADGIE model as the AI-native successor to ADDIE.19
The AI-in-education market, worth $9.58 billion today, is projected to reach $136.79 billion by 2035.20 There is no standardised degree, certification, or career path for this role. The opportunity belongs to whoever acts first.
```mermaid
graph TD
    A[Engineers] -->|need to understand| P[Pedagogy]
    B[Educators] -->|need to understand| T[AI architecture]
    C[Executives] -->|need to understand| R[Evidence + ROI]
    P --> D[The translator<br/>bridges all three]
    T --> D
    R --> D
    style D fill:#4a9ede,color:#fff
```
The translator holds three conversations:
- To engineers: why a chatbot is not a tutor, why the generation effect matters more than response quality, why learning evals differ from accuracy evals
- To educators: what a harness is, why AI-native design is not “bolt AI onto a worksheet,” what adaptive scaffolding looks like at scale
- To executives: that the Harvard study showed double the learning gains, that the market is growing 34.5% annually, that the EU AI Act applies to educational AI classified as high-risk
If you can hold all three conversations in a single meeting, you are the rarest person in the room.
What you now understand
Concepts you have gained
- bloom-two-sigma — the finding, its limits, and the forty-year search for a delivery mechanism
- Five learning mechanisms — retrieval, spacing, desirable difficulty, generation, cognitive load management
- cognitive-offloading — why answer machines prevent learning and the evidence that proves it
- intelligent-tutoring-systems — the four-component architecture (learner model, domain model, pedagogical strategy, interaction loop)
- kirkpatrick-model — four levels of evaluation and why most AI tutors measure the wrong one
- scaffolding — the autonomy gradient: high support that systematically fades as competence grows
- ai-native-learning — the emerging discipline of designing learning systems around AI capabilities from the ground up
Check your understanding
Test yourself before moving on
- Explain why an AI that gives excellent answers can produce worse learning outcomes than no AI at all
- Distinguish between the four components of a learning harness (learner model, domain model, pedagogical strategy, interaction loop) and describe what each one does
- Analyse the Bastani et al. study: why did the ChatGPT group solve more practice problems but score lower on the test? Which learning mechanisms were engaged and which were bypassed?
- Evaluate an AI tutoring product you have used against Kirkpatrick’s four levels — which levels does it actually measure?
- Design a scaffold-fade sequence for teaching a single concept in your domain: what does high support look like? What triggers each reduction? When does the system step back entirely?
Where to go next
Build a learning harness
You understand the principles. Now build. Start with the harness architecture from Part 4 — a learner model, a domain model, a pedagogical strategy, and an interaction loop. Use one of your own learning domains as the test-bed.
Follow agentic-design for the system thinking, then agentic-loops for the observe-plan-act-evaluate cycle applied to learning.
Best for: practitioners who want to ship a working prototype.
Deepen the learning science
Parts 2 and 3 gave you the mechanisms. If you want the full evidence base — paradigms, knowledge types, and how to design for different kinds of understanding — go to learning-science.
Best for: designers who want the theoretical depth to justify their architectural decisions.
Understand the market
Part 7 sketched the opportunity. If you want the full picture — roles, salaries, skill stacks, and where the demand is — read ai-career-landscape.
Best for: anyone considering a career move into this space.
Start translating
If Part 7 resonated and you want to build a visible voice in this intersection, start with content-strategy for the engine, then linkedin-ecosystem for the distribution channel.
Best for: translators who want to be known before they are hired.
Sources
Further reading
Resources
- Bloom, B.S. (1984). The 2 Sigma Problem — the original 8-page paper. Read this first; it is shorter than you think.
- Brown, P.C., Roediger, H.L. & McDaniel, M.A. (2014). Make It Stick: The Science of Successful Learning — the best single book on evidence-based learning, written for practitioners.
- Sweller, J., Ayres, P. & Kalyuga, S. (2011). Cognitive Load Theory — the comprehensive treatment. Dense but essential for learning system designers.
- Hardman, P. Dr. Philippa Hardman’s Substack — the most active practitioner voice at the AI × instructional design intersection.
- Carnegie Learning. The Science Behind MATHia — how the gold-standard intelligent tutoring system works under the hood.
- Bastani, H. et al. (2025). Generative AI Without Guardrails Can Harm Learning — the study that should be required reading for anyone building AI for education.
- IEEE ICICLE. Learning Engineering Resources — the professional body working to formalise the discipline.
- Podcast: Teaching in Higher Ed — interviews with researchers and practitioners at the learning × technology intersection.
Footnotes
1. Bloom, B.S. (1984). The 2 Sigma Problem: The Search for Methods of Group Instruction as Effective as One-to-One Tutoring. Educational Researcher, 13(6), 4–16.
2. VanderWeele, T.J. & Ricon, J.L. (2019–2021). On Bloom’s Two Sigma Problem: A Systematic Review. Nintil / Two Sigma Tutoring: Separating Science Fiction from Science Fact. Education Next.
3. Kestin, G., Miller, K., Klales, A. et al. (2025). AI Tutoring Outperforms In-Class Active Learning. Scientific Reports, 15, 17458.
4. Karpicke, J.D. & Roediger, H.L. III (2008). The Critical Importance of Retrieval for Learning. Science, 319(5865), 966–968.
5. Ebbinghaus, H. (1885). Über das Gedächtnis (English: Memory: A Contribution to Experimental Psychology).
6. Cepeda, N.J. et al. (2006). Distributed Practice in Verbal Recall Tasks: A Review and Quantitative Synthesis. Psychological Bulletin, 132(3), 354–380.
7. Bjork, R.A. (1994). Memory and Metamemory Considerations in the Training of Human Beings. In J. Metcalfe & A. Shimamura (Eds.), Metacognition: Knowing About Knowing (pp. 185–205). MIT Press.
8. Slamecka, N.J. & Graf, P. (1978). The Generation Effect: Delineation of a Phenomenon. Journal of Experimental Psychology: Human Learning and Memory, 4(6), 592–604.
9. Sweller, J. (1988). Cognitive Load During Problem Solving: Effects on Learning. Cognitive Science, 12(2), 257–285.
10. Bastani, H., Bastani, O., Sungu, A. et al. (2025). Generative AI Without Guardrails Can Harm Learning. PNAS, 122(26).
11. Barcaui, A. (2025). ChatGPT as a Cognitive Crutch: Evidence from a Randomized Controlled Trial on Knowledge Retention. SSRN.
12. Karpicke, J.D. & Blunt, J.R. (2011). Retrieval Practice Produces More Learning than Elaborative Studying with Concept Mapping. Science, 331(6018), 772–775.
13. Carnegie Learning. MATHia: AI-Powered Math Learning. Founded at CMU by Anderson, Koedinger, Ritter & Hadley (1998).
14. Duolingo. Learning How to Help You Learn: Introducing Birdbrain. Duolingo Blog.
15. Kirkpatrick, D. (1959; updated 1993). Evaluating Training Programs. Berrett-Koehler. Kirkpatrick Partners.
16. Anderson, L.W. & Krathwohl, D.R. (Eds.) (2001). A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy. Longman.
17. Wood, D., Bruner, J.S. & Ross, G. (1976). The Role of Tutoring in Problem Solving. Journal of Child Psychology and Psychiatry, 17(2), 89–100.
18. IEEE ICICLE. Industry Consortium for Innovation and Collaboration in Learning Engineering. IEEE Standards Association.
19. Hardman, P. (2024–2025). AI Tutors Double Rates of Learning and related publications on the ADGIE model. Substack.
20. Precedence Research (2025). Artificial Intelligence in Education Market.
