How to Think About AI in 2026: From Automation to Reasoning

The conversation about AI is shifting from wonder to something quieter and more uncertain. This path gives you a frame for where the technology actually is, what to do about hallucinations, why dependency is the right concern, and the single reframe that changes how you should use it.


Who this is for

You have used ChatGPT or Claude for a year or two. You went through a phase of awe. You now find yourself slightly more skeptical, slightly more careful, and unsure whether the skepticism is wisdom or premature dismissal. You have noticed that the public mood has shifted in the same direction, and you want to know whether the data supports the new mood.

This path is for you if:

  • You teach, lead, or coach others on AI and feel the room cooling
  • You hear claims that “AI is plateauing” and want to know if they hold up
  • You suspect that something important is happening to the way people think when they use AI heavily, and you want to know what the research actually shows
  • You want a sharper mental model than “AI is good” or “AI is bad” — one that helps you decide how to use it

What this path is NOT

Not a tool round-up. Not a self-learning manual (for that, see ai-self-learning). This path is about how to think about the technology itself in 2026 — the empirical state, the debates, and the reframe that the evidence supports.


Part 1 — The mood shift

Something has changed in the room.

Workshops that opened in 2023 with audible enthusiasm now open with cautious questions. The dominant emotion has moved from wonder to a more specific kind of concern. The numbers back this up.

Pew’s September 2025 survey of US adults found that 50% are now more concerned than excited about AI in daily life, up from 37% in 2021. Only 10% are more excited than concerned.1 Asked to name AI’s biggest risks in their own words, the most frequently cited answer was not job loss, not bias, not existential risk. It was “people becoming lazy or losing the ability to think for themselves.” That answer maps almost exactly onto the published cognitive-debt research you’ll meet in Part 4.

graph LR
    A[2021<br/>37% concerned] --> B[2023<br/>52% concerned]
    B --> C[2025<br/>50% concerned<br/>10% excited]

    style A fill:#5cb85c,color:#fff
    style C fill:#d9534f,color:#fff

The picture is more layered than the US data alone suggests. Stanford’s 2026 AI Index, working from a 30-country Ipsos panel, found that the share of people globally who say AI offers more benefits than drawbacks rose to 59% — driven mostly by Asia (China 83%, Indonesia 80%, Thailand 77%).2 The wonder-to-skepticism arc is real but concentrated in the US, the UK, Germany, and Canada.

The most striking finding sits inside the same Pew dataset: when researchers asked AI experts the same questions, 47% were more excited than concerned, against 11% of the general public — a 36-point optimism gap.3 The people closest to the technology are the most positive about it. The people furthest from it are the most worried. Both of these things can be informative; neither is automatically right.

Why this matters for you

If you are running training, leading a team, or writing about AI, you are now talking to an audience whose default emotion has shifted. They are not asking “can it do this?” They are asking “what is it doing to me?” Meeting them at that question — rather than at the older “look what it can do” question — is the move this path is built around.


Part 2 — The trust ceiling

The single concept that hardened public skepticism is hallucination. People learned the word, applied it to something they themselves had been burned by, and used it to justify a quiet step backwards. The verdict that emerged sounds something like: AI is great for brainstorming, dangerous for facts.

This verdict is half right.

Hallucinations have measurably declined on bounded, grounded tasks. Vectara’s HHEM leaderboard, which tracks document-grounded summarisation, shows top models now hallucinating at 2-4% — down from 8-27% for the 2023 generation.4 Five-to-ten-fold improvement in three years.

But two recent findings make the harder picture clear. OpenAI’s own September 2025 paper Why Language Models Hallucinate concedes that hallucinations arise from “natural statistical pressures” in the loss function: they originate as errors in binary classification and are reinforced by evaluation benchmarks that reward guessing over abstention.5 In other words, hallucination is not a bug being patched. It is a feature of how the model learns to answer.

And when Vectara released a harder, longer, more diverse test set in late 2025, top-model hallucination rates went back up.6 The old benchmark was saturating. The underlying tendency had not been solved — only managed for a specific class of task.

graph LR
    M[Model-centric trust<br/>Which model hallucinates least?] --> H[Harness-centric trust<br/>What verifies the output?]

    style M fill:#d9534f,color:#fff
    style H fill:#5cb85c,color:#fff

The production frontier in 2026 is no longer “which model can I trust.” It is “what verification system wraps the model.” That shift is exactly the move from thinking about AI as a thing to thinking about it as one component inside a harness — the guardrails, the context, the evaluator-optimiser loop, the human checkpoints — that decides whether the output is grounded.

The trust shift

Stop asking “is the AI right?” Start asking “what is checking whether it is right?” The first question has no general answer. The second one always does, and the answer is your job.
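
To make the second question concrete, here is a minimal sketch of a grounding check in Python — the shape of a harness, not a production system. The call_model argument is a stand-in for whatever model API you use (an assumption, not a named library), and the check is deliberately crude: every claim the model tags must appear verbatim in the source document. A real harness adds an evaluator model, retries with feedback, and a human checkpoint for anything that fails.

import re
from typing import Callable

def extract_claims(draft: str) -> list[str]:
    # Pull out the spans the model marked as factual claims.
    return re.findall(r"<claim>(.*?)</claim>", draft, flags=re.DOTALL)

def grounded_summary(source: str, call_model: Callable[[str], str],
                     max_attempts: int = 3) -> str:
    # Ask for a summary whose tagged claims must quote the source verbatim.
    prompt = ("Summarise the document below. Wrap every factual claim in "
              "<claim>...</claim> tags, quoting the source verbatim.\n\n" + source)
    for _ in range(max_attempts):
        draft = call_model(prompt)
        unsupported = [c for c in extract_claims(draft) if c not in source]
        if not unsupported:
            return draft  # every tagged claim is literally present in the source
        # Feed the failures back rather than trusting the next answer blindly.
        prompt += ("\n\nThese claims were not found in the source; "
                   "correct or remove them: " + "; ".join(unsupported))
    raise ValueError("No grounded summary after retries — escalate to a human.")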

Go deeper

The hallucination card explains the mechanism in detail. The harness-engineering card explains the verification structures that make outputs trustworthy.


Part 3 — The recipe shift, not the plateau

The second narrative driving the new mood is that AI is plateauing. Pre-training scaling has stalled; the data is running out; the curve is flattening. Ilya Sutskever made the case at NeurIPS 2024: “Pre-training as we know it will unquestionably end. We have but one internet.”7 Gary Marcus has been arguing it for years.8

They are right about pre-training. They are wrong about the overall curve, and the difference matters.

What actually happened in late 2024 was a recipe change. Frontier labs shifted compute from pre-training (more parameters, more tokens of internet text) to post-training (reinforcement learning on verifiable rewards) and to inference-time effort (letting the model think longer before answering). METR’s task-time-horizon metric — the duration of a software task an agent can complete with 50% reliability — actually accelerated after late 2024, its doubling time shortening from roughly 196 days to ~89 days.9 Capability progress did not slow. The lever moved.
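
To build intuition for what an ~89-day doubling time implies — an illustrative extrapolation, not a METR forecast — a one-hour task horizon crosses roughly two working days within a year if the trend simply continues:

# Illustrative extrapolation of a time-horizon doubling trend (not a forecast).
DOUBLING_DAYS = 89           # post-2024 doubling time reported by METR
start_horizon_hours = 1.0    # roughly a one-hour task horizon today

for months in (0, 3, 6, 9, 12):
    days = months * 30
    horizon = start_horizon_hours * 2 ** (days / DOUBLING_DAYS)
    print(f"after {months:2d} months: ~{horizon:.1f}-hour task horizon")
# after 12 months: ~16.5-hour task horizon — if, and only if, the trend holds.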

graph LR
    A[Pre-training scaling<br/>2020-2024] --> B[Plateaued]
    A --> C[Reasoning + inference compute<br/>2024-2026]
    C --> D[Still doubling<br/>every ~89 days]

    style B fill:#d9534f,color:#fff
    style D fill:#5cb85c,color:#fff

This is more than a technical footnote. The field moved compute from pattern-matching to thinking — running longer chains of reasoning, exploring more candidate paths, verifying intermediate steps. The “automation to reasoning” reframe in the title of this path is not a rhetorical flourish. It is the technical substrate of how the next generation of models is built.

The pricing picture has the same shape. Per-token costs for yesterday’s frontier collapsed: Epoch AI puts the decline at 9-900× per year, median 50×, accelerating to ~200× per year after January 2024.10 But frontier flagship pricing has only fallen ~12× over three years. The cheap tier is collapsing toward free. The premium reasoning tier is sustained at a 30× premium because the new compute lever — thinking — costs real money.
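
The practical consequence is routing. A back-of-envelope sketch of the same batch job priced at a commodity tier versus a reasoning tier — the per-token prices below are hypothetical placeholders, not any provider’s actual rates — shows why harness design now includes deciding which steps deserve the expensive thinking:

# Back-of-envelope job costing; prices are hypothetical (USD per million tokens).
CHEAP_PRICE = 0.15                    # commodity tier: yesterday's frontier
REASONING_PRICE = CHEAP_PRICE * 30    # the ~30x premium described above

TOKENS_PER_DOC = 4_000
NUM_DOCS = 100_000

def job_cost(price_per_million_tokens: float) -> float:
    total_tokens = TOKENS_PER_DOC * NUM_DOCS
    return total_tokens / 1_000_000 * price_per_million_tokens

print(f"cheap tier:     ${job_cost(CHEAP_PRICE):,.2f}")      # $60.00
print(f"reasoning tier: ${job_cost(REASONING_PRICE):,.2f}")  # $1,800.00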

There is a separate, legitimate concern about the economics. Hyperscaler capex on AI infrastructure is projected at roughly $700 billion in 2026, with capex growth running ~80% against revenue growth of ~16%.11 Goldman Sachs and the MIT NANDA report have made the case that the absorption rate is too slow to clear historical hurdles.12 But this is a debate about equity multiples, not about whether the technology works. A correction would reprice the business model. It would not unbuild the capability.

The recipe-shift principle

The capability curve did not flatten. The compute moved from pattern-matching to reasoning. Plateau-talk in 2026 mostly reflects an out-of-date mental model of how progress is being made.


Part 4 — The dependency trap

The plateau debate is a distraction. The concern that actually matches the data is what happens to us when we use AI heavily.

Three studies from 2024-25 form the empirical core.

MIT Media Lab, “Your Brain on ChatGPT” (Kosmyna et al., 2025). Fifty-four participants wrote essays across four sessions while EEG tracked their brains. Three conditions: brain-only, search-engine-assisted, LLM-assisted. The LLM group showed the weakest neural connectivity, scored worst on memory recall of their own essays, and reported the lowest sense of ownership over the work. The researchers coined the phrase cognitive debt.13

Gerlich (Societies, 2025). A 666-participant study found a significant negative correlation between frequent AI tool use and critical thinking ability, mediated by cognitive offloading. Younger users were hit hardest.14

Fan et al. (BJET, 2025). A 117-student RCT assigned learners to one of three conditions: writing essays with ChatGPT from the start, writing without it, or writing first and then revising with ChatGPT. The from-the-start ChatGPT group produced the best essays and learned the least. The write-first-then-revise group preserved their learning. Same tool, same task, opposite outcomes.15

The most uncomfortable finding sits in software engineering — the domain where AI is supposedly most automated. METR’s July 2025 randomised trial gave experienced open-source developers AI coding tools and measured their actual speed. They were 19% slower with the tools than without. They reported feeling 20% faster. A 39-point gap between perceived productivity and real productivity — invisible to the people inside it.16

graph LR
    P[What you perceive<br/>+20% faster] -.-> R[What is measured<br/>-19% slower]

    style P fill:#5cb85c,color:#fff
    style R fill:#d9534f,color:#fff

This is the central data point of this path. Fluency is not competence. The feeling of working well is not evidence of working well. Take the measurement away and the gap does not close — it just stops being visible.

A note on a phrase you hear in this discussion: model collapse. The technical phenomenon (Shumailov et al., Nature 2024) was that recursively training a model on its own outputs causes the data distribution to degenerate.17 Subsequent work (Gerstgrasser et al., 2024) showed this only happens when each generation replaces the previous one; in the realistic regime where synthetic data accumulates alongside real data, collapse does not occur.18 The original technical claim was overstated.

What is real and measurable is cultural slop: AI-generated content degrading specific information environments. Merriam-Webster named “AI slop” the 2025 Word of the Year. The curl project shut down its bug bounty program in January 2026 because of AI-slop vulnerability reports. Ahrefs measured that 74% of new web pages in 2025 contained AI-generated material. Two distinct phenomena — technical and cultural — that get confused under one label. Keep them separate so the worry stays defensible.

The honest worry

The right concern is not “AI will plateau and let us down.” The right concern is “AI will keep working, and the way most people use it will quietly degrade their thinking.” That is what the converging evidence supports. It is also what the public is reporting, in its own words, as its top concern about the technology.


Part 5 — The reframe

If the dominant frame for AI is automation and replacement, then the dominant question is who loses their job. If the frame is reasoning and mental models, then the dominant question is how do I think more clearly with this tool present. These are not the same question and they do not lead to the same actions.

The most counter-intuitive thing to come out of the 2024-26 research is that the augmentation reframe is being led from inside the major labs and consultancies, not by AI critics.

Ethan Mollick at Wharton (with the Boston Consulting Group) ran the foundational study: 758 BCG consultants, randomly assigned to use GPT-4 or not on 18 realistic business tasks. Inside the “jagged frontier” of tasks AI was good at, AI users completed 12.2% more tasks, 25.1% faster, with 40% higher quality. Outside that frontier, AI users were 19% less likely to get the right answer. Same model, opposite outcomes, depending on whether the task fit.19 Mollick named the three modes of working with AI: Centaurs divide labour with the model, Cyborgs weave AI into every step, Self-Automators abdicate to it. Centaurs and Cyborgs gain skill. Self-Automators do not.20

Andrej Karpathy, formerly head of AI at Tesla and a founding team member at OpenAI, gave the most-cited 2025 talk on the reframe: Software Is Changing (Again). His argument: software 1.0 was code, software 2.0 was weights, software 3.0 is English prompts as programs. The implicit frame is empowerment, not replacement. Anyone who can articulate intent in language is now a programmer.21

Anthropic’s own 2026 Agentic Coding Trends Report — published by the company most incentivised to overstate AI’s capabilities — admits that developers can fully delegate only 0-20% of tasks.22 The other 80-100% requires the human to remain in the loop. Coming from Anthropic, this is the closest thing to an official acknowledgement that the right mental model is augmentation, not autonomy.

graph TD
    M[Mode of engagement] --> C[Centaur<br/>Divide the work]
    M --> CY[Cyborg<br/>Interleave constantly]
    M --> SA[Self-Automator<br/>Abdicate to AI]

    C --> S1[Skill grows]
    CY --> S2[Skill grows]
    SA --> S3[Skill atrophies]

    style S1 fill:#5cb85c,color:#fff
    style S2 fill:#5cb85c,color:#fff
    style S3 fill:#d9534f,color:#fff

There is a darker counter to keep in view. Cory Doctorow points out that the augmentation framing assumes the worker has power over how AI is used. Workers in low-discretion roles — gig drivers, content moderators, warehouse pickers — find AI imposed on them as reverse centaurs: the human serves the model, not the other way round.23 Same technology, opposite affect, mediated by power.

The reframe holds for knowledge workers with discretion over their tools. It does not automatically hold for everyone the technology touches. That distinction matters when you talk publicly about the shift.

The reframe

Stop asking what AI replaces. Start asking what it lets you think more clearly about, and what it tempts you to stop thinking about. The first question makes you defensive. The second one makes you more capable.

Go deeper

The dynamic-load-shifting card describes how the cognitive balance between you and AI evolves as you grow more skilled.


Part 6 — The agentic horizon

This is where the gap between practitioner experience and public discourse is largest, and where most of the actual leverage is hiding.

In 2025-26, AI shifted from being something you talk to (a chat window) to something that does work for you (an agent). Claude Code, Cursor, OpenAI Operator, Anthropic Computer Use, browser agents, scoped research assistants. METR’s measurement tells the optimistic side: time-horizon doubling every ~89 days, with frontier models now reliably completing tasks that take a human about an hour.9

But agent capability is bounded in ways the headlines do not make obvious. METR itself published a careful caveat in January 2026: a 50%-time-horizon of X hours does not mean you can delegate X-hour tasks. Reliability-critical work needs 98%+ success, and doubling the time horizon does not double the degree of delegation — failures get harder to recover from at longer scales.24 On contamination-resistant coding benchmarks like SWE-bench Pro, top models still score in the low twenties.25
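
A simple way to see why — a deliberately naive model that assumes a long task decomposes into independent hour-long steps, which real agent work does not, but the compounding is the point:

# Why a 50%-reliability horizon is not a delegation threshold.
# Naive assumption: a long task is a chain of independent hour-long steps.
def chain_success(per_step: float, steps: int) -> float:
    return per_step ** steps

for p in (0.50, 0.90, 0.98):
    rate = chain_success(p, 8)
    print(f"per-step success {p:.0%}: an 8-step task succeeds {rate:.1%} of the time")
# 50% per step -> 0.4% overall; 90% -> 43.0%; 98% -> 85.1%.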

Where agents reliably work in 2026:

  • Scoped coding tasks where tests act as the oracle
  • Bounded research tasks with clear stopping conditions
  • Browser tasks with explicit human checkpoints

Where they reliably fail:

  • Long-horizon work without verifiable rewards
  • Open-ended desktop automation
  • Anything requiring stable repeated execution in novel environments

The honest summary: agents are powerful tools that require active supervision, not autonomous workers. The leverage is real, but it sits in systems you design — with orchestration, evaluator loops, human checkpoints, and an explicit autonomy spectrum — not in a single delegation moment.

graph LR
    L[Low autonomy<br/>You drive] --> M[Mid autonomy<br/>Agent drafts, you verify]
    M --> H[High autonomy<br/>Agent executes, you set guardrails]

    style L fill:#5cb85c,color:#fff
    style M fill:#4a9ede,color:#fff
    style H fill:#e8b84b,color:#fff

Public discourse is still arguing about whether agents work. The practitioner question moved on a year ago. It is no longer “can they?” It is “for what tasks, with what verification, and with what fallback?” That is the conversation worth having.

The agentic question

Ask: what is the smallest scope at which this agent has a verifiable reward signal? That is the scope where it is useful. Anything beyond it requires you to design more of the harness.
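
For a scoped coding task, the verifiable signal can be as plain as the test suite. A minimal sketch of that loop — propose_patch, apply_patch, and human_review are placeholders you supply, not a named framework, and the pytest call is the oracle that decides whether the scope was right:

import subprocess

def tests_pass(repo_dir: str) -> bool:
    # The oracle: the task counts as done only if the suite says so.
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir, capture_output=True)
    return result.returncode == 0

def run_scoped_task(repo_dir, task, propose_patch, apply_patch, human_review,
                    max_rounds: int = 3) -> bool:
    for _ in range(max_rounds):
        patch = propose_patch(task)        # the agent drafts a change
        if not human_review(patch):        # explicit checkpoint before anything lands
            continue
        apply_patch(repo_dir, patch)
        if tests_pass(repo_dir):           # the verifiable reward signal
            return True
    return False                           # escalate: the scope was too wide for this agent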

Go deeper

The agentic-design path covers how to design these systems. The agentic-systems card explains the building blocks.


Part 7 — The full map

graph TD
    subgraph Mood
        MS[Mood shift:<br/>wonder → cognitive concern]
    end

    subgraph Frames
        TC[Trust ceiling:<br/>harness, not model]
        RS[Recipe shift:<br/>reasoning, not pre-training]
    end

    subgraph Risk
        DT[Dependency trap:<br/>fluency ≠ competence]
    end

    subgraph Reframe
        RE[From automation<br/>to reasoning]
    end

    subgraph Frontier
        AG[Agentic horizon:<br/>scoped autonomy]
    end

    MS --> TC
    MS --> RS
    TC --> DT
    RS --> DT
    DT --> RE
    RE --> AG

    style RE fill:#4a9ede,color:#fff
    style DT fill:#d9534f,color:#fff
    style AG fill:#5cb85c,color:#fff

Five moves, in order:

  1. Recognise the mood shift. People are not bored of AI. They are worried about what it is doing to their thinking.
  2. Move trust from the model to the harness. Hallucinations are not a model property to be solved. They are a system property to be verified.
  3. Update the plateau picture. Pre-training scaling did stall. Reasoning compute and inference-time effort kept the curve moving. The lever changed.
  4. Take cognitive debt seriously. It is the strongest empirical signal in the field, and the converged concern of the public, the labs, and the researchers studying use.
  5. Reframe automation as augmentation, with eyes open. Centaur and Cyborg modes grow skill. Self-Automator mode atrophies it. Reverse centaurs suffer. Mode of engagement, not the model, decides the outcome.

What you now understand

After this path, you should be able to:

  • Explain the wonder-to-concern shift in public sentiment and name what people are actually worried about
  • State why hallucinations have not been solved and what has changed about how trustworthy systems are built
  • Distinguish the pre-training plateau from the broader capability curve and explain what changed in late 2024
  • Name the cognitive-debt evidence and the perception-reality gap measured in expert workers
  • Distinguish technical model collapse from cultural AI slop and explain why conflating them weakens the argument
  • Describe the three modes of engagement (Centaur, Cyborg, Self-Automator) and predict which produces skill growth
  • Identify where agents reliably work, where they fail, and what the practitioner question about agents really is


Where to go next

Path A — How to actually learn with AI

Read ai-self-learning for the practical loops that turn an LLM into a tutor instead of a crutch. The cognitive-debt research in this path is the why; the self-learning path is the how.

Best for: Anyone teaching themselves a new domain.

Path B — How to design AI systems

Read agentic-design for the architectural patterns behind the agents and harnesses described in Parts 2 and 6. This is where the trust shift becomes a concrete design discipline.

Best for: Builders, technical leads, product designers.

Path C — How humans actually learn

Read learning-science for the cognitive-psychology foundations behind retrieval practice, desirable difficulties, and the evidence-based learning strategies that explain why cognitive offloading degrades skill.

Best for: Educators, coaches, trainers.

Path D — Why this knowledge system exists

Read manifesto for the broader argument that AI without structured comprehension creates an illusion of competence, and the system Yiuno proposes to address it.

Best for: Readers who want the meta-argument behind everything else.


Footnotes

  1. Pew Research Center (2025). How the U.S. public and AI experts view artificial intelligence. The 50%-concerned / 10%-excited US trendline and the open-ended risk responses.

  2. Stanford HAI (2026). The 2026 AI Index Report — Public Opinion Chapter. 30-country Ipsos panel showing global optimism rising to 59%, with concentration in Asia.

  3. Pew Research Center (2025). How the U.S. public and AI experts view artificial intelligence. The 36-point expert/public optimism gap.

  4. Vectara (2026). Hallucination Leaderboard (HHEM-2.3). Document-grounded summarisation rates for current frontier models.

  5. Kalai, A. T., Nachum, O., Vempala, S., & Zhang, E. / OpenAI (2025). Why Language Models Hallucinate. arXiv. The “natural statistical pressures” framing and evaluation-incentive analysis.

  6. Vectara (2025). Introducing the Next Generation of Vectara’s Hallucination Leaderboard. Evidence that harder test sets bring rates back up.

  7. Vincent, J. (2024). OpenAI cofounder Ilya Sutskever predicts the end of AI pre-training. The Verge. Source for the “we have but one internet” framing.

  8. Marcus, G. (2025). Breaking: OpenAI’s efforts at pure scaling have hit a wall. Substack. The cleanest skeptic statement of the plateau thesis.

  9. METR (2026). Time Horizon 1.1. Updated doubling-time analysis showing post-2024 acceleration to ~89 days.

  10. Epoch AI / Cottier, B. et al. (2025). LLM inference prices have fallen rapidly but unequally across tasks. Per-token cost decline analysis.

  11. Elias, J. (2026). Tech AI spending approaches $700 billion in 2026, cash taking big hit. CNBC. 2026 hyperscaler capex figures and free-cash-flow impact.

  12. Goldman Sachs Research (2025). Are AI Bubble Concerns Warranted or Overblown?. Mainstream-finance framing of the absorption-rate debate.

  13. Kosmyna, N. et al. (2025). Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task. MIT Media Lab / arXiv. The EEG study and the coining of “cognitive debt.”

  14. Gerlich, M. (2025). AI Tools in Society: Impacts on Cognitive Offloading and the Future of Critical Thinking. Societies, 15(1), 6. The 666-participant cognitive-offloading study.

  15. Fan, Y. et al. (2024/25). Beware of Metacognitive Laziness: Effects of Generative Artificial Intelligence on Learning Motivation, Processes, and Performance. British Journal of Educational Technology. The 117-student RCT.

  16. METR (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. The 19%-slower / 20%-perceived-faster RCT.

  17. Shumailov, I. et al. (2024). AI models collapse when trained on recursively generated data. Nature, 631, 755-759. The original technical claim.

  18. Gerstgrasser, M. et al. (2024). Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data. arXiv / ICML workshop. The decisive technical rebuttal.

  19. Dell’Acqua, F., McFowland, E., Mollick, E. et al. (2023, updated 2026). Navigating the Jagged Technological Frontier. HBS Working Paper 24-013. The 758-consultant BCG study.

  20. Mollick, E. (2025). On Working with Wizards. One Useful Thing. The most current statement of the Centaur / Cyborg / Self-Automator framing.

  21. Karpathy, A. (2025). Software Is Changing (Again). YC AI Startup School keynote, June 2025. The Software 3.0 framing.

  22. Anthropic (2026). 2026 Agentic Coding Trends Report. The 0-20% full-delegation finding.

  23. Doctorow, C. (2025). Reverse centaurs are the answer to the AI paradox. Medium / Pluralistic. The counter to the augmentation framing for low-discretion workers.

  24. Kwa, T. (2026). Clarifying limitations of time horizon. METR. Caveats on what time-horizon numbers do and do not imply.

  25. OpenAI (2026). Why SWE-bench Verified no longer measures frontier coding capabilities. Lab acknowledgement of benchmark contamination and saturation.