What AI Assistance Means for Learning Leaders: The Performance-Learning Gap
Quick Answer
Three converging 2025 studies show AI assistance produces two opposite outcomes that standard metrics cannot tell apart: better assisted performance and worse independent capability. Unrestricted chatbots create a measurable performance-learning gap; structured tutors with attempt-first gating and teacher-authored hints can outperform active learning. The procurement question for learning leaders is not whether to adopt AI, but what to require of it before it touches learners.
Key Takeaway
Generic AI chatbots reliably improve assisted task performance while degrading independent capability; only structured, pedagogically constrained tutors deliver durable learning, and the design choice is yours to require.
A consistent finding across three large 2025 studies is reshaping how senior learning leaders should evaluate AI in education and L&D: assisted scores and durable capability look identical on a dashboard but diverge sharply when the AI is taken away. The procurement question is not whether to adopt AI for learning (workforce data shows that ship has sailed) but what to require of any system before it reaches learners. The mental model behind this gap is covered in cognitive offloading in AI-assisted learning.
What this means for your organization
The exposure is a delayed-signal problem. An institution can adopt an unrestricted chat assistant, watch practice scores and completion metrics climb for two to four years, and then discover in delayed assessments, certification pass rates, or new-hire performance that the cohort has weaker independent reasoning than its pre-AI peers. Bastani et al. (PNAS 2025) measured exactly this shape: +48% on assisted practice, −17% on a later unassisted exam, in a preregistered field experiment with about a thousand high-school math students.
Three operating environments are most exposed: K–12 and higher-ed programs adopting AI tutors at scale, corporate L&D rolling out AI-assisted training and onboarding, and any setting where competence development — not artifact throughput — is the point. The least-prepared learners are most vulnerable: they offload the target cognition rather than the incidental load, and they cannot audit AI output well enough to catch errors. In professional contexts, AI handling routine cases removes the practice repetitions that maintain expert judgment over time.
The countervailing evidence matters as much as the warning. Kestin et al. (Scientific Reports 2025) showed a structured AI tutor outperformed in-class active learning by roughly 0.63 SD in a Harvard physics RCT. The lever is instructional design, not model capability.
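For calibration, 0.63 standard deviations is a large effect by education-research norms. Cohen's d divides the mean difference between conditions by the pooled spread of scores, and under roughly normal score distributions (our back-of-envelope interpretation, not a figure reported by the paper) it implies the median tutor-condition learner outscored about 74% of the active-learning class:

```latex
d = \frac{\bar{x}_{\text{tutor}} - \bar{x}_{\text{active}}}{s_{\text{pooled}}} = 0.63,
\qquad \Phi(0.63) \approx 0.74
```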
What to ask your team
Have we measured AI-off transfer performance against a pre-AI or control baseline before scaling any AI learning tool?
Does the system we are evaluating implement attempt-first gating and teacher-authored hint ladders, or is it a chat box over a frontier model?
What process data — attempts, hint usage, revisions, self-explanation — does the platform capture, and how would we know if learning actually happened?
Where in our curriculum or training program do learners practice auditing AI output, including hallucinated citations and adversarial cases?
Do our grading or performance KPIs reward independent capability, or only final artifacts that AI can produce on the learner's behalf?
What good looks like
A learning-grade AI deployment has these properties at the architectural level:
- Attempt-first interaction. Learners produce a first step before help is offered. The system responds to their reasoning, not to the literal request "solve this."
- Teacher- or designer-authored constraints. Solutions, hint ladders, and scope boundaries are authored by the people responsible for the learning outcome — not inherited from a generic model.
- Process instrumentation. Attempts, revisions, hint requests, self-explanation, and confidence calibration are captured and reviewable.
- Periodic AI-off assessment. Transfer is measured on independent tasks at intervals, against a baseline, and bad signals trigger redesign rather than rationalization.
- Scaffold fading. Support reduces as competence grows; a system that always provides the same level of help manufactures dependence.
- AI auditing as a learning target. Curricula include adversarial outputs and verification tasks so learners build the evaluation skill the workplace already requires.
These are properties to look for, not implementation steps. Vendors who can demonstrate them in their platform have done the pedagogical work; vendors who cannot are selling an answer engine.
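For concreteness only, here is a minimal sketch of how attempt-first gating, a teacher-authored hint ladder, process instrumentation, and scaffold fading might compose in code. Every name, the mastery-streak heuristic, and the event schema below are hypothetical illustrations, not a reference implementation or any vendor's API:

```python
"""Hypothetical sketch: attempt-first gating, a hint ladder, instrumentation,
and scaffold fading in one place. Names and heuristics are invented for
illustration; this is not any vendor's API."""
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class HintLadder:
    """Teacher-authored hints, ordered from gentle nudge to near-worked example."""
    hints: list[str]

    def hint_at(self, level: int) -> str:
        # Clamp so repeated requests never run past the final authored hint.
        return self.hints[min(level, len(self.hints) - 1)]


@dataclass
class ProcessEvent:
    """One row of the process data a learning-grade platform should capture."""
    learner_id: str
    item_id: str
    kind: str       # "attempt" | "hint" | "revision" | "self_explanation"
    payload: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class AttemptFirstTutor:
    """Gates help behind a first attempt and fades hints as competence grows."""

    def __init__(self, ladder: HintLadder, mastery_streak: int = 3) -> None:
        self.ladder = ladder
        self.mastery_streak = mastery_streak  # unaided successes before fading kicks in
        self.events: list[ProcessEvent] = []
        self.hint_level = 0
        self.unaided_streak = 0

    def request_help(self, learner_id: str, item_id: str, attempt: str | None) -> str:
        # Attempt-first gating: no attempt on record, no help.
        if not attempt or not attempt.strip():
            return "Show your first step before asking for a hint."
        self.events.append(ProcessEvent(learner_id, item_id, "attempt", attempt))

        # Scaffold fading: a learner on an unaided streak is pushed to persist.
        if self.unaided_streak >= self.mastery_streak:
            return "You have solved recent items without help; try one more revision first."

        # Hint ladder: escalate one authored rung per request, never jump to the answer.
        hint = self.ladder.hint_at(self.hint_level)
        self.hint_level += 1
        self.events.append(ProcessEvent(learner_id, item_id, "hint", hint))
        return hint

    def record_outcome(self, correct: bool, used_help: bool) -> None:
        # Instrumentation feeds fading: track unaided successes across items.
        self.unaided_streak = self.unaided_streak + 1 if correct and not used_help else 0
        self.hint_level = 0  # the ladder restarts for the next item


ladder = HintLadder(hints=[
    "What quantity is conserved in this problem?",
    "Write the conservation equation symbolically before substituting numbers.",
])
tutor = AttemptFirstTutor(ladder)
print(tutor.request_help("s-01", "momentum-3", attempt=None))                    # gated
print(tutor.request_help("s-01", "momentum-3", attempt="p = mv for each cart"))  # first hint
```

The value of the sketch is its shape, not its details: help is unreachable without an attempt on record, every interaction leaves a reviewable event behind, and the system's default behavior changes as competence grows, which are exactly the properties listed above.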
Where to dig deeper
- Cognitive offloading in AI-assisted learning — the mental model behind the performance-learning gap, with the mechanism in detail.
- Learning effects of AI assistance on critical thinking and cognitive offloading — the source synthesis paper with the full evidence map across Bastani, Kestin, and Lee.
- Other learning briefs — sibling executive briefs on adjacent decisions in AI-assisted learning.
- Bastani et al., PNAS 2025; Kestin et al., Scientific Reports 2025; Lee et al., CHI 2025 — the three primary studies behind this brief.
FAQ
How exposed is our organization if we adopt a general-purpose chatbot for learning?
Bastani et al.'s preregistered field experiment with roughly 1,000 high-school math students found that unrestricted GPT-4 access boosted in-practice scores by 48% but produced a 17% deficit on a later unassisted exam compared to a control group. An answer engine deployed as a tutor will reliably show good practice metrics and quietly weaken independent capability. The exposure is delayed, which is what makes it dangerous in procurement.
Does the evidence say AI tutoring is bad for learning?
No. Kestin et al. (Harvard physics, 2025) showed a structured AI tutor — expert prompts, prewritten solutions, stepwise scaffolding, active engagement — outperformed in-class active learning at roughly 0.63 standard deviations, with less time on task. Design constraint is the lever, not AI presence. The same Bastani study showed a guarded "GPT Tutor" condition produced large practice gains with no exam penalty.
What should we require of any AI learning system before procurement?
Attempt-first gating before help is offered, hint ladders rather than direct answers, instrumented process data (attempts, revisions, self-explanation), periodic AI-off transfer assessment against a baseline, and teacher or learning-designer authoring of solutions and constraints. If a vendor cannot show these, they are selling an answer engine with a learning label.
Will banning AI solve the problem?
No. Lee et al.'s 2025 CHI survey of 319 knowledge workers across 936 AI-use examples found AI use is already pervasive and that critical thinking was enacted in only 59% of cases. Bans push usage out of view and forfeit the upside seen in the structured-tutor evidence. The leadership task is shaping how AI is used, not whether.