Designing Generative AI Tutors That Protect Learning, Not Just Practice Performance
Quick Answer
This checklist hardens a generative AI tutor against the failure mode recent RCTs have documented: systems that boost practice scores while reducing unaided learning. It targets learning-systems engineers, instructional designers, and learning-experience leads building or reviewing a tutor. Use it per release and before any model-version upgrade. Each check names a control, an implementation pointer, and a verification signal, grouped by execution area from domain model through evaluation and human escalation.
This checklist is for teams shipping a generative AI tutor into a real course, training program, or after-school deployment. It translates the design synthesis in Generative AI tutors and personalized adaptive learning systems into executable controls. The threat model is the result Bastani et al. (PNAS 2025) made unavoidable: an unguarded GPT-4 tutor raised practice grades by 48% while reducing later unassisted exam performance by 17%; a teacher-designed guarded tutor raised practice grades by 127% and largely mitigated the harm. For the conceptual frame and mechanisms, see what is a generative AI tutor. Some implementation specifics — bypass patterns, vendor system prompts, redacted rubric content from cited deployments — are withheld; this checklist stays at the level of design controls and verification signals.
How to use this checklist
Run it once during architecture review, again per release, and again before any model-version upgrade. The owner is whoever signs off on instructional quality — usually the learning-experience or instructional-design lead, partnered with the platform engineer who owns the orchestrator. "Done" means every MUST has a verification signal a reviewer outside the build team can read. Pair this with the source paper's evaluation hierarchy when planning outcome studies.
Domain model and grounding (3 checks)
Maintain an explicit multi-granularity domain model
MUST
Why it matters
A tutor that grounds only in flat document chunks cannot reason about which knowledge component a learner is missing or which misconception is firing. Without explicit objective → KC → misconception → representation → transfer-class structure, hint selection collapses into "say something relevant," which is the regime where helpfulness substitutes for learning.
How to implement
Author the domain model as a versioned data artifact (YAML, database, or knowledge graph) separate from prompts. Every instructional item, hint, and assessment item references a KC and, where applicable, a known misconception. The orchestrator selects content by KC and misconception, not by embedding similarity alone.
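A minimal sketch of such an artifact, rendered here as Python dataclasses rather than YAML; the field names and the KC-plus-misconception lookup are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KnowledgeComponent:
    kc_id: str                    # e.g. "kc.fractions.common-denominator"
    objective_id: str             # the learning objective this KC serves
    transfer_classes: tuple = ()  # near/far transfer item classes

@dataclass(frozen=True)
class Hint:
    hint_id: str
    kc_id: str                    # every hint references a KC
    misconception_id: str | None  # tagged against a known misconception, if any
    text: str

def hints_for(hints: list[Hint], kc_id: str, misconception_id: str) -> list[Hint]:
    """Select by KC and misconception, not by embedding similarity alone."""
    return [h for h in hints
            if h.kc_id == kc_id and h.misconception_id == misconception_id]
```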
Verify it's done
A reviewer can query "show all hints for KC X tagged against misconception Y" and get a non-empty, reviewed result. No hint reaches the learner without a KC tag in the log.
Tag retrieved instructional content by pedagogical type
MUST
Why it matters
A definition, a misconception warning, a Socratic prompt, a worked example, and a rubric play different instructional roles. Untyped retrieval lets the model use a worked example where it should have used a prompt, which is one mechanism by which tutors substitute for productive struggle.
How to implement
Type every corpus item at ingestion (definition, misconception, prompt, worked-example, rubric, transfer-item). The help policy decides which types are eligible at the current hint level and mode; retrieval filters on type before ranking.
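One way this might look in the orchestrator, assuming a content_type field set at ingestion and an illustrative eligibility map per hint level:

```python
from enum import Enum

class ContentType(Enum):
    DEFINITION = "definition"
    MISCONCEPTION = "misconception"
    PROMPT = "prompt"
    WORKED_EXAMPLE = "worked-example"
    RUBRIC = "rubric"
    TRANSFER_ITEM = "transfer-item"

# Which types the help policy permits at each hint level in practice mode.
ELIGIBLE_TYPES_BY_LEVEL = {
    1: {ContentType.PROMPT},
    2: {ContentType.PROMPT, ContentType.DEFINITION},
    3: {ContentType.PROMPT, ContentType.DEFINITION, ContentType.MISCONCEPTION},
    4: {ContentType.WORKED_EXAMPLE},
}

def retrieve_for_turn(candidates, hint_level, rank_key):
    """Filter on pedagogical type BEFORE ranking, never after."""
    eligible = ELIGIBLE_TYPES_BY_LEVEL[hint_level]
    return sorted((c for c in candidates if c.content_type in eligible),
                  key=rank_key)
```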
Verify it's done
Logs show the eligible type set and the selected type for each turn. A randomly sampled session can be replayed to confirm no answer-revealing types appeared before the policy permitted them.
Add an explicit refusal path for out-of-corpus questions
SHOULD
Why it matters
Hallucinated instruction is worse than no instruction; it teaches incorrect content under the tutor's authority. Without a refusal path, the model will fluently answer items the curated corpus does not cover.
How to implement
When retrieval confidence falls below a threshold or no KC matches, the tutor returns a scoped refusal ("this is outside what I've been prepared to teach") and routes to a human or to a "log for content authoring" queue.
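A sketch of the routing logic, with retrieve and respond as stand-in callables and the 0.55 confidence floor an arbitrary placeholder to be tuned per corpus:

```python
def answer_or_refuse(query, retrieve, respond, content_gap_queue, floor=0.55):
    """Refuse rather than hallucinate when grounding is weak."""
    passages, confidence, kc_id = retrieve(query)
    if kc_id is None or confidence < floor:
        content_gap_queue.append(query)  # "log for content authoring" route
        return ("This is outside what I've been prepared to teach. "
                "I've flagged it for review.")
    return respond(passages)
```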
Verify it's done
Synthetic out-of-corpus probes in the regression suite consistently trigger refusal rather than confident answers. The content-gap queue is non-empty and reviewed.
Learner model and help-seeking (2 checks)
Track help-seeking state as a first-class learner-model construct
MUST
Why it matters
Generative AI changes which behaviors predict learning. A learner can complete every item correctly while substituting the tutor for cognition. Modeling only knowledge and misconceptions misses the failure mode the PNAS 2025 result exposed.
How to implement
Add at least four constructs to the learner model: knowledge state, misconception state, self-regulation state, and help-seeking state (productive vs. substitutive). Update help-seeking from per-turn signals: time to first attempt, hint level reached, explanation quality, and the assisted-vs-unassisted correctness gap.
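An illustrative shape for the four constructs and one possible help-seeking update; the thresholds, weights, and smoothing factor are unvalidated assumptions, not a recommended scoring rule:

```python
from dataclasses import dataclass, field

@dataclass
class LearnerModel:
    knowledge: dict = field(default_factory=dict)       # kc_id -> mastery estimate
    misconceptions: dict = field(default_factory=dict)  # misconception_id -> active?
    self_regulation: float = 0.5
    help_seeking: float = 0.5  # 0 = substitutive, 1 = productive

def update_help_seeking(m: LearnerModel, time_to_first_attempt_s: float,
                        hint_level_reached: int, explanation_score: float,
                        assisted_minus_unassisted: float) -> None:
    # Each term contributes 0.25 when the signal points toward productive use.
    signal = (0.25 * (time_to_first_attempt_s > 20)         # attempted before asking
              + 0.25 * (hint_level_reached <= 2)            # stayed low on the ladder
              + 0.25 * (explanation_score >= 0.6)           # explained own reasoning
              + 0.25 * (assisted_minus_unassisted < 0.15))  # narrow correctness gap
    m.help_seeking = 0.8 * m.help_seeking + 0.2 * signal    # exponential smoothing
```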
Verify it's done
Every session writes a help-seeking trajectory. Aggregate dashboards distinguish learners with high assisted correctness and low unassisted correctness from learners gaining on both.
Instrument substitution signals at the turn and session level
SHOULD
Why it matters
Substitution leaves measurable traces — rapid hint escalation, "just tell me" patterns, copy-paste from tutor output to submission, skipped self-explanation prompts. If these are not logged, no dashboard can detect the harm.
How to implement
Emit structured events for hint-level transitions, time-to-first-keystroke, answer-reveal requests, explanation-prompt skips, and clipboard-style paste events into submissions where the platform can observe them. Aggregate per learner and per KC.
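A minimal event emitter, assuming newline-delimited JSON as the log format; the event-type names mirror the signals above but are not a fixed schema:

```python
import json
import time

def emit_event(log, session_id, learner_id, kc_id, event_type, **payload):
    """Append one newline-delimited JSON event per observed signal, e.g.
    hint_level_transition, answer_reveal_request, explanation_prompt_skip,
    paste_into_submission."""
    log.write(json.dumps({
        "ts": time.time(),
        "session_id": session_id,
        "learner_id": learner_id,
        "kc_id": kc_id,
        "event_type": event_type,
        **payload,
    }) + "\n")
```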
Verify it's done
A reviewer can pull the substitution dashboard for any cohort and see distributions, not just averages. At least one alarm threshold (e.g., session-level hint-ladder skip rate) is wired.
Help policy and effort regulation (4 checks)
Document the help policy as a versioned, inspectable artifact
MUST
Why it matters
A help policy buried in a system prompt cannot be reviewed, tested, or governed. Different model versions will interpret the same prose differently, and instructional designers cannot revise policy without prompt-engineering access.
How to implement
Maintain the help policy as code or structured config that the orchestrator reads — when to ask vs. hint vs. show a worked example vs. reveal an answer vs. request self-explanation vs. escalate. The system prompt enforces the policy's outputs but does not encode the policy.
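A sketch of what policy-as-config might look like; the keys and values are illustrative, and in production the structure would live in versioned YAML or JSON with unit tests against it:

```python
# The orchestrator reads this at runtime; the system prompt only enforces
# its outputs, it does not encode the policy.
HELP_POLICY = {
    "version": "2025.1",
    "owner": "instructional-design",
    "practice": {
        "require_attempt_before_level": 2,
        "max_hint_level": 4,
        "self_explain_after_worked_example": True,
    },
    "assessment": {"ai_help": False},
}

def allowed_hint_level(policy, mode, attempts_made):
    cfg = policy[mode]
    if not cfg.get("ai_help", True):
        return 0  # assessment: no AI help at all
    if attempts_made == 0:
        return cfg["require_attempt_before_level"] - 1
    return cfg["max_hint_level"]
```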
Verify it's done
The policy artifact has a version, an owner, and unit tests. A change to the policy produces a diff, not a prompt rewrite.
Enforce attempt gates before substantive help
MUST
Why it matters
Without an attempt gate, the cheapest path through an item is to ask the tutor first. Bastani et al. (PNAS 2025) is the operational evidence that this trains a habit which survives the tutor's removal at exam time.
How to implement
Practice mode requires the learner to commit to something — a prediction, an identified quantity, a chosen formula, partial code, or a stated reasoning step — before the tutor will provide more than an orienting question. Worked-example mode replaces the gate with a comparison or self-explanation requirement.
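A minimal gate, where has_committed_attempt becomes true once the learner has staked a prediction, quantity, formula, partial code, or reasoning step (the names are assumptions):

```python
def gate_help(mode: str, has_committed_attempt: bool, requested_level: int) -> int:
    """In practice mode, anything beyond an orienting question (level 1)
    requires a committed attempt first."""
    if mode == "practice" and requested_level > 1 and not has_committed_attempt:
        return 1  # withhold substantive help; respond with an orienting question
    return requested_level
```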
Verify it's done
Logs show a non-empty learner attempt preceding any hint at level ≥ 2 in practice mode. Sessions where this invariant is violated are flagged as bugs.
Implement a hint ladder and log the level reached
MUST
Why it matters
Tutors that flatten hints into "best response" lose the gradient that protects effort. The level the learner reaches is also the most useful single feature for detecting substitution and over-help.
How to implement
Define a discrete ladder — orienting question → relevant principle → local error cue → partial step → worked example → final answer (only when justified). The orchestrator chooses the next level based on policy and learner state, never skipping unless the policy permits it.
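One possible encoding of the ladder; the skip allowance is shown only to make the "never skipping unless the policy permits it" rule concrete:

```python
from enum import IntEnum

class HintLevel(IntEnum):
    ORIENTING_QUESTION = 1
    RELEVANT_PRINCIPLE = 2
    LOCAL_ERROR_CUE = 3
    PARTIAL_STEP = 4
    WORKED_EXAMPLE = 5
    FINAL_ANSWER = 6  # only when the policy justifies it

def next_hint_level(current, policy_allows_skip=False):
    """Advance one rung at a time unless the policy explicitly permits a skip."""
    if current is None:
        return HintLevel.ORIENTING_QUESTION
    step = 2 if policy_allows_skip else 1
    return HintLevel(min(current + step, HintLevel.FINAL_ANSWER))
```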
Verify it's done
Every helpful turn logs hint_level. Per-KC distributions of hint level are inspectable. Learners who routinely reach the highest level surface in the substitution dashboard.
Make mode separation explicit to the orchestrator and to the learner
MUST
Why it matters
The same prompt should produce different behavior in practice, worked-example, and assessment contexts. Blurring the boundary is how assessment integrity erodes and how worked-example fading collapses into answer-giving.
How to implement
The orchestrator maintains an explicit mode. Practice withholds answers; worked-example shows but requires comparison or fading; assessment disables AI help entirely. The UI surfaces the current mode to the learner. Homework/study contexts default to practice mode unless the instructor configures otherwise.
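A sketch of mode as an explicit orchestrator value with a per-mode disclosure ceiling; the mapping values are illustrative labels, not a defined API:

```python
from enum import Enum

class Mode(Enum):
    PRACTICE = "practice"              # withholds answers
    WORKED_EXAMPLE = "worked-example"  # shows, but requires comparison/fading
    ASSESSMENT = "assessment"          # AI help disabled

DISCLOSURE_CEILING = {
    Mode.PRACTICE: "hints_only",
    Mode.WORKED_EXAMPLE: "solution_with_required_self_explanation",
    Mode.ASSESSMENT: "none",
}
```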
Verify it's done
Mode is recorded on every turn. A simulated assessment session confirms the tutor refuses help. UX review confirms the learner can see the active mode without hunting for it.
Verification and dialogue integrity (2 checks)
Call external verifiers in formal domains
MUST
Why it matters
Solving ability is not tutoring ability — see the solver-tutor gap. In math, code, and proof, an LLM that asserts correctness without a verifier will eventually authorize wrong answers fluently, and the tutor's authority makes that worse than a search result.
How to implement
Route formal claims through a symbolic solver, code sandbox, or proof checker before delivering them. Where a verifier disagrees with the model, the tutor either rephrases as a hint or escalates rather than overriding the verifier.
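A sketch of the delivery rule, with the verifier, hint-rephrasing, and logging injected as callables since the concrete verifier (solver, sandbox, proof checker) depends on the domain:

```python
def deliver_formal_turn(answer, verify, rephrase_as_hint, log_disagreement):
    """No final-answer turn ships in a formal domain without a verifier pass."""
    if verify(answer):
        return {"type": "final_answer", "content": answer}
    # On disagreement, never override the verifier: demote to a hint
    # (or escalate) and log the event for content triage.
    log_disagreement(answer)
    return {"type": "hint", "content": rephrase_as_hint(answer)}
```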
Verify it's done
Verifier-disagreement events appear in the log and trend toward zero with content fixes. No final-answer turn in a formal domain ships without a verifier pass.
Maintain dialogue-level coherence state across turns
SHOULD
Why it matters
MathTutorBench (Macina et al., EMNLP 2025) shows tutoring quality degrades over multi-turn dialogues. Without a coherence layer, the tutor forgets which hints it has given, which misconceptions remain, and which reasoning steps the learner has committed to.
How to implement
Carry a structured dialogue state — attempts made, hints given by level, misconceptions still active, reasoning steps the learner has staked — separate from the chat transcript. The orchestrator reads from this state, not from raw history, when choosing the next move.
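An illustrative dialogue-state container and a decision function that reads only from it; the move names are placeholders:

```python
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    attempts: list = field(default_factory=list)         # attempts made so far
    hints_given: dict = field(default_factory=dict)      # hint_level -> hint ids
    active_misconceptions: set = field(default_factory=set)
    committed_steps: list = field(default_factory=list)  # reasoning steps staked

def choose_next_move(state: DialogueState, max_level: int) -> str:
    # Reads only structured state, never raw history, so replaying the
    # state log deterministically reproduces the decision.
    if not state.attempts:
        return "ask_orienting_question"
    if state.active_misconceptions:
        return "cue_local_error"
    highest = max(state.hints_given, default=0)
    return "hint" if highest < max_level else "escalate"
```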
Verify it's done
Replaying a session deterministically reproduces the orchestrator's decisions from the dialogue-state log. Long-dialogue regression tests do not show degradation in scaffolded-help selection.
Evaluation, governance, and escalation (5 checks)
Schedule unassisted retrieval and transfer checkpoints
MUST
Why it matters
Practice correctness with the tutor available is the metric most likely to mislead. Without scheduled unassisted items, the team cannot detect the PNAS-style harm before it ships.
How to implement
Insert periodic checkpoint items the learner solves with AI help disabled. Include near-transfer and at least one far-transfer class per KC cluster. Report assisted-vs-unassisted gap by learner, KC, and cohort.
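A minimal gap computation, assuming a turn log with kc_id, assisted, and correct fields (an assumed schema, not a fixed one):

```python
from statistics import mean

def assisted_unassisted_gap(turns):
    """Per-KC gap between correctness with help available and correctness at
    checkpoints where AI help was disabled."""
    by_kc = {}
    for t in turns:
        bucket = by_kc.setdefault(t["kc_id"], {"assisted": [], "unassisted": []})
        bucket["assisted" if t["assisted"] else "unassisted"].append(t["correct"])
    return {kc: mean(v["assisted"]) - mean(v["unassisted"])
            for kc, v in by_kc.items()
            if v["assisted"] and v["unassisted"]}
```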
Verify it's done
The cohort dashboard plots assisted and unassisted correctness side by side. Gap-widening is treated as a release-blocking signal, not a curiosity.
Evaluate tutor quality and learning gains against an active baseline
SHOULD
Why it matters
"AI tutor vs. nothing" is not a defensible comparison. Existing adaptive practice, active-learning lectures, and human tutoring have known effect sizes; a new tutor must be measured against one of them, and tutoring-quality benchmarks are how to catch regressions before learners do.
How to implement
Use tutor-specific benchmarks (MathTutorBench, MRBench, TutorBench, BEA 2025) for response-quality regression. For learning evaluations, run RCT or quasi-experimental designs against an active comparator — not an unguarded LLM and not a no-tutor control alone.
Verify it's done
A standing benchmark report exists per release. Outcome studies are pre-registered or at minimum design-reviewed before launch. Comparator is named in the study plan.
Pin model versions and gate upgrades on regression and outcomes
SHOULD
Why it matters
A silent provider model upgrade can change help-policy behavior, verifier disagreement rates, and substitution patterns overnight. Treating the model as a stable commodity ignores the component most likely to change without notice.
How to implement
Pin model versions per environment. Log the version on every session. Gate upgrades on the tutor-benchmark regression suite and on a smaller live-cohort revalidation of unassisted gains before broad rollout.
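A sketch of the gate, with the report field names (suite_passed, unassisted_gain_delta) as assumptions about what the regression and revalidation artifacts expose:

```python
PINNED_MODELS = {"prod": "model-2025-01-15", "staging": "model-2025-03-02"}

def gate_upgrade(candidate_version, regression_report, revalidation):
    """An upgrade ships only if the tutor-benchmark suite passes AND the
    live-cohort revalidation shows no loss in unassisted gains."""
    ok = (regression_report["suite_passed"]
          and revalidation["unassisted_gain_delta"] >= 0)
    if ok:
        PINNED_MODELS["prod"] = candidate_version
    return ok
```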
Verify it's done
Session logs include model_version. A rollback path exists and has been exercised. No production upgrade has shipped without a passing regression report.
Define and route human-escalation rules
SHOULD
Why it matters
Some signals are not the tutor's to handle — repeated misconception after multiple scaffolds, affective distress, integrity risk, ambiguous correctness, low-confidence/low-performance combinations, and high-stakes assessment prep. The tutor's defaults will paper over these without an explicit route.
How to implement
Encode escalation rules in the help policy. Surface them to the responsible facilitator (instructor, coach, support staff) with the conversation context the human needs. For deployments without a human in the loop, the escalation becomes a documented refusal plus a referral.
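One way the rules might be encoded, as predicate-route pairs evaluated over a session summary; the field names and routes are illustrative:

```python
ESCALATION_RULES = [
    (lambda s: s["same_misconception_scaffolds"] >= 3, "instructor"),
    (lambda s: s["affective_distress_flag"], "support_staff"),
    (lambda s: s["integrity_risk_flag"], "instructor"),
]

def route_escalations(session_summary, notify):
    for predicate, route in ESCALATION_RULES:
        if predicate(session_summary):
            notify(route, session_summary)  # include the conversation context
```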
Verify it's done
Escalation events are auditable. A facilitator can show which escalations they received last week and what they did about them. No escalation rule is dead code.
Surface assisted-vs-unassisted gaps in the facilitator dashboard
NICE
Why it matters
Time-on-task and engagement dashboards are precisely the metrics on which a substituting tutor will look successful. The facilitator needs the metrics that distinguish learning from completion.
How to implement
The facilitator view surfaces top misconceptions, students needing review, KCs with high assisted but low unassisted performance, over-help patterns, and explanation-quality trends. Engagement metrics are present but not primary.
Verify it's done
A facilitator can answer "which of my students has the largest assisted-vs-unassisted gap on KC X?" without opening a notebook. The dashboard's primary chart is not time on task.
Acceptance criteria
The checklist is fully implemented when (1) every MUST has a verification signal a reviewer outside the build team can read; (2) the help policy is a versioned artifact, not a prompt; (3) substitution signals and assisted-vs-unassisted gaps are visible on the cohort and facilitator dashboards; (4) at least one outcome study against an active baseline is pre-registered or completed; (5) the regression suite covers tutoring-quality benchmarks and gates model upgrades; and (6) sibling artifacts in the checklists library covering content authoring and outcome-study design are linked from the program runbook. A tutor that meets these criteria can still be wrong about content; one that does not meet them cannot reliably distinguish learning from practice performance, which is the failure this checklist exists to prevent.