Explainers

Search-friendly explainers — definitions, attack paths, and defenses, written so that one passage can answer one question.

May 2026·AI Security

What Is Indirect Prompt Injection? Threat Model and Defenses

Indirect prompt injection is an attack where instructions hidden in content an LLM agent reads as data — an email, web page, RAG chunk, tool result, or tool description — are interpreted by the model as operational authority. The user asked for something benign; the attacker, who never spoke to the user, ends up dictating tool calls. Prompt-level defenses reduce but do not eliminate it; the durable fix is to constrain what untrusted text can cause, not what the model can recognize.

May 2026·AI Security

What Is Agent Capability Control? Authority, Not Prompts, Is the Boundary

Agent capability control is the practice of authorizing what a tool-using AI agent can do outside the model, using narrow, expiring, unforgeable grants minted per task. Instead of asking the language model to behave under attacker-controlled input, a deterministic broker hands it explicit handles to specific resources and tools. A successful prompt injection then becomes a denied tool call, not an authorized email, PR, or API spend.

May 2026·Applied Intelligence

What Is a Compound AI System? Why the System Boundary Is the Unit of Analysis

A compound AI system is an automation system in which a foundation model is one component among retrievers, tools, memory, planners, validators, sandboxes, and policy engines that together complete a multi-step objective. Performance and risk are produced by the system boundary, not by a single model call. The model is an untrusted probabilistic planner; the runtime around it is the unit of trust, the unit of authorization, and the unit of accountability.

May 2026·Applied Intelligence

What Is Tool-Use Reliability? The Five-Layer Stack Behind Safe AI Agent Actions

Tool-use reliability is the end-to-end property that an AI agent translates user intent into a tool call that is syntactically valid, schema-conformant, semantically correct, state-consistent, and authorized — and contains failures when it is not. Function calling and structured outputs only address the lower two layers. The upper three layers require validators, policy, and capability control outside the model.

May 2026·Human Learning

What Is a Generative AI Tutor? Architecture, Evidence, and Failure Modes

A generative AI tutor is an adaptive learning system that uses a large language model as the dialogue and explanation layer inside a controlled instructional loop: evidence capture, learner modeling, pedagogical policy, grounded generation, orchestration, and outcome logging. It is the architectural successor to intelligent tutoring systems, not a chatbot bolted onto a curriculum. Its defining design constraint is that the tutor must regulate learner effort, not minimize it.

May 2026·Human Learning

What Is Cognitive Offloading in AI-Assisted Learning? Performance Gains, Learning Loss, and Design Patterns

Cognitive offloading is using an external tool to reduce internal mental work. AI assistants offload many operations at once — framing, decomposition, evaluation — which can produce a performance-learning gap: assisted output improves while independent capacity erodes. Bastani et al. (PNAS 2025) measured a 48% practice gain but a 17% drop on unassisted exams. Pedagogically constrained tutors avoid that gap. The design question is which operations the system performs and which it preserves for the learner.

May 2026·AI Security

What Is Agentic Patch Validation? From Plausible to Deployable in Automated Vulnerability Repair

Agentic patch validation is the problem of deciding whether a patch produced by an LLM repair agent is candidate, plausible, correct, or deployable, and never confusing the four. Most published AVR numbers are plausible-patch numbers: build passes, the original PoC no longer crashes, and existing tests pass. Security engineering needs at least correct, ideally deployable, which requires PoC+ tests, fuzzing as validation, differential checks, and a validator the agent cannot edit.

May 2026·AI Security

What Is Agentic Binary Reverse Engineering? Architecture, State of the Art, and Failure Modes

Agentic binary reverse engineering is an execution architecture, not a model. An LLM-driven loop plans, calls reverse-engineering tools like Ghidra, IDA, angr, and debuggers, observes results, preserves evidence, and revises hypotheses toward a deterministic goal such as malware classification or vulnerability discovery. The shift from assisted to agentic is the closed loop and the chain of evidence; performance is now dominated by how the system reasons and validates, not raw model size.

May 2026·Human Learning

What Is Pedagogical Safety? Protecting Learning in AI Tutoring Systems

Pedagogical safety is protection against avoidable educational harms in AI tutoring systems. A tutor response is pedagogically safe when it preserves the learner's opportunity to retrieve, reason, explain, and transfer — even when withholding help feels less helpful in the moment. It is not content moderation and it is not answer refusal. It is mode-awareness: choosing the right instructional move from learner-state evidence, then evaluating the choice against durable learning outcomes.

Apr 2026·AI Security

What Is Multi-Agent Prompt Injection? Attack Paths and Defenses

Multi-agent prompt injection is indirect prompt injection whose carrier is another agent. An attacker plants instructions in content that an upstream agent reads, summarizes, or relays. By the time the message reaches a planner or tool-using agent, its provenance is laundered and the system treats attacker text as operator intent. The blast radius equals the agent network's tool surface, which is why defenses must be architectural, not just prompt-level.

Apr 2026·AI Security

What Is Voice-Agent Jailbreaking? Why Spoken Prompt Injection Is Different

Voice-agent jailbreaking is prompt injection delivered through a real-time spoken conversation, where an attacker convinces an agent to fire a tool, leak data, or take an action its instructions forbid. The voice channel strips out encoding tricks but adds time pressure, emotional bandwidth, and conversational momentum. The attacks that succeed are social engineering patterns, and the only durable defenses move enforcement out of the system prompt and into deterministic capability and policy checks.

Apr 2026·AI Security

What Are Glitch Tokens? Under-Trained Tokens and the LLM Attack Surface

Glitch tokens are entries in a language model's vocabulary that, when included in a prompt, disproportionately produce anomalous output: incoherence, refusals, truncations, loops, or silent data corruption. The usual root cause is under-training, where the tokenizer's vocabulary contains tokens the model rarely saw during pretraining. Their embeddings are unstable, so any prompt that uses them is unstable too. Glitch tokens are a reliability and security concern, not a single CVE-class bug.

Apr 2026·AI Security

What Is Tool Hijacking? When AI Agents Run the Wrong Function

Tool hijacking is the class of attack where an adversary steers an AI agent's function calls: which tool it invokes, with what arguments, and how it interprets the result. The model's English output is mostly recoverable, but its tool calls are not. Emails get sent, money moves, code runs. Defenses live at the function-call interface: per-session tool wiring, deterministic policy checks, and least-privilege credentials, not prompt-level guardrails alone.

Apr 2026·AI Security

What Is RAG Data Exfiltration? Retrieval-Channel Attacks on LLM Apps

RAG data exfiltration is a class of attack where a retrieval-augmented generation system leaks sensitive data because of what it retrieved. An attacker plants instructions or formatting payloads in the corpus; a legitimate user query pulls them into the model's context; the model echoes attacker text, emits markdown that smuggles data out, or repeats the system prompt. The vector store becomes an instruction surface.

Apr 2026·AI Security

LLM-Assisted Malware Reverse Engineering: What Works and Where the Risks Are

LLM-assisted malware reverse engineering uses a language model as a copilot inside Ghidra or IDA to translate decompiled code into function names, summaries, and step-by-step explanations. It speeds up triage by making the call graph readable in minutes instead of hours. The model proposes; the analyst verifies. It does not unpack binaries, it does not replace analyst judgment, and it introduces three risks: hallucinated annotations, data exfiltration through hosted inference, and adversarial content embedded in the sample itself.