What Is Agent Capability Control? Authority, Not Prompts, Is the Boundary
Quick Answer
Agent capability control is the practice of authorizing what a tool-using AI agent can do outside the model, using narrow, expiring, unforgeable grants minted per task. Instead of asking the language model to behave under attacker-controlled input, a deterministic broker hands it explicit handles to specific resources and tools. A successful prompt injection then becomes a denied tool call, not an authorized email, PR, or API spend.
Tool-using AI agents put a probabilistic planner — a language model — at the junction of three things that traditional systems kept apart: untrusted content, private data, and side-effecting capabilities. Once attacker-controlled text can influence which tool runs against which resource, that text is participating in authorization. Capability control is the architectural response: move the authorization decision out of the model and onto narrow, expiring grants enforced by a deterministic runtime.
What is agent capability control?
Agent capability control is the practice of authorizing an AI agent's actions through explicit, unforgeable, attenuable handles — capabilities — rather than through ambient privilege or prompt-level guardrails. A capability names exactly one operation on one resource, with a scope, a lifetime, and provenance. It is minted per task by a service the model does not control, and it is enforced by a broker the model cannot bypass.
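As a minimal sketch of that definition — the field names here are illustrative assumptions, not a standard schema — a capability is a small, immutable record that carries its own scope, lifetime, and provenance:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Illustrative shape only: field names are assumptions, not a standard schema.
@dataclass(frozen=True)
class Capability:
    operation: str        # exactly one operation, e.g. "repo:read"
    resource: str         # exactly one resource, e.g. "owner/proj@main"
    scope: frozenset      # attenuations, e.g. frozenset({"no_attachments"})
    expires_at: datetime  # short-lived: minted per task, not per session
    minted_for: str       # provenance: the task that justified the grant
    minted_by: str        # provenance: the minting service, never the model

    def is_live(self) -> bool:
        return datetime.now(timezone.utc) < self.expires_at
```

Freezing the record matters: the model can quote a capability in its output, but it cannot widen one, because nothing in its reach can mutate or mint these objects.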
The mental model is worth memorizing: the planner proposes; a deterministic broker disposes. The model's job is to suggest the next tool call. The broker's job is to decide whether the capability backing that call exists, is in scope, and is allowed to compose with the data flowing through it. If the capability is absent, the call fails closed regardless of what the model was convinced to do.
In short: capability control treats authorization as code outside the model, not as a request to the model.
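The broker's side of "proposes/disposes" is small enough to sketch. This builds on the Capability record above; the names are illustrative, and the important property is that the function fails closed:

```python
class CapabilityDenied(Exception):
    """No live, in-scope capability backs the proposed tool call."""

def dispose(task_grants: list[Capability], proposed_op: str,
            proposed_resource: str) -> Capability:
    """The planner proposes; this function disposes.
    Fail closed: with no matching grant, the call never reaches the tool,
    no matter what the model was convinced to do."""
    for cap in task_grants:
        if (cap.operation == proposed_op
                and cap.resource == proposed_resource
                and cap.is_live()):
            return cap
    raise CapabilityDenied(f"{proposed_op} on {proposed_resource}: no grant minted")
```

Note what is absent: no inspection of the prompt, no classifier verdict. Authorization is a lookup against grants the model never controlled.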
How does it work?
Capability control works by changing what the agent has, not by changing what the agent reads. The mechanism has six moving parts.
- The boundary moved. In classical sandboxing the unit to confine is code. With agents, the code may be benign and the API authenticated, but the action sequence is chosen by an LLM exposed to attacker text via indirect prompt injection through email, retrieved documents, web pages, issues, or memory. The new boundary is "who or what may cause a capability to be exercised."
- The lethal trifecta. The recurring failure shape is one planner loop holding all three of: an untrusted instruction source, a sensitive data source, and an exfiltration or side-effect channel. Any two are manageable; all three together are exploitable. Public incidents in 2025, including EchoLeak (CVE-2025-32711) and a demonstrated GitHub MCP toxic-flow pattern, are examples of the class — described here only at that class level.
- Why prompt-level defenses are not a security boundary. System prompts, instruction-hierarchy training, and "ignore tool-output instructions" rules ask the same component under attack to police itself. Detectors and classifiers help as signal but are semantically defeatable, and adaptive fuzzing routinely punches through static prompt defenses. Tool allowlists also miss composition: read_email, read_docs, and send_email are each individually fine; their composition is exfiltration (see the sketch after this list). This is the core dynamic in tool hijacking.
- Capability control inverts the model. Instead of inheriting ambient authority — host env vars, cloud creds, browser cookies, every registered MCP tool — the agent receives explicit handles such as repo:read(owner/proj, branch=main), email:send(to=customer@example.com, no_attachments=true), or sandbox:exec(image=python:3.12, no_network=true). Each handle is narrow, attenuable, and expiring.
- Capabilities bind data, action, and purpose. A correct authorization decision is flow-sensitive. It asks which user, which task, which data origins influenced this argument, which capability is being exercised, whether the action crosses a trust zone, whether arguments are tainted by untrusted content, and whether a fresh human confirmation is required. This is not classical RBAC; it is capability plus information flow.
- Capabilities are minted per task and short-lived. A research task does not hold private-repo read. A README-update task does not hold package-publish. A CSV-analysis task does not hold network egress. The minting service prefers downstream-native scopes — fine-grained GitHub tokens, IAM session policies, database views, signed URLs — and the broker emulates scope where the downstream service cannot.
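To make the composition point concrete, here is a minimal mint-time check for the lethal trifecta. The category sets are invented for illustration — a real gateway would derive them from tool metadata — but the logic is the point: grants that pass an allowlist one at a time are refused when, together, they assemble all three legs in a single task.

```python
# Illustrative category tags; a real system would derive these from tool metadata.
UNTRUSTED_INPUT = {"read_email", "read_docs", "browse_web"}
SENSITIVE_READ  = {"read_email", "repo:read", "db:select"}
EXTERNAL_SEND   = {"send_email", "http:post", "pkg:publish"}

def trifecta(requested_ops: set[str]) -> bool:
    """True if one task would hold all three legs of the lethal trifecta.
    read_email, read_docs, and send_email each pass an allowlist alone;
    together they are an exfiltration pipeline."""
    return (bool(requested_ops & UNTRUSTED_INPUT)
            and bool(requested_ops & SENSITIVE_READ)
            and bool(requested_ops & EXTERNAL_SEND))

assert trifecta({"read_email", "read_docs", "send_email"})  # deny at mint time
assert not trifecta({"read_email", "read_docs"})            # two legs: allowable
```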
Information-flow control is the natural complement. Confidentiality and integrity labels travel with data. Untrusted external content is forbidden from selecting privileged actions; high-confidentiality data is forbidden from flowing to public sinks without explicit declassification. CaMeL, Fides, and SAMOS are research statements of this direction. The result is deny by construction: if the task has no external-send capability, the agent cannot exfiltrate by email; if the sandbox has no network, generated code cannot beacon out.
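As a minimal sketch of how labels travel, assuming two-level lattices on both axes (production designs use richer lattices and an explicit declassification workflow):

```python
from dataclasses import dataclass
from enum import IntEnum

class Integrity(IntEnum):
    UNTRUSTED = 0   # externally influenced content
    TRUSTED = 1

class Confidentiality(IntEnum):
    PUBLIC = 0
    SECRET = 1

@dataclass(frozen=True)
class Labeled:
    value: object
    integrity: Integrity
    confidentiality: Confidentiality

def combine(a: Labeled, b: Labeled, value: object) -> Labeled:
    # Labels travel with data: a derived value is only as trustworthy as its
    # weakest input, and as confidential as its most sensitive input.
    return Labeled(value,
                   min(a.integrity, b.integrity),
                   max(a.confidentiality, b.confidentiality))

def check_sink(arg: Labeled, sink_is_public: bool, sink_is_privileged: bool) -> None:
    if sink_is_privileged and arg.integrity is Integrity.UNTRUSTED:
        raise PermissionError("untrusted content may not select a privileged action")
    if sink_is_public and arg.confidentiality is Confidentiality.SECRET:
        raise PermissionError("secret data may not reach a public sink undeclassified")
```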
Why does it matter?
Without capability control, a successful prompt injection becomes an authorized action: a real email, a real PR, a real cloud API call, a real wire transfer. The blast radius equals whatever the agent inherited, which in early deployments is typically "everything the developer or user can do." That is how a single poisoned document, issue comment, or calendar invite turns into a CVSS 9.x zero-click data-exfiltration story.
With capability control, the same injection becomes a denied tool call, a quarantined memory write, a blocked egress attempt, or an approval prompt that shows provenance. The model still gets confused. The runtime still says no. The audit log records a policy denial instead of a breach.
This artifact intentionally withholds operational reproduction details for specific public incidents (EchoLeak, the GitHub MCP toxic flow, MCP server command-injection chains, the Amazon Q Developer extension incident); only architectural class behavior is described. Readers who want full technical depth should consult the cited advisories and the source paper.
How do you defend against it?
These are the load-bearing controls. Each names what it costs and what it does not cover.
- Remove ambient authority. Stop passing host environment variables, home directories, Docker sockets, SSH agents, developer cloud credentials, or shared browser cookies into agent runtimes. Cost: every tool needs an explicit credential plumbing path. Does not cover: downstream tokens that are themselves overscoped — the broker still has to attenuate them.
- Mint narrow, short-lived capabilities per task. A task policy compiler turns the user's request into an authorization envelope — resources, tools, recipients, data classes, expiration — and a minting service issues per-task credentials against it. Cost: policy authoring and a minting service to operate. Does not cover: semantic abuse of capabilities the task legitimately needs.
- Enforce at a tool broker or MCP gateway, outside the model. The broker validates tool-call JSON (which is untrusted model output), pins tool descriptors and diffs changes (see the sketch after this list), injects per-call credentials, strips ambient env, and routes risky tools into sandboxes. Cost: a real gateway component and the operational discipline to keep tool registration funneled through it. Does not cover: vulnerabilities inside the tool implementations themselves; those still need sandboxing.
- Apply information-flow control to compositions. Label data with confidentiality and integrity at ingest. Forbid low-integrity (untrusted-external) data from controlling high-impact actions. Forbid high-confidentiality data from reaching public sinks without explicit declassification. Cost: label plumbing across tools and retrieval, some utility loss, a declassification workflow. Does not cover: semantic laundering — for example, a model summarizing a secret without copying it verbatim — and covert channels.
- Sandbox untrusted computation and tool servers. Hardened containers (seccomp, AppArmor), gVisor, Firecracker microVMs, or WASM modules, chosen by compatibility need. Especially important for code execution, browser automation, and local stdio MCP servers. Cost: ops complexity, latency, image management. Does not cover: abuse of legitimate APIs through capabilities the agent has been granted — that is what capability control handles.
- Reserve human approval for risk transitions. Approval prompts must show provenance — data origins, capability, sink, credential — not just "Allow tool X?" Trigger on crossing trust zones, external sends, durable-state mutation, code execution, spend, or privilege escalation. Cost: UX work and the discipline to keep approvals rare enough to remain meaningful. Does not cover: approval fatigue if triggers are tuned too loose.
- Treat memory and retrieval as untrusted input channels. Quarantine memory writes that originate from externally influenced content. Carry origin and integrity labels on retrieval results. Keep privileged workflows from reading uncurated memory. Cost: a memory guard and tenancy discipline. Does not cover: poisoning detection, which is best-effort signal rather than a boundary.
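Two of the cheaper mechanics above fit in a few lines. The sketch below is illustrative, not any gateway's real API: it spawns a local stdio tool server with an explicitly constructed environment (removing ambient authority) and pins tool descriptors by hash so a silently mutated description or schema fails closed.

```python
import hashlib
import json
import subprocess

def spawn_tool_server(cmd: list[str]) -> subprocess.Popen:
    """Remove ambient authority: start the tool server with an explicitly
    constructed environment instead of inheriting the host's — no cloud
    creds, SSH agent, or developer tokens leak through."""
    minimal_env = {"PATH": "/usr/bin:/bin"}  # plus only what the tool truly needs
    return subprocess.Popen(cmd, env=minimal_env,
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)

def pin(descriptor: dict) -> str:
    """Pin a tool descriptor at registration time by hashing its canonical form."""
    canonical = json.dumps(descriptor, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def verify_descriptor(descriptor: dict, pinned: str) -> None:
    """Diff on every call: a descriptor that changed since registration
    (a rug pull) is rejected before it ever reaches the planner."""
    if pin(descriptor) != pinned:
        raise PermissionError("tool descriptor changed since registration")
```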
The operational counterpart to this list — concrete checks an engineering team can run against an existing agent — lives in the tool-using agent hardening checklist.
Related concepts and tools
- Indirect prompt injection — the dominant input-side attack class that capability control exists to contain.
- Tool hijacking — the concrete failure mode when capability control is absent.
- Tool-using agent hardening checklist — operational controls a platform team can ship this quarter.
- Agentic AI security — the broader topic, including memory poisoning, MCP supply chain, and multi-agent risk.
- Source paper: sandboxing and capability control for tool-using autonomous agents — full taxonomy, deployment patterns, and references.
FAQ
How is capability control different from prompt injection defense?
Prompt-level defenses ask the model to behave under attacker-controlled input. Capability control removes the authority to misbehave in the first place. Detectors and instruction-hierarchy hardening are advisory and semantically defeatable; capabilities are enforced by a deterministic broker outside the model. Capability control assumes the planner is already compromised and constrains the outcome anyway.
What is ambient authority and why is it dangerous for AI agents?
Ambient authority is whatever the agent's host environment can already do — shell environment variables, mounted cloud credentials, an SSH agent, browser cookies, every registered MCP tool. The agent inherits all of it implicitly. Any prompt injection then runs at full user privilege. Object capabilities replace those ambient grants with explicit, narrow, expiring handles minted for one task.
Do I still need prompt-injection detection if I have capability control?
Yes, as telemetry and triage rather than as a boundary. Detectors catch known payload shapes, surface anomalies, and feed audit logs. They are useful for incident response and for tuning policy. They should never be the last gate before a payment, a deletion, an external send, or a privilege transition.
Where do I enforce capability control in a real agent stack?
At a tool broker or MCP gateway sitting outside the model. The broker validates tool-call arguments, enforces per-task capabilities, applies information-flow rules, mints short-lived downstream credentials, and routes risky tool implementations into sandboxes. The model proposes the action; the broker decides whether the capability is exercised.