Hardening Tool-Using Agents with Capability Control and Sandboxing
Quick Answer
This checklist hardens tool-using LLM agents, including MCP-based deployments, at the architectural layer where prompt-level defenses fail. It is for platform and security engineers who own the runtime, broker, and sandbox. Use it per release on agent platforms and quarterly on production deployments. It defends against indirect prompt injection, tool hijacking, and capability abuse by separating planning from authority, removing ambient authority, brokering short-lived capabilities, enforcing information flow, and sandboxing untrusted computation and tool servers.
This checklist is the architectural hardening playbook for tool-using LLM agents — capability minting, broker enforcement, sandbox profiles, MCP gateway requirements, and provenance logging. It assumes the planner is untrusted because untrusted text can influence which tool runs, with which arguments, against which data. For threat-model background, see what is indirect prompt injection and what is tool hijacking. For prompt-level mitigations that complement (but do not replace) the controls below, use the sibling tool-using agent hardening checklist. Some implementation specifics are abstracted to avoid mapping onto active vulnerabilities; the underlying source paper carries the full reference architecture.
How to use this checklist
Run this as a per-release gate on the agent platform itself (broker, sandbox profiles, policy compiler) and quarterly on each production deployment that uses it. Ownership sits with the platform security team, not individual agent feature teams. "Done" means every MUST is enforced in code or configuration with a passing fixture, every SHOULD has a documented control or a tracked exception, and a representative trajectory can be reconstructed end-to-end from the audit log. The full reference architecture and rationale live in the sandboxing and capability control paper.
Architecture and identity
3 checks

Separate the planner from authority
MUST
Why it matters
If model output can directly cause a side effect, every prompt injection becomes an authorized action. The planner must propose; a deterministic runtime must decide.
How to implement
Route every tool call, memory write, and external send through a broker and policy engine. No SDK, plugin, or function-calling path may reach a sink without traversing it. Reject "tool dispatcher" libraries that bypass the broker for latency.
Verify it's done
A code search shows zero call sites that invoke a tool client outside the broker. A negative fixture that attempts a direct dispatch fails closed.
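A minimal sketch of the broker-only dispatch pattern (all class and function names here are hypothetical, not from the paper): tool clients are held privately by the broker, the only public entry point consults the policy engine first, and the negative fixture proves a deny-all policy fails closed.

```python
class PolicyDeny(Exception):
    """Raised when the policy engine rejects a proposed call."""

class Broker:
    def __init__(self, policy, tools):
        self._policy = policy      # deterministic policy engine, not the model
        self._tools = dict(tools)  # private: no other path reaches a client

    def call(self, task_id, tool, args):
        decision = self._policy(task_id, tool, args)
        if not decision.get("allow"):
            raise PolicyDeny(decision.get("reason", "policy-deny"))
        return self._tools[tool](**args)

# Negative fixture: planner output must not reach a tool when policy denies.
def test_direct_dispatch_fails_closed():
    broker = Broker(
        policy=lambda task, name, args: {"allow": False, "reason": "deny-all"},
        tools={"send_email": lambda **kw: "sent"},
    )
    try:
        broker.call("task-1", "send_email", {"to": "x@example.com"})
        raise AssertionError("expected PolicyDeny")
    except PolicyDeny:
        pass
```

The key property is structural: the tool map is a private attribute of the broker, so a code search for call sites outside `Broker.call` has a concrete target.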
Remove ambient authority from agent runtimes
MUST
Why it matters
Inherited credentials and sockets are the most common path from a benign-looking tool call to host or cloud compromise. Ambient authority means the agent acts with whatever the surrounding process already had — exactly what an injected instruction will exploit.
How to implement
Build the agent base image with no inherited host env, no home-dir mount, no Docker or SSH agent socket, no shared browser cookies, and no developer cloud credentials. Start from deny and grant per-task.
Verify it's done
A process-inspection script run against a live agent container shows an empty environment except for explicitly granted variables, and no sensitive sockets are present.
Mint short-lived, task-scoped capabilities per session
MUST
Why it matters
Long-lived tokens issued to agents are blast-radius multipliers. A scope enforced natively by the downstream service is the only scope an attacker actually has to defeat.
How to implement
Use GitHub fine-grained tokens, IAM session policies, database views, and object-store signed URLs where the downstream service supports them. Where it doesn't, emulate scope at the broker. TTLs measured in minutes, not hours.
Verify it's done
Token issuance logs show TTLs within policy and per-call audit references the originating task ID. Replay of an expired capability is denied.
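A broker-side sketch of capability minting and checking, for services that lack native scoping (field names and TTL default are illustrative assumptions):

```python
import time
import uuid

def mint_capability(task_id, tool, scope, ttl_s=300):
    """Mint a short-lived, task-scoped capability. TTL in minutes, not hours."""
    return {
        "cap_id": str(uuid.uuid4()),
        "task_id": task_id,   # ties every later audit record to the task
        "tool": tool,
        "scope": scope,       # e.g. {"repo": "org/docs", "access": "read"}
        "expires_at": time.time() + ttl_s,
    }

def check_capability(cap, tool, now=None):
    """Deny expired or mismatched capabilities; return (allow, reason)."""
    now = time.time() if now is None else now
    if now >= cap["expires_at"]:
        return (False, "expired")
    if cap["tool"] != tool:
        return (False, "wrong-tool")
    return (True, "ok")
```

Replay of an expired capability then becomes a deterministic deny rather than a judgment call.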
Tool broker and MCP gateway
4 checks

Maintain a pinned tool inventory with descriptor hashing
MUST
Why it matters
Silent descriptor changes have been the entry point for several disclosed MCP supply-chain incidents. A new verb in a server you already trust is a new privilege.
How to implement
Track every server, version, content hash, owner, and granted scope in a registry. Block startup on hash drift. Semantic-diff added verbs (send, delete, execute, publish) and require human approval before exposing them to planners.
Verify it's done
A registry diff runs in CI; an unapproved descriptor change fails the build. Production agents refuse to load an unpinned server.
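Descriptor pinning and the semantic verb diff can be sketched as follows (the descriptor shape and verb list are assumptions for illustration):

```python
import hashlib
import json

HIGH_RISK_VERBS = {"send", "delete", "execute", "publish"}

def descriptor_hash(descriptor):
    """Canonical content hash of a tool-server descriptor."""
    canonical = json.dumps(descriptor, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def verify_pin(descriptor, pinned_hash):
    """Fail closed on hash drift: refuse to load an unpinned server."""
    return descriptor_hash(descriptor) == pinned_hash

def new_high_risk_verbs(old, new):
    """Semantic diff: tools added since the pinned version that need approval."""
    added = set(new["tools"]) - set(old["tools"])
    return {t for t in added if any(v in t for v in HIGH_RISK_VERBS)}
```

In CI, `verify_pin` failing blocks the build; `new_high_risk_verbs` returning a non-empty set routes the change to human approval.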
Validate every tool argument against schema and capability scope
MUST
Why it matters
Schema-valid JSON can still exceed what the current task is authorized to do. Argument validation against the live capability — not just the static tool definition — is what stops a hijacked planner.
How to implement
At the broker, evaluate each argument against the task's authorization envelope: allowed tools, allowed destinations, data classes, side-effect ceiling. Deny on mismatch and log the decision.
Verify it's done
Deny-fixture tests cover argument values that are valid by schema but out of scope for the test task; all return policy-deny with a structured reason.
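A sketch of envelope-level argument authorization at the broker (envelope fields are hypothetical names for the concepts in the text):

```python
def authorize_call(envelope, tool, args):
    """Evaluate a proposed call against the task's authorization envelope,
    not just the static tool schema. Returns (allow, reason)."""
    if tool not in envelope["allowed_tools"]:
        return (False, f"tool-not-in-envelope:{tool}")
    dest = args.get("destination")
    if dest is not None and dest not in envelope["allowed_destinations"]:
        return (False, f"destination-out-of-scope:{dest}")
    if envelope["side_effects_used"] >= envelope["side_effect_ceiling"]:
        return (False, "side-effect-ceiling-reached")
    return (True, "ok")
```

The deny fixtures then pass schema-valid arguments (a syntactically fine email address, a well-formed path) that fall outside the envelope and expect a structured deny reason.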
Sanitize tool descriptors before exposing them to the planner
SHOULD
Why it matters
Descriptors are model-readable text and have been used to smuggle instructions into the planner context.
How to implement
Strip invisible Unicode, instruction-like phrasing, and embedded markdown links from descriptions. Render an admin-reviewed summary in the planner prompt, not the upstream description verbatim.
Verify it's done
A diff between the source manifest and the rendered descriptor shows the sanitization applied. A canary descriptor containing instruction-like content does not reach the planner unmodified.
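A minimal sanitizer sketch (the phrase patterns are illustrative, not a complete instruction-detection list):

```python
import re
import unicodedata

INSTRUCTION_PAT = re.compile(
    r"(ignore (all|previous)|you must|do not tell|system prompt)", re.I)
MD_LINK_PAT = re.compile(r"\[([^\]]*)\]\([^)]*\)")

def sanitize_description(text):
    """Strip invisible Unicode, markdown link targets, and instruction-like
    phrasing before a descriptor is rendered into the planner prompt."""
    # Drop format characters (zero-width spaces, bidi overrides, ...)
    text = "".join(c for c in text if unicodedata.category(c) != "Cf")
    text = MD_LINK_PAT.sub(r"\1", text)        # keep link text, drop target
    text = INSTRUCTION_PAT.sub("[redacted]", text)
    return text
```

The canary check from the verification step is then: feed a descriptor containing a zero-width-obfuscated instruction through this function and assert the planner-facing output differs.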
Run local stdio MCP servers in a sandbox
MUST
Why it matters
Local MCP servers run as full-privilege child processes by default. Multiple disclosed advisories chain a poisoned input through a local server into command execution against the host.
How to implement
Launch each stdio server with no inherited env, a read-only project mount, a separate scratch volume, no home directory, no SSH/Docker/cloud sockets, and network disabled unless explicitly granted. Pin the package by hash.
Verify it's done
A launch-profile audit shows every stdio server matches the hardened profile. A sandbox-escape fixture fails to read host secrets or open outbound connections.
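One way to build the hardened launch profile is with bubblewrap; the sketch below constructs the argv and assumes `bwrap` is available on the host (mount points and flag choices are illustrative):

```python
def hardened_stdio_argv(server_cmd, project_dir, scratch_dir, allow_network=False):
    """Build a bubblewrap argv for a local stdio MCP server: no inherited
    env, read-only project mount, separate scratch, no home directory or
    sockets, network off unless explicitly granted."""
    argv = [
        "bwrap",
        "--clearenv",                          # no inherited host environment
        "--die-with-parent",
        "--unshare-all",                       # fresh namespaces, incl. network
        "--ro-bind", project_dir, "/project",  # read-only project mount
        "--bind", scratch_dir, "/scratch",     # separate writable scratch
        "--tmpfs", "/tmp",
        "--proc", "/proc",
        "--dev", "/dev",
    ]
    if allow_network:
        argv.append("--share-net")             # explicit grant only
    return argv + ["--"] + server_cmd
```

The launch-profile audit reduces to comparing each server's actual argv against the output of this builder; package pinning by hash happens before the command ever reaches it.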
Information-flow control and data labels
3 checks

Label data on read with confidentiality, integrity, origin, and purpose
SHOULD
Why it matters
Without labels, the broker cannot tell a confidential record from a public one or a trusted instruction from a webpage scraped five tool calls ago. Labels are the substrate that makes information-flow control enforceable.
How to implement
Tag tool results at the broker boundary with confidentiality class, integrity (trust) level, origin (which source produced this token range), purpose, and lifetime. Propagate labels through planner context and into argument provenance.
Verify it's done
Tool-result envelopes in the audit log carry all five fields. A query joining label propagation across a multi-step task reconstructs origin for every argument.
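A sketch of the five-field label and the join rule for derived data (class names and level vocabularies are assumptions; the join semantics — highest confidentiality, lowest integrity — are the standard IFC lattice rule):

```python
from dataclasses import dataclass

CONF_ORDER = ["public", "internal", "confidential"]          # most restrictive last
INTEG_ORDER = ["trusted", "untrusted-internal", "untrusted-external"]  # least trusted last

@dataclass(frozen=True)
class Label:
    confidentiality: str   # one of CONF_ORDER
    integrity: str         # one of INTEG_ORDER
    origin: str            # which source produced this token range
    purpose: str           # why the data was read
    lifetime_s: int        # how long the labelled data may persist

def join(labels):
    """Label join for derived data: highest confidentiality, lowest integrity."""
    return (
        max((l.confidentiality for l in labels), key=CONF_ORDER.index),
        max((l.integrity for l in labels), key=INTEG_ORDER.index),
    )
```

Any argument the planner assembles from multiple sources carries the join of its inputs' labels, which is what makes multi-step provenance queries answerable.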
Block low-integrity content from controlling high-impact actions
MUST
Why it matters
This is the load-bearing IFC rule. If untrusted-external taint can flow into the arguments of a code-execution, external-send, or admin-mutate tool, an indirect injection becomes an authorized action regardless of how the planner was trained.
How to implement
Classify a fixed set of high-impact sinks. Reject (or escalate to human approval) any call where any argument carries an untrusted-external taint. Do not allow the model to "explain away" the taint.
Verify it's done
Adversarial fixtures that route untrusted content into a high-impact argument all return deny or approval-required. A regression suite covers the disclosed MCP exploit classes by pattern, not by payload.
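The load-bearing rule is small enough to show whole; sink names and label shape are hypothetical:

```python
HIGH_IMPACT_SINKS = {"execute_code", "send_external", "admin_mutate"}

def gate_high_impact(tool, arg_labels):
    """Deny or escalate when any argument of a high-impact sink carries
    untrusted-external taint. This check is deterministic and runs outside
    the planner, so the model cannot 'explain away' the taint."""
    if tool not in HIGH_IMPACT_SINKS:
        return "allow"
    if any(l.get("integrity") == "untrusted-external" for l in arg_labels):
        return "deny-or-approve"   # policy choice: hard deny or human approval
    return "allow"
```

The regression suite then asserts this outcome for each exploit class by taint pattern, so a new payload with the same flow shape is caught without a new fixture.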
Require explicit, purpose-bound declassification for confidential-to-external flows
SHOULD
Why it matters
Quiet declassification is how data exfiltration looks normal in retrospect. Every confidential-to-external transition deserves a logged decision.
How to implement
When a confidential label must leave its zone, require a declassification record with approver, purpose, recipient, granularity, and expiration. Treat absence of a record as deny.
Verify it's done
The declassification log can be queried by recipient and purpose; every external write of confidential-class data has a matching record.
Sandboxing untrusted computation
3 checks

Use a strong isolation boundary for generated-code execution
MUST
Why it matters
Generated code is the highest-blast-radius artifact an agent produces. A container-only boundary is insufficient against a determined sandbox-escape attempt.
How to implement
Run code in gVisor, Firecracker microVMs, or an equivalent kernel-isolating runtime. Ephemeral, read-only input mount, separate writable scratch, no default network, no host secrets in env, hard CPU/memory/process/disk/wall-clock limits, deterministic teardown.
Verify it's done
Sandbox-escape regression tests covering filesystem, network, and credential exfiltration vectors all fail to escape. Resource-limit fixtures kill runaway jobs within bound.
Harden any container-only sandboxes that remain in use
SHOULD
Why it matters
Where a kernel-isolating runtime is not yet deployed, the container profile is the only thing between an injection and the host.
How to implement
Apply a tight seccomp profile, AppArmor or SELinux policy, user namespaces, dropped capabilities, read-only root, no Docker socket, and restricted egress. No privileged flag, ever.
Verify it's done
A profile linter compares the running container against the policy and fails on drift. The Docker socket is not present in any agent workload.
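A profile-linter sketch over the output of `docker inspect` (the field names shown — `HostConfig.Privileged`, `HostConfig.ReadonlyRootfs`, `HostConfig.CapDrop`, `Mounts[].Source` — match the Docker Engine inspect shape, but treat this as an assumption and verify against your engine version):

```python
def lint_container(inspect_cfg, policy):
    """Compare a running container's config against the hardening policy;
    return a list of drift findings (empty means compliant)."""
    findings = []
    host = inspect_cfg.get("HostConfig", {})
    if host.get("Privileged"):
        findings.append("privileged-flag-set")          # never allowed
    for mount in inspect_cfg.get("Mounts", []):
        if mount.get("Source") == "/var/run/docker.sock":
            findings.append("docker-socket-mounted")
    if policy.get("readonly_root", True) and not host.get("ReadonlyRootfs"):
        findings.append("root-fs-writable")
    if "ALL" not in set(host.get("CapDrop") or []):
        findings.append("capabilities-not-dropped")
    return findings
```

Run it in CI against every agent workload and fail the pipeline on any non-empty result.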
Isolate browser agents per task
SHOULD
Why it matters
Shared browser state is ambient authority by another name. Cookies and saved logins from one task become tools for an attacker on the next.
How to implement
Fresh profile per task, no default cookies, origin allowlists, download quarantine, upload restrictions, separate observe and act phases, and form-submit confirmation on sensitive domains.
Verify it's done
Per-task lifecycle logs show profile creation and teardown. A second task launched on the same worker has no access to the first task's session.
Memory and retrieval guard
2 checks

Authorize and label every memory write; quarantine externally influenced memories
SHOULD
Why it matters
Memory is a delayed-action injection channel. Content read from an untrusted source today can steer a privileged workflow tomorrow.
How to implement
Label memory entries with origin and integrity. Stage entries derived from untrusted content in a quarantine area; privileged workflows cannot read staged memories until human review or deterministic extraction.
Verify it's done
Memory provenance fields are populated for every entry. A privileged-workflow fixture that attempts to read a staged memory is denied.
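The quarantine flow can be sketched as a two-stage store (names hypothetical); privileged reads see only committed entries, and promotion requires an explicit review step:

```python
class MemoryStore:
    """Memory writes carry origin/integrity labels; entries derived from
    untrusted content are staged and invisible until reviewed."""

    def __init__(self):
        self.committed = []
        self.staged = []

    def write(self, text, origin, integrity):
        entry = {"text": text, "origin": origin, "integrity": integrity}
        if integrity == "untrusted-external":
            self.staged.append(entry)      # quarantine until review
        else:
            self.committed.append(entry)
        return entry

    def approve(self, entry):
        """Human review (or deterministic extraction) promotes a staged entry."""
        self.staged.remove(entry)
        self.committed.append(entry)

    def read(self):
        """Privileged workflows only ever see committed entries."""
        return list(self.committed)
```

The denied-read fixture from the verification step is then a direct assertion that a freshly staged entry is absent from `read()`.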
Partition memory and retrieval by tenant, user, and workflow
MUST
Why it matters
Cross-tenant retrieval is the worst-case outcome of a misconfigured RAG layer and has been the root cause of multiple disclosed agent data-exposure incidents.
How to implement
Enforce retrieval ACLs at the index layer, not the prompt layer. Partition by tenant and user; scope retrieval by workflow. Set expirations on persistent entries.
Verify it's done
Retrieval-ACL fixtures attempting cross-tenant and cross-user reads return empty. An expired entry is unreachable.
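An index-layer partition sketch (an in-memory stand-in for a real vector or keyword index; the point is that the tenant/user filter is part of the query path itself, not the prompt):

```python
import time

class PartitionedIndex:
    """Retrieval ACLs enforced at the index layer: every entry is keyed by
    tenant and user, and queries cannot reach outside their partition."""

    def __init__(self):
        self._docs = []

    def add(self, tenant, user, text, expires_at=None):
        self._docs.append({"tenant": tenant, "user": user,
                           "text": text, "expires_at": expires_at})

    def query(self, tenant, user, needle, now=None):
        now = time.time() if now is None else now
        return [d["text"] for d in self._docs
                if d["tenant"] == tenant and d["user"] == user
                and (d["expires_at"] is None or d["expires_at"] > now)
                and needle in d["text"]]
```

The cross-tenant fixture asserts an empty result, not an error: an attacker probing the index should learn nothing about whether other partitions exist.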
Egress, approval, and audit
4 checks

Treat every external write as a sink with destination allowlists and IFC checks
MUST
Why it matters
Exfiltration channels are creative — image URLs, DNS, package names, commit messages, file names, and browser navigation have all been used in disclosed agent attacks. Email and chat are not the only sinks.
How to implement
Define the full sink set. Apply destination allowlists and information-flow checks uniformly. Scan content for secrets and PII before release. Treat side-channel surfaces (URLs the model constructs, names it chooses) as sinks too.
Verify it's done
Sink-policy fixtures cover each channel and fail closed on disallowed destinations. A canary secret introduced into context does not appear in any external write.
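A uniform sink check can be sketched as below (the AWS-key regex is one example credential pattern; a real deployment would use a full secret-scanning ruleset):

```python
import re

AWS_KEY_PAT = re.compile(r"AKIA[0-9A-Z]{16}")   # example credential pattern

def check_egress(destination, content, allowlist, canary_secrets):
    """Uniform sink check: destination allowlist plus a content scan.
    URLs and names the model constructs are sinks too, so pass them
    through as content."""
    if destination not in allowlist:
        return (False, f"destination-not-allowed:{destination}")
    for secret in canary_secrets:
        if secret in content:
            return (False, "canary-secret-in-egress")
    if AWS_KEY_PAT.search(content):
        return (False, "credential-pattern-in-egress")
    return (True, "ok")
```

Run the same function over every channel in the sink set — email bodies, commit messages, file names, constructed URLs — so no surface gets a weaker check.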
Reserve human approval for risk transitions, not every call
SHOULD
Why it matters
Approval fatigue kills the control. Approvers must see the calls that matter and have the context to decide.
How to implement
Trigger approval on cross-trust-zone moves, durable mutation, code execution, external send, money movement, and privilege escalation. Show the approver the arguments, data origins, capability grant, and downstream effect — not just the tool name.
Verify it's done
An approval-coverage report shows every risk-transition call routed through approval and routine calls auto-allowed. Approver UI screenshots include provenance.
Log full provenance per action
MUST
Why it matters
When something goes wrong, incident response needs to distinguish a model mistake from an injection-driven flow. Without provenance, every incident becomes "the model did a bad thing" and root cause is unrecoverable.
How to implement
For every action, record source data origins, capability grant, policy decision, sink destination, approver (if any), sandbox ID, and a redacted content hash. Make the schema queryable end-to-end.
Verify it's done
A representative trajectory — user request through tool calls to external write — can be reconstructed from logs alone, including which upstream tokens influenced which arguments.
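A sketch of the per-action provenance record (field names are an assumption, not the paper's schema; the content hash stands in for redacted content):

```python
import hashlib
import time

def provenance_record(*, task_id, tool, arg_origins, capability_id,
                      policy_decision, sink, approver=None, sandbox_id=None,
                      content=b""):
    """One queryable record per action: enough to reconstruct, from logs
    alone, which upstream sources influenced which arguments."""
    return {
        "ts": time.time(),
        "task_id": task_id,
        "tool": tool,
        "arg_origins": arg_origins,        # e.g. {"to": "webpage:example.com"}
        "capability_id": capability_id,
        "policy_decision": policy_decision,
        "sink": sink,
        "approver": approver,
        "sandbox_id": sandbox_id,
        "content_sha256": hashlib.sha256(content).hexdigest(),  # redacted hash
    }
```

Trajectory reconstruction is then a join on `task_id` ordered by `ts`, with `arg_origins` answering the "which upstream tokens influenced which arguments" question per step.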
Run adaptive red-team scenarios in CI and pre-prod
NICE
Why it matters
Static fixtures decay. Published benchmarks (AgentDojo, InjecAgent, and their successors) report attack success rates against frontier models high enough that prompt-level hardening alone is empirically inadequate.
How to implement
Maintain a scenario set that exercises indirect-injection patterns and the disclosed MCP exploit classes. Run on every broker or policy change. Treat monitor outputs as advisory; the gating signal is whether deterministic controls held.
Verify it's done
The scenario set has documented pass/fail thresholds. A failing scenario blocks merge to the broker or policy engine.
Acceptance criteria
The checklist is fully implemented when the planner cannot reach any tool, memory write, or external sink except through the broker; when no agent runtime carries inherited host credentials, sockets, or long-lived tokens; when every tool argument is checked against a task-scoped capability and every high-impact call is denied or escalated when an argument carries untrusted-external taint; when generated code runs in a kernel-isolating sandbox with no default network or host secrets; when every external write traverses a sink policy with destination allowlists and IFC checks; and when a representative trajectory can be reconstructed end-to-end from provenance logs alone. A platform that meets these conditions converts indirect prompt injection from authorized action into a denied tool call, a blocked egress attempt, a quarantined memory, or an approval request with full provenance.