Hardening Tool-Using Agents with Capability Control and Sandboxing
Quick Answer
This checklist hardens tool-using LLM agents, including MCP-based deployments, at the architectural layer where prompt-level defenses fail. It is for platform and security engineers who own the runtime, broker, and sandbox. Use it per release on agent platforms and quarterly on production deployments. It defends against indirect prompt injection, tool hijacking, and capability abuse by separating planning from authority, removing ambient authority, brokering short-lived capabilities, enforcing information flow, and sandboxing untrusted computation and tool servers.
This checklist is the architectural hardening playbook for tool-using LLM agents — capability minting, broker enforcement, sandbox profiles, MCP gateway requirements, and provenance logging. It assumes the planner is untrusted because untrusted text can influence which tool runs, with which arguments, against which data. For threat-model background, see what is indirect prompt injection and what is tool hijacking. For prompt-level mitigations that complement (but do not replace) the controls below, use the sibling tool-using agent hardening checklist. Some implementation specifics are abstracted to avoid mapping onto active vulnerabilities; the underlying source paper carries the full reference architecture.
How to use this checklist
Run this as a per-release gate on the agent platform itself (broker, sandbox profiles, policy compiler) and quarterly on each production deployment that uses it. Ownership sits with the platform security team, not individual agent feature teams. "Done" means every MUST is enforced in code or configuration with a passing fixture, every SHOULD has a documented control or a tracked exception, and a representative trajectory can be reconstructed end-to-end from the audit log. The full reference architecture and rationale live in the sandboxing and capability control paper.
Architecture and identity
3 checks

Separate the planner from authority
MUST
Why it matters
If model output can directly cause a side effect, every prompt injection becomes an authorized action. The planner must propose; a deterministic runtime must decide.
How to implement
Route every tool call, memory write, and external send through a broker and policy engine. No SDK, plugin, or function-calling path may reach a sink without traversing it. Reject "tool dispatcher" libraries that bypass the broker for latency.
Verify it's done
A code search shows zero call sites that invoke a tool client outside the broker. A negative fixture that attempts a direct dispatch fails closed.
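A minimal sketch of the broker-only dispatch pattern (all class and function names here are hypothetical, not from the paper): tool clients are held privately by the broker, the only public entry point consults the policy engine first, and the negative fixture proves a deny-all policy fails closed.

```python
class PolicyDeny(Exception):
    """Raised when the policy engine rejects a proposed call."""

class Broker:
    def __init__(self, policy, tools):
        self._policy = policy      # deterministic policy engine, not the model
        self._tools = dict(tools)  # private: no other path reaches a client

    def call(self, task_id, tool, args):
        decision = self._policy(task_id, tool, args)
        if not decision.get("allow"):
            raise PolicyDeny(decision.get("reason", "policy-deny"))
        return self._tools[tool](**args)

# Negative fixture: planner output must not reach a tool when policy denies.
def test_direct_dispatch_fails_closed():
    broker = Broker(
        policy=lambda task, name, args: {"allow": False, "reason": "deny-all"},
        tools={"send_email": lambda **kw: "sent"},
    )
    try:
        broker.call("task-1", "send_email", {"to": "x@example.com"})
        raise AssertionError("expected PolicyDeny")
    except PolicyDeny:
        pass
```

The key property is structural: the tool map is a private attribute of the broker, so a code search for call sites outside `Broker.call` has a concrete target.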
Remove ambient authority from agent runtimes
MUST
Why it matters
Inherited credentials and sockets are the most common path from a benign-looking tool call to host or cloud compromise. Ambient authority means the agent acts with whatever the surrounding process already had — exactly what an injected instruction will exploit.
How to implement
Build the agent base image with no inherited host env, no home-dir mount, no Docker or SSH agent socket, no shared browser cookies, and no developer cloud credentials. Start from deny and grant per-task.
Verify it's done
A process-inspection script run against a live agent container shows an empty environment except for explicitly granted variables, and no sensitive sockets are present.
Mint short-lived, task-scoped capabilities per session
MUST
Why it matters
Long-lived tokens issued to agents are blast-radius multipliers. A scope enforced natively by the downstream service is the only scope an attacker actually has to defeat.
How to implement
Use GitHub fine-grained tokens, IAM session policies, database views, and object-store signed URLs where the downstream service supports them. Where it doesn't, emulate scope at the broker. TTLs measured in minutes, not hours.
Verify it's done
Token issuance logs show TTLs within policy and per-call audit references the originating task ID. Replay of an expired capability is denied.
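A broker-side sketch of capability minting and checking, for services that lack native scoping (field names and TTL default are illustrative assumptions):

```python
import time
import uuid

def mint_capability(task_id, tool, scope, ttl_s=300):
    """Mint a short-lived, task-scoped capability. TTL in minutes, not hours."""
    return {
        "cap_id": str(uuid.uuid4()),
        "task_id": task_id,   # ties every later audit record to the task
        "tool": tool,
        "scope": scope,       # e.g. {"repo": "org/docs", "access": "read"}
        "expires_at": time.time() + ttl_s,
    }

def check_capability(cap, tool, now=None):
    """Deny expired or mismatched capabilities; return (allow, reason)."""
    now = time.time() if now is None else now
    if now >= cap["expires_at"]:
        return (False, "expired")
    if cap["tool"] != tool:
        return (False, "wrong-tool")
    return (True, "ok")
```

Replay of an expired capability then becomes a deterministic deny rather than a judgment call.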
Tool broker and MCP gateway
4 checks

Maintain a pinned tool inventory with descriptor hashing
MUST
Why it matters
Silent descriptor changes have been the entry point for several disclosed MCP supply-chain incidents. A new verb in a server you already trust is a new privilege.
How to implement
Track every server, version, content hash, owner, and granted scope in a registry. Block startup on hash drift. Semantic-diff added verbs (send, delete, execute, publish) and require human approval before exposing them to planners.
Verify it's done
A registry diff runs in CI; an unapproved descriptor change fails the build. Production agents refuse to load an unpinned server.
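Descriptor pinning and the semantic verb diff can be sketched as follows (the descriptor shape and verb list are assumptions for illustration):

```python
import hashlib
import json

HIGH_RISK_VERBS = {"send", "delete", "execute", "publish"}

def descriptor_hash(descriptor):
    """Canonical content hash of a tool-server descriptor."""
    canonical = json.dumps(descriptor, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def verify_pin(descriptor, pinned_hash):
    """Fail closed on hash drift: refuse to load an unpinned server."""
    return descriptor_hash(descriptor) == pinned_hash

def new_high_risk_verbs(old, new):
    """Semantic diff: tools added since the pinned version that need approval."""
    added = set(new["tools"]) - set(old["tools"])
    return {t for t in added if any(v in t for v in HIGH_RISK_VERBS)}
```

In CI, `verify_pin` failing blocks the build; `new_high_risk_verbs` returning a non-empty set routes the change to human approval.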
Validate every tool argument against schema and capability scope
MUST
Why it matters
Schema-valid JSON can still exceed what the current task is authorized to do. Argument validation against the live capability — not just the static tool definition — is what stops a hijacked planner.
How to implement
At the broker, evaluate each argument against the task's authorization envelope: allowed tools, allowed destinations, data classes, side-effect ceiling. Deny on mismatch and log the decision.
Verify it's done
Deny-fixture tests cover argument values that are valid by schema but out of scope for the test task; all return policy-deny with a structured reason.
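A sketch of envelope-level argument authorization at the broker (envelope fields are hypothetical names for the concepts in the text):

```python
def authorize_call(envelope, tool, args):
    """Evaluate a proposed call against the task's authorization envelope,
    not just the static tool schema. Returns (allow, reason)."""
    if tool not in envelope["allowed_tools"]:
        return (False, f"tool-not-in-envelope:{tool}")
    dest = args.get("destination")
    if dest is not None and dest not in envelope["allowed_destinations"]:
        return (False, f"destination-out-of-scope:{dest}")
    if envelope["side_effects_used"] >= envelope["side_effect_ceiling"]:
        return (False, "side-effect-ceiling-reached")
    return (True, "ok")
```

The deny fixtures then pass schema-valid arguments (a syntactically fine email address, a well-formed path) that fall outside the envelope and expect a structured deny reason.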
Sanitize tool descriptors before exposing them to the planner
SHOULD
Why it matters
Descriptors are model-readable text and have been used to smuggle instructions into the planner context.
How to implement
Strip invisible Unicode, instruction-like phrasing, and embedded markdown links from descriptions. Render an admin-reviewed summary in the planner prompt, not the upstream description verbatim.
Verify it's done
A diff between the source manifest and the rendered descriptor shows the sanitization applied. A canary descriptor containing instruction-like content does not reach the planner unmodified.
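A minimal sanitizer sketch (the phrase patterns are illustrative, not a complete instruction-detection list):

```python
import re
import unicodedata

INSTRUCTION_PAT = re.compile(
    r"(ignore (all|previous)|you must|do not tell|system prompt)", re.I)
MD_LINK_PAT = re.compile(r"\[([^\]]*)\]\([^)]*\)")

def sanitize_description(text):
    """Strip invisible Unicode, markdown link targets, and instruction-like
    phrasing before a descriptor is rendered into the planner prompt."""
    # Drop format characters (zero-width spaces, bidi overrides, ...)
    text = "".join(c for c in text if unicodedata.category(c) != "Cf")
    text = MD_LINK_PAT.sub(r"\1", text)        # keep link text, drop target
    text = INSTRUCTION_PAT.sub("[redacted]", text)
    return text
```

The canary check from the verification step is then: feed a descriptor containing a zero-width-obfuscated instruction through this function and assert the planner-facing output differs.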
Run local stdio MCP servers in a sandbox
MUST
Why it matters
Local MCP servers run as full-privilege child processes by default. Multiple disclosed advisories chain a poisoned input through a local server into command execution against the host.
How to implement
Launch each stdio server with no inherited env, a read-only project mount, a separate scratch volume, no home directory, no SSH/Docker/cloud sockets, and network disabled unless explicitly granted. Pin the package by hash.
Verify it's done
A launch-profile audit shows every stdio server matches the hardened profile. A sandbox-escape fixture fails to read host secrets or open outbound connections.
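One way to build the hardened launch profile is with bubblewrap; the sketch below constructs the argv and assumes `bwrap` is available on the host (mount points and flag choices are illustrative):

```python
def hardened_stdio_argv(server_cmd, project_dir, scratch_dir, allow_network=False):
    """Build a bubblewrap argv for a local stdio MCP server: no inherited
    env, read-only project mount, separate scratch, no home directory or
    sockets, network off unless explicitly granted."""
    argv = [
        "bwrap",
        "--clearenv",                          # no inherited host environment
        "--die-with-parent",
        "--unshare-all",                       # fresh namespaces, incl. network
        "--ro-bind", project_dir, "/project",  # read-only project mount
        "--bind", scratch_dir, "/scratch",     # separate writable scratch
        "--tmpfs", "/tmp",
        "--proc", "/proc",
        "--dev", "/dev",
    ]
    if allow_network:
        argv.append("--share-net")             # explicit grant only
    return argv + ["--"] + server_cmd
```

The launch-profile audit reduces to comparing each server's actual argv against the output of this builder; package pinning by hash happens before the command ever reaches it.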
Information-flow control and data labels
3 checks

Label data on read with confidentiality, integrity, origin, and purpose
SHOULD
Why it matters
Without labels, the broker cannot tell a confidential record from a public one or a trusted instruction from a webpage scraped five tool calls ago. Labels are the substrate that makes information-flow control enforceable.
How to implement
Tag tool results at the broker boundary with confidentiality class, integrity (trust) level, origin (which source produced this token range), purpose, and lifetime. Propagate labels through planner context and into argument provenance.
Verify it's done
Tool-result envelopes in the audit log carry all five fields. A query joining label propagation across a multi-step task reconstructs origin for every argument.
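A sketch of the five-field label and the join rule for derived data (class names and level vocabularies are assumptions; the join semantics — highest confidentiality, lowest integrity — are the standard IFC lattice rule):

```python
from dataclasses import dataclass

CONF_ORDER = ["public", "internal", "confidential"]          # most restrictive last
INTEG_ORDER = ["trusted", "untrusted-internal", "untrusted-external"]  # least trusted last

@dataclass(frozen=True)
class Label:
    confidentiality: str   # one of CONF_ORDER
    integrity: str         # one of INTEG_ORDER
    origin: str            # which source produced this token range
    purpose: str           # why the data was read
    lifetime_s: int        # how long the labelled data may persist

def join(labels):
    """Label join for derived data: highest confidentiality, lowest integrity."""
    return (
        max((l.confidentiality for l in labels), key=CONF_ORDER.index),
        max((l.integrity for l in labels), key=INTEG_ORDER.index),
    )
```

Any argument the planner assembles from multiple sources carries the join of its inputs' labels, which is what makes multi-step provenance queries answerable.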
Block low-integrity content from controlling high-impact actions
MUST
Why it matters
This is the load-bearing IFC rule. If untrusted-external taint can flow into the arguments of a code-execution, external-send, or admin-mutate tool, an indirect injection becomes an authorized action regardless of how the planner was trained.
How to implement
Classify a fixed set of high-impact sinks. Reject (or escalate to human approval) any call where any argument carries an untrusted-external taint. Do not allow the model to "explain away" the taint.
Verify it's done
Adversarial fixtures that route untrusted content into a high-impact argument all return deny or approval-required. A regression suite covers the disclosed MCP exploit classes by pattern, not by payload.
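The load-bearing rule is small enough to show whole; sink names and label shape are hypothetical:

```python
HIGH_IMPACT_SINKS = {"execute_code", "send_external", "admin_mutate"}

def gate_high_impact(tool, arg_labels):
    """Deny or escalate when any argument of a high-impact sink carries
    untrusted-external taint. This check is deterministic and runs outside
    the planner, so the model cannot 'explain away' the taint."""
    if tool not in HIGH_IMPACT_SINKS:
        return "allow"
    if any(l.get("integrity") == "untrusted-external" for l in arg_labels):
        return "deny-or-approve"   # policy choice: hard deny or human approval
    return "allow"
```

The regression suite then asserts this outcome for each exploit class by taint pattern, so a new payload with the same flow shape is caught without a new fixture.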
Require explicit, purpose-bound declassification for confidential-to-external flows
SHOULD
Why it matters
Quiet declassification is how data exfiltration looks normal in retrospect. Every confidential-to-external transition deserves a logged decision.
How to implement
When a confidential label must leave its zone, require a declassification record with approver, purpose, recipient, granularity, and expiration. Treat absence of a record as deny.
Verify it's done
The declassification log can be queried by recipient and purpose; every external write of confidential-class data has a matching record.
Sandboxing untrusted computation
3 checks

Use a strong isolation boundary for generated-code execution
MUST
Why it matters
Generated code is the highest-blast-radius artifact an agent produces. A container-only boundary is insufficient against a determined sandbox-escape attempt.
How to implement
Run code in gVisor, Firecracker microVMs, or an equivalent kernel-isolating runtime. Ephemeral, read-only input mount, separate writable scratch, no default network, no host secrets in env, hard CPU/memory/process/disk/wall-clock limits, deterministic teardown.
Verify it's done
Sandbox-escape regression tests covering filesystem, network, and credential exfiltration vectors all fail to escape. Resource-limit fixtures kill runaway jobs within bound.
Harden any container-only sandboxes that remain in use
SHOULD
Why it matters
Where a kernel-isolating runtime is not yet deployed, the container profile is the only thing between an injection and the host.
How to implement
Apply a tight seccomp profile, AppArmor or SELinux policy, user namespaces, dropped capabilities, read-only root, no Docker socket, and restricted egress. No privileged flag, ever.
Verify it's done
A profile linter compares the running container against the policy and fails on drift. The Docker socket is not present in any agent workload.
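A profile-linter sketch over the output of `docker inspect` (the field names shown — `HostConfig.Privileged`, `HostConfig.ReadonlyRootfs`, `HostConfig.CapDrop`, `Mounts[].Source` — match the Docker Engine inspect shape, but treat this as an assumption and verify against your engine version):

```python
def lint_container(inspect_cfg, policy):
    """Compare a running container's config against the hardening policy;
    return a list of drift findings (empty means compliant)."""
    findings = []
    host = inspect_cfg.get("HostConfig", {})
    if host.get("Privileged"):
        findings.append("privileged-flag-set")          # never allowed
    for mount in inspect_cfg.get("Mounts", []):
        if mount.get("Source") == "/var/run/docker.sock":
            findings.append("docker-socket-mounted")
    if policy.get("readonly_root", True) and not host.get("ReadonlyRootfs"):
        findings.append("root-fs-writable")
    if "ALL" not in set(host.get("CapDrop") or []):
        findings.append("capabilities-not-dropped")
    return findings
```

Run it in CI against every agent workload and fail the pipeline on any non-empty result.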
Isolate browser agents per task
SHOULD
Why it matters
Shared browser state is ambient authority by another name. Cookies and saved logins from one task become tools for an attacker on the next.
How to implement
Fresh profile per task, no default cookies, origin allowlists, download quarantine, upload restrictions, separate observe and act phases, and form-submit confirmation on sensitive domains.
Verify it's done
Per-task lifecycle logs show profile creation and teardown. A second task launched on the same worker has no access to the first task's session.
Memory and retrieval guard
2 checks

Authorize and label every memory write; quarantine externally influenced memories
SHOULD
Why it matters
Memory is a delayed-action injection channel. Content read from an untrusted source today can steer a privileged workflow tomorrow.
How to implement
Label memory entries with origin and integrity. Stage entries derived from untrusted content in a quarantine area; privileged workflows cannot read staged memories until human review or deterministic extraction.
Verify it's done
Memory provenance fields are populated for every entry. A privileged-workflow fixture that attempts to read a staged memory is denied.
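The quarantine flow can be sketched as a two-stage store (names hypothetical); privileged reads see only committed entries, and promotion requires an explicit review step:

```python
class MemoryStore:
    """Memory writes carry origin/integrity labels; entries derived from
    untrusted content are staged and invisible until reviewed."""

    def __init__(self):
        self.committed = []
        self.staged = []

    def write(self, text, origin, integrity):
        entry = {"text": text, "origin": origin, "integrity": integrity}
        if integrity == "untrusted-external":
            self.staged.append(entry)      # quarantine until review
        else:
            self.committed.append(entry)
        return entry

    def approve(self, entry):
        """Human review (or deterministic extraction) promotes a staged entry."""
        self.staged.remove(entry)
        self.committed.append(entry)

    def read(self):
        """Privileged workflows only ever see committed entries."""
        return list(self.committed)
```

The denied-read fixture from the verification step is then a direct assertion that a freshly staged entry is absent from `read()`.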
Partition memory and retrieval by tenant, user, and workflow
MUST
Why it matters
Cross-tenant retrieval is the worst-case outcome of a misconfigured RAG layer and has been the root cause of multiple disclosed agent data-exposure incidents.
How to implement
Enforce retrieval ACLs at the index layer, not the prompt layer. Partition by tenant and user; scope retrieval by workflow. Set expirations on persistent entries.
Verify it's done
Retrieval-ACL fixtures attempting cross-tenant and cross-user reads return empty. An expired entry is unreachable.
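An index-layer partition sketch (an in-memory stand-in for a real vector or keyword index; the point is that the tenant/user filter is part of the query path itself, not the prompt):

```python
import time

class PartitionedIndex:
    """Retrieval ACLs enforced at the index layer: every entry is keyed by
    tenant and user, and queries cannot reach outside their partition."""

    def __init__(self):
        self._docs = []

    def add(self, tenant, user, text, expires_at=None):
        self._docs.append({"tenant": tenant, "user": user,
                           "text": text, "expires_at": expires_at})

    def query(self, tenant, user, needle, now=None):
        now = time.time() if now is None else now
        return [d["text"] for d in self._docs
                if d["tenant"] == tenant and d["user"] == user
                and (d["expires_at"] is None or d["expires_at"] > now)
                and needle in d["text"]]
```

The cross-tenant fixture asserts an empty result, not an error: an attacker probing the index should learn nothing about whether other partitions exist.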
Egress, approval, and audit
4 checks

Treat every external write as a sink with destination allowlists and IFC checks
MUST
Why it matters
Exfiltration channels are creative — image URLs, DNS, package names, commit messages, file names, and browser navigation have all been used in disclosed agent attacks. Email and chat are not the only sinks.
How to implement
Define the full sink set. Apply destination allowlists and information-flow checks uniformly. Scan content for secrets and PII before release. Treat side-channel surfaces (URLs the model constructs, names it chooses) as sinks too.
Verify it's done
Sink-policy fixtures cover each channel and fail closed on disallowed destinations. A canary secret introduced into context does not appear in any external write.
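A uniform sink check can be sketched as below (the AWS-key regex is one example credential pattern; a real deployment would use a full secret-scanning ruleset):

```python
import re

AWS_KEY_PAT = re.compile(r"AKIA[0-9A-Z]{16}")   # example credential pattern

def check_egress(destination, content, allowlist, canary_secrets):
    """Uniform sink check: destination allowlist plus a content scan.
    URLs and names the model constructs are sinks too, so pass them
    through as content."""
    if destination not in allowlist:
        return (False, f"destination-not-allowed:{destination}")
    for secret in canary_secrets:
        if secret in content:
            return (False, "canary-secret-in-egress")
    if AWS_KEY_PAT.search(content):
        return (False, "credential-pattern-in-egress")
    return (True, "ok")
```

Run the same function over every channel in the sink set — email bodies, commit messages, file names, constructed URLs — so no surface gets a weaker check.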
Reserve human approval for risk transitions, not every call
SHOULD
Why it matters
Approval fatigue kills the control. Approvers must see the calls that matter and have the context to decide.
How to implement
Trigger approval on cross-trust-zone moves, durable mutation, code execution, external send, money movement, and privilege escalation. Show the approver the arguments, data origins, capability grant, and downstream effect — not just the tool name.
Verify it's done
An approval-coverage report shows every risk-transition call routed through approval and routine calls auto-allowed. Approver UI screenshots include provenance.
Log full provenance per action
MUST
Why it matters
When something goes wrong, incident response needs to distinguish a model mistake from an injection-driven flow. Without provenance, every incident becomes "the model did a bad thing" and root cause is unrecoverable.
How to implement
For every action, record source data origins, capability grant, policy decision, sink destination, approver (if any), sandbox ID, and a redacted content hash. Make the schema queryable end-to-end.
Verify it's done
A representative trajectory — user request through tool calls to external write — can be reconstructed from logs alone, including which upstream tokens influenced which arguments.
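A sketch of the per-action provenance record (field names are an assumption, not the paper's schema; the content hash stands in for redacted content):

```python
import hashlib
import time

def provenance_record(*, task_id, tool, arg_origins, capability_id,
                      policy_decision, sink, approver=None, sandbox_id=None,
                      content=b""):
    """One queryable record per action: enough to reconstruct, from logs
    alone, which upstream sources influenced which arguments."""
    return {
        "ts": time.time(),
        "task_id": task_id,
        "tool": tool,
        "arg_origins": arg_origins,        # e.g. {"to": "webpage:example.com"}
        "capability_id": capability_id,
        "policy_decision": policy_decision,
        "sink": sink,
        "approver": approver,
        "sandbox_id": sandbox_id,
        "content_sha256": hashlib.sha256(content).hexdigest(),  # redacted hash
    }
```

Trajectory reconstruction is then a join on `task_id` ordered by `ts`, with `arg_origins` answering the "which upstream tokens influenced which arguments" question per step.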
Run adaptive red-team scenarios in CI and pre-prod
NICE
Why it matters
Static fixtures decay. Published benchmarks (AgentDojo, InjecAgent, and their successors) report attack success rates against frontier models high enough that prompt-level hardening alone is empirically inadequate.
How to implement
Maintain a scenario set that exercises indirect-injection patterns and the disclosed MCP exploit classes. Run on every broker or policy change. Treat monitor outputs as advisory; the gating signal is whether deterministic controls held.
Verify it's done
The scenario set has documented pass/fail thresholds. A failing scenario blocks merge to the broker or policy engine.
Acceptance criteria
The checklist is fully implemented when the planner cannot reach any tool, memory write, or external sink except through the broker; when no agent runtime carries inherited host credentials, sockets, or long-lived tokens; when every tool argument is checked against a task-scoped capability and every high-impact call is denied or escalated when an argument carries untrusted-external taint; when generated code runs in a kernel-isolating sandbox with no default network or host secrets; when every external write traverses a sink policy with destination allowlists and IFC checks; and when a representative trajectory can be reconstructed end-to-end from provenance logs alone. A platform that meets these conditions converts indirect prompt injection from authorized action into a denied tool call, a blocked egress attempt, a quarantined memory, or an approval request with full provenance.