AI Security · Checklist · May 1, 2026 · Yellow (detail controls)

Hardening Tool-Using LLM Agents Against Prompt Injection

Quick Answer

This checklist hardens single-agent tool-using LLM deployments — MCP clients, Copilot-style assistants, browsing agents, and code agents — against prompt injection and indirect prompt injection. It targets engineers and security reviewers running pre-launch or quarterly reviews. Use it to audit capability planning, untrusted-content ingestion, tool-call validation, memory and RAG provenance, and detection. Some reproduction details are withheld; defenses are framed at the architectural level.

This checklist hardens a single-agent tool-using LLM deployment — an MCP client, Copilot-style assistant, browsing agent, or code agent — against the confused-deputy failure mode at the heart of indirect prompt injection. The audience is the engineering team shipping the system and the security reviewer running a pre-launch or quarterly review. The checks below treat the agent as a constrained distributed system: the model proposes actions, deterministic components authorize them. Some reproduction-level detail is withheld; defenses are framed architecturally.

Checks: 19 total (9 MUST, 9 SHOULD, 1 NICE)

How to use this checklist

Run it once before launch, then re-run on every release that adds a tool or a data source or swaps the model, and at minimum quarterly. Each check has an owner — typically a platform or agent-runtime engineer — and a verification signal a reviewer can read without privileged access. The reference architecture and empirical baselines are drawn from the source paper on tool-using agent security; read it first if any check below feels arbitrary. "Done" means every MUST is verified in production and every SHOULD has either a verified control or a documented, time-bounded exception.

Capability planning and tool design (4 checks)

Pre-commit the capability manifest before any untrusted content enters context

MUST

Why it matters

Once an untrusted span is in the context window, every downstream "decision" by the model is suspect. The only way to keep authorization decisions trustworthy is to make them before the agent reads attacker-controlled text. This is the tool-precommitment pattern.

How to implement

A trusted planner sees only the user request and developer policy, and emits a structured manifest: allowed tools, parameter scopes, max result counts, allowed destinations, write permissions, and consent thresholds. The runtime treats this manifest as immutable for the session.
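
A minimal sketch of such a manifest in Python; the field names and the frozen-dataclass enforcement are illustrative assumptions, not prescribed by the checklist.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ToolGrant:
    tool: str                    # e.g. "email_search_readonly"
    param_schema: dict           # JSON-schema-style parameter constraints
    max_results: int             # cap on result volume per call
    allowed_destinations: tuple  # per-tool egress allowlist
    may_write: bool              # whether the tool mutates external state
    consent_required: bool       # whether a structured consent dialog gates it

@dataclass(frozen=True)
class CapabilityManifest:
    session_id: str
    grants: tuple                # tuple of ToolGrant; frozen=True keeps it immutable
    committed_at: datetime       # must precede the first untrusted-origin token

def commit_manifest(session_id: str, grants: list) -> CapabilityManifest:
    # Called by the trusted planner, which has seen only the user request
    # and developer policy, never untrusted content.
    return CapabilityManifest(session_id, tuple(grants),
                              datetime.now(timezone.utc))
```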

Verify it's done

Read a session trace and confirm the manifest is fully populated and timestamped before the first untrusted-origin token enters any model's context.

Split tools so no single tool spans a read/write trust boundary

MUST

Why it matters

Generic catch-all tools like gmail_api or github_api collapse the read-private / write-external boundary that egress control depends on. A narrowed tool surface makes the validator's job tractable.

How to implement

Replace broad tools with narrow commands — email_search_readonly, email_draft_reply (no auto-send), email_send_external (step-up auth and DLP). Apply the same split to repos, databases, and chat surfaces.
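
A sketch of the post-split registry invariant, with assumed capability flags; the closing assert mirrors the verification step that follows.

```python
# Illustrative registry after the split; the capability flags are assumptions.
TOOL_REGISTRY = {
    "email_search_readonly": {"reads_private": True,  "external_sink": False},
    "email_draft_reply":     {"reads_private": True,  "external_sink": False},  # no auto-send
    "email_send_external":   {"reads_private": False, "external_sink": True},   # step-up auth + DLP
}

def boundary_violations(registry: dict) -> list:
    # The reviewer's check: no tool may both read private data
    # and emit to an external sink.
    return [name for name, caps in registry.items()
            if caps["reads_private"] and caps["external_sink"]]

assert boundary_violations(TOOL_REGISTRY) == []
```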

Verify it's done

Enumerate the registered tools and confirm no single tool both reads private data and emits to an external sink.

Scope OAuth and service credentials to the per-task minimum

MUST

Why it matters

A tool's blast radius is the union of its credential's scopes. Long-lived broad tokens turn a low-severity injection into account-wide compromise.

How to implement

Issue per-task ephemeral credentials where the upstream supports it; otherwise pin the narrowest stable scope per tool entry. Bind credentials to the manifest, not the agent process.

Verify it's done

For each tool, confirm the granted scopes are no broader than its manifest entry requires, and that grants are reviewed on a fixed cadence.

Forbid the model from selecting or registering new tools at runtime

SHOULD

Why it matters

Runtime tool registration is the cleanest path to tool hijacking: an attacker-controlled span persuades the model to mint a tool that bypasses every other control.

How to implement

Tool addition requires out-of-band human approval and a code-reviewed manifest change. The runtime rejects model-proposed tool registration, including any "install this MCP server" affordance.

Verify it's done

Inject a synthetic "register tool" instruction into a test corpus and confirm the active toolset does not change and that the attempt is logged.

Untrusted-content ingestion and context isolation (4 checks)

Route every untrusted source through a quarantine path with provenance and taint labels

MUST

Why it matters

You cannot enforce a taint-flow policy without taint labels. Ingestion is the only place to apply them cheaply and consistently.

How to implement

Web fetches, email bodies, MCP responses, retrieved chunks, tool outputs, and uploaded files all carry origin, trust, and received_at end-to-end. Labels travel with the span through extraction, summarization, and retrieval.
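
One way the labels might travel, sketched in Python; the span type and the derive helper are hypothetical names, not from the source.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class TaintedSpan:
    text: str
    origin: str        # e.g. "web:fetch", "email:imap", "mcp:<server>"
    trust: str         # "trusted" | "untrusted"
    received_at: datetime
    ingestion_id: str  # traces back to a single ingestion event

def derive(parent: TaintedSpan, new_text: str) -> TaintedSpan:
    # Summaries, extractions, and retrieved copies inherit the parent's
    # labels; taint never drops off through a transformation.
    return TaintedSpan(new_text, parent.origin, parent.trust,
                       parent.received_at, parent.ingestion_id)
```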

Verify it's done

Inspect any production context payload and confirm every span has a provenance tag traceable to a single ingestion event.

Use typed extractors instead of raw-text passthrough wherever the task allows

SHOULD

Why it matters

Raw text from an untrusted source is a payload-delivery surface. Typed extracted fields are not — they cannot carry imperative instructions to the planner.

How to implement

Return {invoice_total, vendor, due_date} rather than the full PDF body; return a parsed thread summary rather than full email HTML. Extractors run in reader roles with no tool authority.
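
A sketch of a reader-role extractor, assuming a simple text invoice format; the regexes and field names are illustrative.

```python
import re
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

@dataclass(frozen=True)
class InvoiceFields:
    invoice_total: Decimal
    vendor: str
    due_date: date

def extract_invoice(raw_text: str) -> InvoiceFields:
    # Reader role: consumes untrusted text, returns typed fields only.
    # Typed fields cannot carry imperative instructions to the planner.
    total = re.search(r"Total:\s*\$?([\d.]+)", raw_text)
    vendor = re.search(r"Vendor:\s*(\S[^\n]*)", raw_text)
    due = re.search(r"Due:\s*(\d{4})-(\d{2})-(\d{2})", raw_text)
    if not (total and vendor and due):
        raise ValueError("required fields missing")  # fail closed, no raw passthrough
    return InvoiceFields(
        invoice_total=Decimal(total.group(1)),
        vendor=vendor.group(1).strip(),
        due_date=date(*map(int, due.groups())),
    )
```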

Verify it's done

Sample production sessions and measure the fraction of untrusted-origin tokens that reach the planner versus stopping at an extractor; track this ratio over releases.

Strip rendered-invisible content at ingestion

SHOULD

Why it matters

Zero-width characters, white-on-white text, hidden DOM nodes, and alt-text-only payloads are routine carriers for indirect injection. They are invisible to humans reviewing the source but fully visible to the model.

How to implement

Normalize ingested HTML, Markdown, and PDF text through a sanitizer that drops or flags hidden content; surface flagged spans to logs rather than to the model.
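
A minimal character-level pass, assuming Python ingestion; white-on-white text and hidden DOM nodes still need an HTML-aware sanitizer on top of this.

```python
import re

# Zero-width and bidi-control code points routinely used to hide payloads.
HIDDEN = re.compile("[\u200b\u200c\u200d\u2060\ufeff\u202a-\u202e\u2066-\u2069]")

def strip_hidden(text: str) -> tuple:
    # Returns (clean_text, flagged_count); flagged spans go to logs,
    # never to the model.
    clean, count = HIDDEN.subn("", text)
    return clean, count
```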

Verify it's done

Feed a known set of hidden-text fixtures through ingestion and confirm they are removed or surfaced with a hidden_content flag.

Separate the model that reads untrusted content from the model that authorizes actions

SHOULD

Why it matters

A single model instance reading attacker-controlled text and choosing tool calls is the textbook confused deputy. Splitting roles makes the privileged decision-maker blind to the payload.

How to implement

Distinct planner / reader / actor roles. The reader has no tool authority; the actor receives only typed extracted data plus the precommitted manifest, never the raw untrusted span.

Verify it's done

Audit the session graph and confirm the actor's context contains no spans labeled trust: untrusted.

Tool-call validation and egress control (4 checks)

Validate every tool call against the manifest in a deterministic policy engine that fails closed

MUST

Why it matters

Validation inside the model is not validation. The model is the thing being attacked. The check has to live somewhere the attacker cannot reach with text.

How to implement

A non-LLM policy engine checks: tool allowed, parameters match schema, destination on allowlist, data volume within limits, taint flow allowed, write requires consent. Unknown conditions deny.
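
A condensed sketch of such an engine; the reason codes and grant fields are assumptions, and the schema check is deliberately toy.

```python
def params_match_schema(params: dict, schema: dict) -> bool:
    # Toy check: every parameter must be declared with the expected type.
    return all(k in schema and isinstance(v, schema[k]) for k, v in params.items())

def validate_call(call: dict, grant: dict | None) -> tuple:
    # Deterministic, non-LLM policy engine; every unknown condition denies.
    if grant is None:
        return False, "tool_not_in_manifest"
    if not params_match_schema(call.get("params", {}), grant["param_schema"]):
        return False, "param_schema_violation"
    dest = call.get("destination")
    if dest is not None and dest not in grant["allowed_destinations"]:
        return False, "destination_not_allowlisted"
    if call.get("result_count", 0) > grant["max_results"]:
        return False, "volume_limit_exceeded"
    if call.get("tainted_inputs") and not grant.get("taint_flow_allowed", False):
        return False, "taint_flow_denied"
    if grant["may_write"] and not call.get("consent_token"):
        return False, "write_without_consent"
    return True, "allowed"
```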

Verify it's done

Run a red-team harness of out-of-manifest calls and confirm every one is denied and logged with a structured reason.

Block exfiltration sinks deterministically, including rendered Markdown images and auto-fetched URLs

MUST

Why it matters

The EchoLeak chain (CVE-2025-32711) showed that a stack of plausible mitigations still produced zero-click exfiltration because rendered content was treated as output, not as egress. Anything the client fetches on the user's behalf is an egress channel.

How to implement

The validator and the rendering layer enforce the egress allowlist on Markdown image sources, link auto-fetches, embed URLs, webhook destinations, and DNS-style subdomain patterns — not just on explicit send tools.
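
A sketch of the rendering-layer half of this control for Markdown image sources; the regex and helper name are illustrative, and a real client must also cover link prefetch, embeds, and webhook destinations.

```python
import re
from urllib.parse import urlparse

MD_IMAGE = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)[^)]*\)")

def enforce_render_egress(markdown: str, allowed_hosts: frozenset) -> str:
    # Rendered Markdown is egress: the client auto-fetches image sources,
    # so non-allowlisted hosts are neutralized before rendering.
    def gate(m: re.Match) -> str:
        host = urlparse(m.group(1)).hostname or ""
        if host in allowed_hosts:
            return m.group(0)
        return "[image removed: host not on egress allowlist]"  # log m.group(1)
    return MD_IMAGE.sub(gate, markdown)
```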

Verify it's done

Run canary-secret tests (see Detection) that try to exfiltrate via image rendering, link prefetch, and proxy paths; confirm none reach the network.

Require explicit, structured user consent for every external send, write, or irreversible action

MUST

Why it matters

Consent prose generated by the model is part of the attack surface, not the defense. The user must see ground-truth structured fields, not a paraphrase the attacker influenced.

How to implement

Generate consent dialogs from runtime structured data — action, destination, data fields, source provenance, whether untrusted content influenced the call — using a non-LLM template. Step-up auth on high-risk classes.
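
A sketch of a runtime-built consent payload; the MAC binding is one possible design choice, not a requirement from the checklist, and the field names are assumptions.

```python
import hashlib, hmac, json, secrets

def build_consent_payload(call: dict, runtime_key: bytes) -> dict:
    # Every field comes from runtime ground truth, not from model prose;
    # a non-LLM template renders it to the user verbatim.
    payload = {
        "action": call["tool"],
        "destination": call.get("destination"),
        "data_fields": sorted(call.get("data_fields", [])),
        "source_provenance": call.get("provenance", []),
        "untrusted_influence": bool(call.get("tainted_inputs")),
        "nonce": secrets.token_hex(8),
    }
    body = json.dumps(payload, sort_keys=True).encode()
    # The MAC lets the validator confirm the call the user approved is
    # byte-identical to the call that executes.
    payload["mac"] = hmac.new(runtime_key, body, hashlib.sha256).hexdigest()
    return payload
```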

Verify it's done

Confirm the consent payload schema is populated by the runtime, not by model output, and that flipping the provenance flag in a test session changes the rendered dialog.

Enforce egress allowlists on outbound HTTP, including allowed corporate proxies

SHOULD

Why it matters

A corporate proxy is still a network egress. Allowlists scoped only to "external" domains miss the proxy path, which is exactly how the EchoLeak chain reached its sink.

How to implement

Allowlists are per-task and scoped by the manifest; any internal proxy that fetches arbitrary URLs is treated as an external destination for allowlist purposes.

Verify it's done

Attempt an out-of-allowlist fetch in a test session and confirm the validator denies it before the request leaves the process.

Memory and RAG provenance (3 checks)

Tag every memory record with source, trust, allowed_uses, and forbidden_uses

MUST

Why it matters

Long-term memory is the durable surface for memory poisoning: a single bad write becomes a behavioral default until evicted. Untrusted-origin records must never be read as instructions.

How to implement

Memory schema includes source, trust, allowed_uses, forbidden_uses. Untrusted-source records default to forbidden_uses: [authorization, tool_selection] and the runtime honors that.
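
A sketch of the schema and the runtime gate, with assumed use-category names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryRecord:
    content: str
    source: str       # e.g. "user:settings", "web:fetch"
    trust: str        # "trusted" | "untrusted"
    allowed_uses: tuple
    forbidden_uses: tuple

def write_record(content: str, source: str, trust: str) -> MemoryRecord:
    if trust == "untrusted":
        # Untrusted-origin memory defaults to evidence-only.
        return MemoryRecord(content, source, trust,
                            allowed_uses=("context_evidence",),
                            forbidden_uses=("authorization", "tool_selection"))
    return MemoryRecord(content, source, trust,
                        allowed_uses=("context_evidence", "preferences"),
                        forbidden_uses=())

def permits(record: MemoryRecord, use: str) -> bool:
    # Enforced by the runtime, not by asking the model.
    return use in record.allowed_uses and use not in record.forbidden_uses
```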

Verify it's done

Sample memory records and confirm schema completeness; attempt to use an untrusted-source record to grant tool authority and confirm the runtime refuses.

Authenticate RAG corpus writers and quarantine new documents before high-risk use

SHOULD

Why it matters

Unauthenticated write paths into a retrieval corpus are how RAG data exfiltration and policy-override attacks land. Public-web indexes and internal-policy corpora cannot share a trust class.

How to implement

Every index write is authenticated and attributed; new documents enter a quarantine state and are excluded from high-risk task retrieval until reviewed or aged.

Verify it's done

Walk the write path for each index and confirm there is no anonymous or service-account-shared ingestion route.

Prevent retrieved chunks from granting tools or overriding policy at the runtime level

SHOULD

Why it matters

The model treats convincing retrieved text as authoritative. The runtime must not.

How to implement

Retrieval returns typed evidence with citations; the validator ignores any tool-permission claim, scope expansion, or policy-override claim that originates in retrieved content. Manifest changes require the planner path, not the retrieval path.

Verify it's done

Inject a test chunk requesting new tool authority and confirm no manifest change occurs and the attempt is logged.

Detection, audit, and adaptive evaluation (4 checks)

Log prompts, tool outputs, proposed calls, validator decisions, and consent prompts with provenance

MUST

Why it matters

Without provenance-linked logs you cannot answer the only question that matters during an incident: which untrusted span caused which action.

How to implement

Structured logs retain prompts, tool I/O, proposed calls, validator allow/deny reasons, consent payloads, and denied actions, with span-level provenance. Retention long enough for incident response.
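
One plausible shape for a validator-decision log line; the field set is an assumption.

```python
import json, sys, time

def log_validator_decision(session_id: str, proposed_call: dict,
                           decision: str, reason: str,
                           provenance_ids: list, sink=sys.stdout) -> None:
    # One line per decision; provenance_ids link the action back to the
    # ingestion events of every span that influenced it.
    sink.write(json.dumps({
        "ts": time.time(),
        "session": session_id,
        "event": "validator_decision",
        "proposed_call": proposed_call,
        "decision": decision,          # "allow" | "deny"
        "reason": reason,              # structured reason code
        "provenance": provenance_ids,
    }) + "\n")
```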

Verify it's done

Run a tabletop where a tester picks an action from a recorded session and traces it back to a single source span using only the logs.

Plant canary secrets and alert on any egress path

SHOULD

Why it matters

Canaries catch exfiltration channels you forgot to enumerate, including the rendered-image and proxy paths the EchoLeak chain abused.

How to implement

Place unique canary tokens in private contexts, memory, and retrieval corpora. Alert on any outbound observation of a canary — including via image fetches, encoded URLs, DNS queries, and proxy paths.
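
A sketch of the outbound scan, with made-up canary values; encoded variants need decoders in front of the match.

```python
import re

# High-entropy tokens planted in private contexts, memory, and indexes;
# the values here are illustrative.
CANARIES = ("cnry_7f3a9d2e41b8", "cnry_0c5512aa9e77")
CANARY_RE = re.compile("|".join(map(re.escape, CANARIES)))

def scan_outbound(url: str, body: str, alert) -> bool:
    # Run on every outbound request: explicit sends, image fetches,
    # link prefetches, and proxy paths. Base64/hex/URL-encoded variants
    # must be decoded before matching.
    if CANARY_RE.search(url) or CANARY_RE.search(body):
        alert(f"canary egress attempt via {url[:120]}")
        return True
    return False
```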

Verify it's done

Confirm scheduled canary-egress tests run against a staging tenant and that alerts fire on each known channel.

Evaluate defenses with adaptive attacks and report utility-under-attack alongside benign utility

SHOULD

Why it matters

Static-payload "0% ASR" claims are not evidence of robustness. AgentDojo, ASB, and MCPTox baselines show realistic systems sit far from zero, and adaptive attackers move the number further.

How to implement

Maintain an internal red-team suite that updates per release; report benign utility, utility-under-attack, and targeted attack success rate. Treat regressions in any of the three as release-blocking.

Verify it's done

Each release has a published evaluation row with all three metrics and a delta against the prior release.

Govern MCP tool metadata as supply chain

NICE

Why it matters

MCPTox and related work show tool descriptions are themselves an injection surface; an unreviewed description update can quietly redefine what a tool does in the model's eyes.

How to implement

Pin MCP server versions by hash, sign manifests, review natural-language description diffs in code review, and prohibit imperative instructions in tool descriptions.
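
A sketch of load-time pin verification plus a cheap imperative-text heuristic; the marker strings are illustrative and no substitute for human review of description diffs.

```python
import hashlib, json

IMPERATIVE_MARKERS = ("you must", "ignore previous", "always call", "do not tell")

def load_pinned_mcp_manifest(manifest_bytes: bytes, pinned_sha256: str) -> dict:
    # Refuse to load a server whose manifest drifted from its reviewed pin;
    # a description change requires code review and a new pin.
    digest = hashlib.sha256(manifest_bytes).hexdigest()
    if digest != pinned_sha256:
        raise RuntimeError(f"MCP manifest drift: {digest} != {pinned_sha256}")
    manifest = json.loads(manifest_bytes)
    for tool in manifest.get("tools", []):
        desc = tool.get("description", "").lower()
        # Cheap heuristic only; the real control is the description review.
        if any(marker in desc for marker in IMPERATIVE_MARKERS):
            raise RuntimeError(f"imperative text in description of {tool.get('name')}")
    return manifest
```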

Verify it's done

Inventory the MCP servers in use and confirm every entry has a pinned hash, a reviewed description diff, and an owner.

Acceptance criteria

The checklist is fully implemented when: every tool call in production is authorized by a non-LLM validator against a manifest committed before untrusted content entered context; no single registered tool spans a read-private / write-external boundary; every untrusted span carries provenance from ingestion through logging; rendered Markdown, auto-fetched URLs, and corporate proxies are governed by the same egress allowlist as explicit send tools; canary secrets in private contexts do not leak under scheduled tests; and each release ships with adaptive-evaluation metrics covering benign utility, utility-under-attack, and targeted ASR. For multi-agent or agent-to-agent topologies, pair this with the multi-agent prompt injection defense checklist, which covers cross-agent trust boundaries this checklist does not.
