Tool-Using LLM Agent Security and Prompt-Injection Defenses

Tool-using LLM agents turn natural-language model outputs into API calls, browser actions, database reads, file edits, emails, payments, code execution, and long-lived memory writes. Prompt injection is the attack class that exploits the missing isolation boundary between instructions and data: an attacker places instruction-shaped content in a channel the agent treats as data—web pages, emails, Slack messages, tool outputs, retrieval documents, MCP tool metadata, PDFs, images, logs, or memory—and the model interprets it as operational authority. This matters now because agent deployments have crossed the threshold from chat to delegated action. OWASP lists prompt injection as LLM01:2025, and Microsoft reports indirect prompt injection as one of the most widely used techniques in AI security vulnerabilities reported to MSRC; the impacts include data exfiltration and unintended actions under the user’s credentials (OWASP LLM01, 2025; Microsoft MSRC, 2025). (genai.owasp.org) The empirical picture is severe: AgentDojo evaluates 97 realistic tasks and 629 security cases and reports GPT-4o at 69.0% benign utility but 57.69% targeted attack success under its strongest evaluated attack without defenses; even its effective tool-filter defense leaves 6.84% targeted ASR and 56.28% utility under attack (AgentDojo, NeurIPS 2024). (arxiv.org) Agent Security Bench covers 10 scenarios, more than 400 tools, 27 attack/defense types, and 13 model backbones, with the highest average attack success rate reported at 84.30% (ASB, ICLR 2025). (arxiv.org) Real incidents have arrived: EchoLeak, assigned CVE-2025-32711, showed zero-click data exfiltration from Microsoft 365 Copilot via a crafted email and a chain of XPIA-filter, Markdown-link, auto-fetch, and CSP-proxy bypasses (EchoLeak, AAAI FSS 2025). (arxiv.org) MCP widened the surface further: MCPTox built 1,312 malicious test cases over 45 live MCP servers and 353 tools, finding tool-poisoning ASR up to 72.8% and refusal rates below 3% for the most refusing model (MCPTox, 2025). (arxiv.org) This paper delivers a threat model, empirical state of the art, defense taxonomy, practical architecture, evaluation playbook, and open research agenda for building safer tool-using agents today.

The Core Security Failure: Authority Collapse in a Text Channel

Prompt injection is not “a bad prompt.” It is an authority-confusion vulnerability. Classical software systems maintain separations between code and data, principals and resources, request and policy, user input and privileged execution. Tool-using LLM agents frequently collapse those separations into one model context. The same next-token engine reads the developer policy, the user’s task, untrusted retrieved text, previous tool observations, tool schemas, hidden metadata, and its own scratchpad. It is then asked to infer which text is binding.

That design produces a new confused-deputy pattern. The user delegates authority to the agent; the agent fetches untrusted content; the untrusted content speaks in the imperative; the agent performs actions using the user’s or service’s credentials. The attacker does not need arbitrary code execution in the conventional sense. They need a path by which natural language becomes a tool call, and a tool call becomes an effect.

The decisive security question is therefore not “can the model recognize malicious text?” It is “what can untrusted text cause?” If untrusted text can cause writes, network egress, credential use, database queries beyond the user’s intent, memory updates, or tool selection changes, the system is vulnerable even when the prompt contains warnings to “ignore malicious instructions.” Microsoft’s public defense guidance makes the same distinction: probabilistic defenses such as hardened prompts, Spotlighting, and classifiers can reduce likelihood, while deterministic defenses block impacts through data governance, explicit consent, and hard restrictions on exfiltration paths (Microsoft MSRC, 2025). (microsoft.com)

The agent setting differs from chatbot jailbreaking in three ways. First, success is externally observable: the attacker’s goal is a tool call, state change, leak, or transaction, not merely a harmful answer. Second, attack payloads often arrive indirectly through channels the victim did not author. Third, the attack can be multi-step: a payload can be retrieved, summarized, stored in memory, propagated through a tool result, or embedded in tool metadata before it triggers. This makes prompt injection a systems-security problem with language-model components, not an NLP moderation problem.

Threat Model and Attack Taxonomy

Principals, Assets, and Trust Boundaries

A useful agent threat model starts with principals:

User principal: the human delegating a task.
Application principal: the agent service, often holding API keys, OAuth tokens, browser session cookies, or cloud IAM permissions.
Tool principal: the service behind a tool, such as email, Slack, GitHub, Salesforce, a shell, a database, or an MCP server.
Content principal: the author of data the agent reads.
Model principal: the LLM provider and model runtime, which may enforce its own safety policy but cannot be treated as an application security boundary.

Assets include secrets in context, private retrieved documents, credentials reachable through tools, integrity of actions, stored memory, audit logs, and the user’s decision authority. Trust boundaries occur at every ingestion and egress edge: user prompt, system prompt, retrieved document, tool response, tool schema, memory load, model output, tool call, network request, and rendered response.

Most deployed failures come from one of three trust-boundary mistakes:

Untrusted content is placed in the same context as trusted instructions without a binding label the model reliably enforces.
The model is allowed to choose or parameterize consequential tools after reading untrusted content.
The runtime treats model intent as authorization instead of validating it against an independent policy.

Direct Prompt Injection

Direct injection occurs when the attacker is the user. It includes instruction override (“ignore previous instructions”), role impersonation (“system: new policy”), prompt extraction, refusal bypasses, and attempts to coerce tool calls. Direct injection is not unique to agents, but agents raise the impact because the model can act. For public-facing agents, direct injection should be treated like hostile user input: expected, frequent, and never a reason to grant authority.

Direct injection is often easier to rate-limit, classify, and observe because the attacker controls the visible user prompt. It becomes dangerous when the same agent also has privileged application credentials. If a customer-support chatbot can only read public FAQ entries, direct prompt injection is mostly an integrity problem. If it can issue refunds, reset MFA, query internal notes, or write CRM records, it becomes an authorization problem.

Indirect Prompt Injection

Indirect injection is the central agent threat. The attacker places instructions in content the victim’s agent will later process: an email, calendar invite, webpage, GitHub issue, PDF, support ticket, spreadsheet, Slack message, search result, retrieved RAG chunk, image, code comment, log line, or tool output. The victim asks a benign question, and the agent retrieves the poisoned content as data. Greshake et al. framed this class in early LLM-integrated applications, and AgentDojo operationalized it for modern tool-calling agents with realistic tasks such as email, e-banking, Slack, and travel bookings (Greshake et al., 2023; AgentDojo, NeurIPS 2024). (arxiv.org)

Indirect injection is dangerous because it breaks the user-consent intuition. The user did not ask the agent to follow the attacker; the user asked it to summarize, search, compare, or book. In EchoLeak, a crafted email could become the delivery vehicle for a zero-click exfiltration chain against Microsoft 365 Copilot, demonstrating that the threat is not hypothetical for enterprise assistants connected to private data (EchoLeak, AAAI FSS 2025). (arxiv.org)

Tool-Output Injection

Tool-output injection is a subtype of indirect injection where the payload is returned by a tool the agent called. A web-browsing tool returns HTML containing hidden instructions. A database tool returns a row with a malicious “description” field. A GitHub tool returns an issue body that tells the agent to read .env and post it elsewhere. Because agents often append tool outputs directly to the conversation history, the payload persists in the reasoning loop.

AgentDojo’s strongest results come from this setting: external tool data hijacks the agent into malicious actions. The benchmark found that more capable models tended to be easier to attack because they were better at executing both the user’s task and the attacker’s task; GPT-4o had 47.69% targeted ASR under the “Important message” attack in the baseline model comparison, and 57.69% ASR in the defense table under the strongest attack setting (AgentDojo, NeurIPS 2024). (papers.nips.cc)

Tool-Metadata and Tool-Selection Attacks

Tool metadata is often treated as trusted configuration: tool names, descriptions, parameter schemas, examples, server manifests, and MCP capability declarations. In agent systems, these strings are not inert. They are prompts. If an attacker can register a tool, modify a tool description, publish a malicious MCP server, or poison a tool library, they can influence the agent before any tool is executed.

ToolHijacker targets the retrieval-and-selection stage of tool choice by injecting malicious tool documents into a tool library, causing the agent to choose the attacker’s tool for target tasks; the paper reports that tested prevention-based and detection-based defenses were insufficient (ToolHijacker, 2025/NDSS 2026). (arxiv.org) MCPTox generalizes this concern for MCP: malicious instructions embedded in tool metadata, without execution, produced high attack success across real-world MCP servers and prominent LLM-agent settings (MCPTox, 2025). (arxiv.org)

Tool-metadata attacks are especially insidious because developers often review runtime outputs but not schema text. MCP and plugin ecosystems turn schema supply chains into prompt supply chains.

Retrieval and Knowledge-Base Poisoning

RAG systems are agent memory without agency by default, but tool-using RAG agents turn retrieved documents into action context. PoisonedRAG showed that poisoning the external knowledge base can steer generated answers for target questions (PoisonedRAG, USENIX Security 2025). (usenix.org) In agent settings, the same technique can steer tool calls: a poisoned policy document can cause the agent to recommend the attacker’s bank account, prefer a malicious package, change a workflow, or leak data via a cited URL.

RAG poisoning differs from one-shot indirect injection in persistence. Once poisoned content enters an index, it can affect many users and sessions until discovered, deleted, and purged from derived caches, embeddings, summaries, and memories. The write path to the corpus is therefore a security boundary. Teams that secure only the model prompt while allowing untrusted ingestion into vector stores are building a delayed injection system.

Memory Poisoning and Deferred Execution

Long-term memory changes the time dimension. An attacker may not need the payload to trigger immediately. They can cause the agent to store a preference, instruction, contact, credential hint, summary, or “lesson” that later influences a privileged task. ASB includes memory poisoning in its benchmark, reflecting that modern agents retrieve and reuse state across tasks (ASB, ICLR 2025). (arxiv.org)

Memory poisoning forces defenders to answer: who can write memory, what provenance is stored, how memory is retrieved, whether memories are executable instructions or inert facts, how users inspect and revoke memory, and whether memory created from untrusted content can influence high-risk actions.

Multimodal and Rendered-Content Injection

As agents process images, screenshots, PDFs, audio, video, spreadsheets, browser DOMs, and rendered email, injection moves beyond visible text. The payload may be white-on-white text, OCR text in an image, hidden PDF layers, alt text, comments, metadata, Unicode control characters, CSS-hidden DOM nodes, or Markdown that renders differently for humans and machines. Microsoft’s guidance explicitly notes that indirect prompt injection can arise from any modality an instruction-tuned model can process, and that plain text is sufficient but not required (Microsoft MSRC, 2025). (microsoft.com)

Rendered-content injection breaks “human review” if the user sees a sanitized or visually different representation than the model sees. Consent dialogs that summarize an action using model-generated text can also be manipulated if the summary is based on poisoned context.

Empirical State of the Art

AgentDojo: Realistic Tool-Calling Security Cases

AgentDojo is the best current benchmark for reasoning about the security-utility tradeoff in realistic agent tasks. It includes 97 user tasks and 629 security test cases across applications such as workspace, banking, Slack, and travel. It reports benign utility, utility under attack, and targeted ASR, which is the right framing: a defense that blocks attacks by making the agent useless is not a deployable defense (AgentDojo, NeurIPS 2024). (arxiv.org)

The headline result is not merely that agents are vulnerable. It is that conventional prompt-level defenses are brittle and that simple architectural constraints help. In the GPT-4o defense evaluation, no defense yielded 57.69% targeted ASR. Delimiting reduced ASR to 41.65%. Repeating the prompt reduced it to 27.82%. A prompt-injection detector reduced ASR to 7.95% but degraded benign utility to 41.49%. Tool filtering reduced targeted ASR to 6.84% while keeping benign utility at 73.13%, but the authors note it fails when tools cannot be planned in advance or when the same tools needed for the user task can carry out the attack; this was true for 17% of their test cases (AgentDojo, NeurIPS 2024). (papers.nips.cc)

The lesson is precise: pre-committing the tool set before reading untrusted data is powerful when task structure allows it. It converts an ambiguous language problem into a capability problem.

ASB: Broader Agent Attack Surface

Agent Security Bench broadens the evaluation beyond indirect prompt injection. It formalizes attacks and defenses across system prompt handling, user prompt handling, tool use, and memory retrieval, spanning 10 scenarios, more than 400 tools, 27 attack/defense methods, and 13 LLM backbones. Its highest reported average ASR is 84.30%, and the paper concludes that current defenses show limited effectiveness across stages of agent operation (ASB, ICLR 2025). (arxiv.org)

ASB is valuable because production agents rarely fail through a single clean channel. A realistic attack may combine direct prompt injection, malicious retrieval content, tool over-permissioning, memory poisoning, and output handling errors. A benchmark that exercises only one step underestimates compound risk.

InjecAgent and Adaptive Attacks

InjecAgent established an early benchmark for indirect prompt injection in tool-integrated agents, including direct-harm and data-stealing attacks (InjecAgent, Findings ACL 2024). (aclanthology.org) The follow-on adaptive-attack work is more important for defense evaluation. Zhan et al. implemented eight defenses—fine-tuned detector, LLM detector, perplexity filtering, instructional prevention, data prompt isolation, sandwich prevention, paraphrasing, and adversarial fine-tuning—and then designed adaptive attacks against them. Across the targeted defenses and two agent types, adaptive attacks achieved ASR above 50%; for Vicuna-7B, several defenses reduced the original 56% ASR substantially, but adaptive strings bypassed them, and detector detection rates dropped nearly to zero in most adaptive settings (Adaptive Attacks, NAACL Findings 2025). (aclanthology.org)

This result should change how vendors and internal teams evaluate prompt-injection defenses. Static test sets are compliance artifacts, not security evidence. A serious evaluation gives the attacker knowledge of the defense and budget to adapt.

“The Attacker Moves Second”: Defense Claims Under Adaptive Pressure

Nasr et al. argue that robustness claims for jailbreak and prompt-injection defenses are often based on weak evaluation: static attack strings or non-adaptive optimization. They systematically tune and scale gradient descent, reinforcement learning, random search, and human-guided exploration against 12 recent defenses, bypassing most with ASR above 90% even when the original papers reported near-zero ASR (The Attacker Moves Second, 2025). (arxiv.org)

The paper does not imply that all defenses are useless. It implies that model-level and filter-level defenses should be treated as probabilistic risk reducers unless their security claim is enforced by a non-model mechanism. A classifier that catches today’s injections may fail tomorrow. A capability boundary that prevents a tool call from carrying data to an unauthorized domain does not care whether the model was persuaded.

ToolHijacker and MCP Benchmarks

ToolHijacker and MCPTox mark a shift from “malicious content returned by tools” to “malicious tools and tool metadata.” ToolHijacker attacks tool selection itself; MCPTox attacks MCP tool metadata over real servers. MCPTox’s 1,312 malicious cases, 45 live servers, 353 tools, 20 agent settings, 72.8% peak ASR, and sub-3% refusal ceiling show that safety alignment trained for harmful content does not reliably catch unauthorized use of legitimate tools (MCPTox, 2025). (arxiv.org)

This matters because MCP-like protocols standardize a path for external parties to place natural-language tool descriptions directly into model context. Tool metadata needs the same governance as code dependencies: review, signing, pinning, least privilege, provenance, and runtime monitoring.

Real-World Vulnerabilities: EchoLeak

EchoLeak is the canonical production case study because it chains multiple individually plausible mitigations and still obtains zero-click exfiltration. The attack delivered a crafted email to Microsoft 365 Copilot, evaded an XPIA classifier, bypassed link redaction with reference-style Markdown, used auto-fetched images, and abused an allowed Microsoft Teams proxy under the content-security policy. The result was remote, unauthenticated data exfiltration through Copilot’s access to enterprise data (EchoLeak, AAAI FSS 2025). (arxiv.org)

EchoLeak’s general lesson is that prompt injection becomes exploitable when three conditions align:

The agent can read attacker-controlled content.
The agent can read sensitive victim-accessible data.
The agent has any egress path—rendered images, links, tool calls, external messages, logs, or covert channels.

Remove any one condition deterministically and the exploit class is reduced. Try to detect every malicious instruction string and the defender remains in an arms race.

Defense Taxonomy

Prompt and Context Shaping

Prompt hardening, delimiters, XML tags, Markdown fences, system reminders, “do not obey tool output,” and repeated user instructions are the most common defenses because they are cheap. They are also the easiest to overstate.

Delimiting helps because it gives the model a pattern for separating data from instructions. AgentDojo found delimiting reduced GPT-4o targeted ASR from 57.69% to 41.65%, a meaningful but insufficient improvement (AgentDojo, NeurIPS 2024). (papers.nips.cc) Sandwiching or repeating the user prompt after tool output performed better in that specific benchmark, reducing ASR to 27.82%, but the authors explicitly note it is unlikely to withstand adaptive attacks that tell the model to ignore later instructions (AgentDojo, NeurIPS 2024). (papers.nips.cc)

Spotlighting is a stronger version of context shaping. It transforms untrusted text using delimiting, datamarking, or encoding so the model can better distinguish external data from instructions. Microsoft’s Spotlighting paper reports reducing ASR from greater than 50% to below 2% in GPT-family experiments with minimal task impact (Spotlighting, 2024). (arxiv.org) Microsoft describes Spotlighting as a probabilistic preventative technique, not a hard guarantee (Microsoft MSRC, 2025). (microsoft.com) That distinction should govern deployment: use Spotlighting, but do not let it authorize money movement, email sends, or external data transfer.

Detection and Filtering

Detection defenses include regexes, prompt-injection classifiers, LLM-as-judge filters, perplexity filters, embedding similarity, activation-based detectors, and output scanners. They can reduce commodity attacks and provide telemetry. They are not sufficient as the primary security boundary.

AgentDojo’s PI detector lowered ASR to 7.95% but severely degraded utility, indicating high false positives in that configuration (AgentDojo, NeurIPS 2024). (papers.nips.cc) Zhan et al.’s adaptive attacks bypassed detector-based defenses, with detection rates dropping nearly to zero in most adaptive cases (Adaptive Attacks, NAACL Findings 2025). (aclanthology.org) PromptArmor reports strong results on AgentDojo—below 1% false positive and false negative rates and ASR below 1% after removing injected prompts—but such claims need adaptive evaluation against attackers aware of the filter and its removal strategy (PromptArmor, 2025; The Attacker Moves Second, 2025). (arxiv.org)

Detection is best used in four places:

Ingress triage: flag obviously hostile user prompts, documents, tool outputs, and tool metadata.
Context purification: strip or summarize untrusted content before it reaches the planner.
Tool-call validation: detect intent drift, suspicious destinations, data volume anomalies, and mismatched parameters.
Incident response: generate alerts, preserve prompts and tool traces, and trigger memory/corpus quarantine.

Detection should never be the only reason a high-risk tool call is allowed.

Model-Level Training

Model-level defenses try to teach the LLM to respect an instruction hierarchy or ignore instructions from lower-trust channels. OpenAI’s instruction hierarchy work trains models to prioritize privileged instructions over user and third-party instructions (Instruction Hierarchy, 2024). (arxiv.org) StruQ trains a model on structured queries that separate prompt and data channels, causing it to follow instructions in the prompt portion but ignore instructions in the data portion (StruQ, USENIX Security 2025). (arxiv.org) SecAlign uses preference optimization over prompt-injected inputs, secure outputs, and insecure outputs, reporting attack success reductions toward 0% in its setting; Meta SecAlign later released open models with built-in prompt-injection defense and evaluated them over utility and security benchmarks (SecAlign, 2024; Meta SecAlign, 2025). (arxiv.org)

These are important advances. They improve the base rate of correct behavior and may make agents much safer under ordinary attacks. But they do not remove the need for runtime authorization. The Attacker Moves Second directly challenges near-zero robustness claims for model and filter defenses under adaptive attack (The Attacker Moves Second, 2025). (arxiv.org) The engineering posture should be: train models to respect hierarchy, then design systems that remain safe when they fail.

Tool Precommitment and Capability Narrowing

Tool precommitment is one of the clearest practical wins. Before reading untrusted content, the agent decides which tools and permissions are necessary for the user’s task. After untrusted content enters context, the runtime only permits those tools. AgentDojo’s tool filter is a simple implementation: it reduced GPT-4o targeted ASR to 6.84% and improved benign utility relative to no defense, but fails when the same tools serve both benign and malicious goals or when the needed tools cannot be known in advance (AgentDojo, NeurIPS 2024). (papers.nips.cc)

Tool precommitment should be combined with parameter constraints. “Allow Gmail” is too broad. “Allow read-only search over messages matching this query, return at most 10 snippets, no attachments, no send, no forward, no external image rendering” is a security boundary.

Capability narrowing includes:

Separate read and write tools.
Separate internal and external send tools.
Use per-task ephemeral credentials.
Scope OAuth tokens to least privilege.
Disable arbitrary URL fetches unless the destination is allowlisted.
Disable shell access by default; if required, run in a sandbox with no secrets and no network.
Limit data volume returned to the model.
Enforce structured outputs and schemas at the runtime, not by prompt.

Information-Flow Control and Taint Tracking

Information-flow control (IFC) is the principled defense family for prompt injection because it models the core issue: untrusted data should not influence privileged control flow or leak trusted secrets to unauthorized sinks.

CaMeL is the best-known concrete architecture. It separates control flow from data flow: a privileged model extracts a program from the trusted user query, while untrusted data is processed in restricted contexts and cannot alter the program flow. CaMeL also uses capabilities to prevent private data exfiltration over unauthorized flows. The authors report solving 67% of AgentDojo tasks with provable security in the arXiv abstract (CaMeL, 2025). (arxiv.org)

The key contribution is not the exact task score. It is the design stance: do not ask the same model instance to both read attacker-controlled data and decide what authority to exercise. Use separate components, typed data, taint labels, and a policy engine that mediates every sink.

A minimal IFC policy for agents tracks labels such as:

trusted_user_instruction
trusted_developer_policy
untrusted_web
untrusted_email
untrusted_tool_metadata
private_user_data
secret_credential
public_output
external_network_sink
internal_write_sink

Then enforce rules:

untrusted_* may influence summaries but not tool authorization.
private_user_data may flow to the user but not to external URLs or third-party messages without explicit authorization.
secret_credential may never flow to model-visible text.
Tool metadata may describe syntax but may not grant new tools or override policy.
Memory derived from untrusted_* cannot be retrieved as instruction.

Human Authorization

Human-in-the-loop is often presented as a fallback. For high-risk actions, it is a required authorization layer. The challenge is making it meaningful.

A useful confirmation dialog must be generated from structured runtime data, not from the potentially compromised model’s prose. It should show:

the exact action,
destination,
data fields to be sent,
source provenance,
risk classification,
whether untrusted content influenced the action,
diff from previous state,
alternative safe action.

Bad consent: “The assistant wants to send the requested information. Allow?”

Good consent: “Send email from alice@company.com to external address eve.example with 3 attachments from SharePoint/Finance/Q4, after reading an email from an external sender that contained instruction-like text. This may disclose confidential data. Allow?”

Human authorization should not be used to compensate for broad credentials. If the agent can exfiltrate through rendered Markdown images before the human sees the response, the consent layer is downstream of the leak.

A Secure Reference Architecture for Tool-Using Agents

This architecture implements five security rules.

Rule 1: Plan Before Exposure

The trusted planner sees the user request and developer policy, but not untrusted content. It produces a task plan and capability manifest: allowed tools, scopes, data classes, maximum rows, allowed domains, write permissions, and consent thresholds. The planner can be an LLM, but its output must be compiled into a deterministic policy.

Example:

json

{
  "task": "summarize_unread_email_from_hr",
  "allowed_tools": [
    {
      "name": "email.search",
      "mode": "read",
      "query_constraints": {
        "from_domain": "company.com",
        "label": "HR"
      },
      "max_results": 20
    }
  ],
  "forbidden_tools": [
    "email.send",
    "http.fetch_external",
    "slack.post"
  ],
  "egress": {
    "external_network": false
  },
  "requires_consent": [
    "any_write",
    "external_recipient",
    "attachment_access"
  ]
}

Rule 2: Quarantine Untrusted Data

Untrusted content should enter a quarantine store with provenance and taint labels. The main planner should not receive raw HTML, raw email bodies, raw MCP descriptions, or raw PDFs unless the task requires it. Use extractors that produce typed data: title, sender, date, plain-text body, links, attachment metadata, visible text, hidden text warnings, and detected instruction-like spans.

For many tasks, the agent does not need raw text. It needs a bounded extraction: “return the invoice total,” “list flight options,” “extract issue title and reproduction steps.” Reducing context is a security control.

Rule 3: Separate Planning, Reading, and Acting

A single model instance should not read untrusted content and then decide high-risk actions. Use separate roles:

Planner: trusted context only; no untrusted data.
Reader/extractor: untrusted data; no tools except local parsing.
Actor: receives typed data and a capability manifest; proposes actions.
Policy engine: deterministic; mediates all actions.
Auditor/critic: optional model or rules engine; no authority to execute.

This mirrors the dual-LLM and design-pattern literature: untrusted input can be processed by quarantined models, but consequential actions require a trusted control path (Design Patterns for Securing LLM Agents, 2025). (arxiv.org)

Rule 4: Validate Tool Calls Outside the Model

Every tool call is untrusted output. Validate:

tool is allowed for this task,
parameters match schema,
destination is allowed,
data volume is within limits,
taint flows are allowed,
action matches user intent,
write requires consent,
irreversible actions require step-up authentication,
tool result will not create a new unbounded capability.

The validator should fail closed. The model may explain why it wants an exception, but it may not grant itself one.

Rule 5: Treat Tool Metadata as Supply Chain

Tool descriptions, MCP manifests, schemas, and examples must be governed:

pin server identities and versions,
sign tool manifests,
review diffs to natural-language descriptions,
strip hidden Unicode and rendered-invisible text,
prohibit imperative instructions in descriptions except syntax guidance,
separate user-visible descriptions from model-facing schemas,
apply least privilege per server,
log every tool added to an agent context,
require approval for tools that can read secrets, write external messages, or execute code.

MCPTox shows that tool metadata itself is an attack surface; production systems should not ingest arbitrary MCP server descriptions into a privileged agent context without review and policy wrapping (MCPTox, 2025). (arxiv.org)

Defense-in-Depth Controls That Work Today

Capability Design

The highest-impact defense is to reduce what a compromised model can do. Build tools as narrow commands, not generic APIs.

Bad:

tool("gmail_api", description="Use Gmail API to read, send, forward, and manage email.")

Better:

tool("email_search_readonly", scopes=["mail.read"], max_results=10)
tool("email_draft_reply", scopes=["mail.draft"], requires_user_send=True)
tool("email_send_external", scopes=["mail.send"], requires_step_up=True, data_loss_policy="strict")

Avoid tools that combine read and write across trust zones. The classic exfiltration pattern requires reading private data and sending it to an attacker. Splitting those capabilities makes the attack need two independently authorized flows.

Egress Control

Block common exfiltration sinks deterministically:

Markdown images and remote media,
auto-fetched URLs,
arbitrary links with encoded data,
external webhooks,
DNS-like subdomain leaks,
public paste or repo writes,
Slack/Teams posts to external or attacker-controlled channels,
email sends and forwards,
tool-call side channels observable by attackers.

EchoLeak demonstrates why egress control must include rendered content and allowed corporate proxies, not only explicit “send” tools (EchoLeak, AAAI FSS 2025). (arxiv.org)

Provenance-Aware Memory

Memory writes should be explicit, typed, and reviewable. Never store raw instructions from untrusted content as future behavioral guidance. A memory record should include:

json

{
  "content": "Alice prefers concise weekly summaries.",
  "type": "user_preference",
  "source": "direct_user_statement",
  "trust": "trusted_user",
  "created_at": "2026-05-01T12:00:00Z",
  "allowed_uses": [
    "formatting"
  ],
  "forbidden_uses": [
    "authorization",
    "tool_selection"
  ]
}

For untrusted sources, default to non-instructional facts:

json

{
  "content": "External webpage claims Vendor X supports SAML.",
  "type": "unverified_external_claim",
  "source": "web",
  "trust": "untrusted",
  "allowed_uses": [
    "answer_with_citation"
  ],
  "forbidden_uses": [
    "credential_request",
    "purchase_decision",
    "tool_authorization"
  ]
}

RAG Corpus Controls

Protect the write path:

authenticate corpus writers,
scan and diff new documents,
store document provenance,
maintain allowlists for high-trust sources,
separate public web indexes from internal policy corpora,
quarantine newly ingested documents before they influence high-risk tasks,
detect embedding outliers and near-duplicate poison clusters,
require citations for retrieved claims,
prevent retrieved content from becoming instructions.

RAG retrieval should return typed evidence, not authority. The system prompt should state this, but the runtime should enforce it by preventing retrieved chunks from granting tools or changing policy.

Sandboxing and Secret Hygiene

If an agent can run code, browse, or use a shell, assume prompt injection will eventually reach that environment. Run it like hostile code:

ephemeral container or VM,
no ambient cloud credentials,
no developer home directory,
no SSH keys,
no production network by default,
egress allowlist,
file-system allowlist,
time and cost limits,
syscall and package-install restrictions,
full transcript and artifact capture.

Do not put secrets in model context. Do not let the model read .env, cloud metadata endpoints, local credential stores, browser cookies, or SSH keys. If a tool needs a credential, the tool should use it server-side and return only the minimum result.

Monitoring and Incident Response

Agent telemetry must be security-grade. Log:

prompts and tool outputs with provenance,
model outputs that propose tool calls,
validator decisions,
denied actions,
consent prompts and user decisions,
egress attempts,
memory writes,
tool manifest changes,
retrieved document IDs and hashes.

Use canary secrets that should never be emitted. Monitor for encoded data in URLs, sudden tool-use drift, unexpected external recipients, high-volume reads followed by writes, repeated validator denials, and tool calls influenced by untrusted content.

Incident response must include memory and corpus cleanup. A poisoned memory or vector chunk can persist after the triggering conversation is deleted.

Evaluation: How to Know Whether a Defense Works

Metrics

Use at least four metrics:

Benign utility: fraction of normal tasks completed correctly.
Targeted ASR: fraction of attacks achieving the attacker’s goal.
Utility under attack: fraction of cases where the user task succeeds without adversarial side effects.
Impact severity: what the attack achieved—read, write, exfiltration, code execution, financial action, privilege escalation.

AgentDojo’s metric split is the right baseline because it prevents defenders from hiding behind either useless security or insecure utility (AgentDojo, NeurIPS 2024). (papers.nips.cc)

Test Static, Adaptive, and System-Level Attacks

A serious evaluation includes:

known prompt-injection strings,
transformed variants: paraphrase, translation, encoding, hidden text, Unicode,
indirect payloads in every source type,
tool-output payloads,
tool-metadata payloads,
RAG poisoning,
memory poisoning,
multimodal payloads,
multi-turn attacks,
adaptive attacks with knowledge of defenses,
exploit-chain tests that combine injection with egress.

If a vendor reports “0% ASR” only on a static dataset, treat it as a starting point. Zhan et al. and Nasr et al. show that adaptive attacks can reverse strong-looking defense results (Adaptive Attacks, NAACL Findings 2025; The Attacker Moves Second, 2025). (aclanthology.org)

Define Security Invariants

For each agent, write invariants that must hold even if the model follows the attacker’s instructions:

The agent never sends private data to an external domain.
The agent never sends email without user approval.
The agent never executes shell commands outside the sandbox.
The agent never reads files outside the project directory.
The agent never installs an MCP server without approval.
Retrieved documents never change tool permissions.
Tool descriptions never override developer policy.
Untrusted content never writes behavioral memory.

Then write tests that attempt to violate each invariant. This shifts evaluation from “did the model seem robust?” to “did the system enforce the boundary?”

What Not to Rely On

System prompts are necessary but not sufficient. They are instructions to the component under attack.

Regex filters catch commodity payloads but fail under paraphrase, encoding, translation, and benign-looking social engineering.

LLM judges can be prompt-injected and can disagree with the protected model.

Output moderation APIs are usually designed for harmful content, not unauthorized use of legitimate tools.

General-purpose “agent firewalls” need scrutiny: where do they sit, what context do they see, do they enforce deterministic policy, how do they handle adaptive attacks, and what is their false-positive cost?

Human review fails if the user sees a different representation than the model, if the summary is model-generated from poisoned context, or if exfiltration occurs before review.

Fine-tuned secure models improve behavior but cannot substitute for least privilege and information-flow control.

Open Problems

Secure General-Purpose Autonomy

The strongest defenses constrain agents: narrow tools, preplanned capabilities, no arbitrary egress, no broad memory, no unsandboxed code. The hardest open problem is preserving useful general-purpose autonomy while maintaining deterministic safety. Agents that can browse arbitrary sites, install tools, write code, access credentials, and communicate externally have too many composition paths for prompt-only defenses.

Robust Adaptive Evaluation

The field needs standardized adaptive evaluation budgets. A defense paper should report static ASR, adaptive ASR, attacker knowledge, optimization budget, model access, transferability, false positives, latency, and utility. Without this, near-zero ASR claims remain difficult to compare.

Tool-Metadata Governance

MCP and tool ecosystems need signed manifests, policy-aware clients, permission prompts that distinguish data and actions, safe schema languages, and scanners for instruction-like metadata. Current tool descriptions are simultaneously documentation, model prompt, and supply-chain artifact.

Practical Information-Flow Control

CaMeL shows a credible direction, but production IFC for agents needs developer-friendly abstractions. Engineers need policy languages, taint-aware data objects, debuggers, logs, and framework integrations that do not require rewriting every agent from scratch.

Memory Safety

Long-term memory needs provenance, taint, expiration, user inspection, revocation, conflict resolution, and non-instructional storage by default. The industry has not converged on safe memory semantics.

Multimodal Injection

Agents increasingly rely on screenshots, PDFs, diagrams, spreadsheets, and audio. Defenses need to reason over visible versus hidden content, OCR provenance, rendered DOM differences, metadata, and model-specific perception gaps.

Measuring Real-World Incidence

We have strong benchmarks and case studies but limited public incident data. Enterprises should publish anonymized injection attempts, defense outcomes, and exploit chains the way they publish malware and phishing intelligence. Without field data, benchmarks risk optimizing for yesterday’s payloads.

Bottom Line for Builders

A secure tool-using LLM agent is not a better prompt wrapped around a powerful model. It is a constrained distributed system in which language models propose actions and deterministic components authorize them. The model should be helpful, trained for instruction hierarchy, and shielded by Spotlighting, structured context, and detectors. But the security boundary must be outside the model: least privilege, precommitted tools, taint tracking, egress control, sandboxing, memory provenance, signed tool metadata, meaningful human consent, and auditability.

The practical rule is simple: after an agent has read untrusted content, it must not be able to perform consequential actions unless an independent policy engine proves that the action is within the user’s original intent and allowed information flows. This is the difference between hoping the model resists prompt injection and engineering a system that remains safe when it does not.

References

Chen, Sizhe; Piet, Julien; Sitawarin, Chawin; Wagner, David. “StruQ: Defending Against Prompt Injection with Structured Queries.” USENIX Security, 2025. https://www.usenix.org/conference/usenixsecurity25/presentation/chen-sizhe
Chen, Sizhe; Zharmagambetov, Arman; Wagner, David; Guo, Chuan. “Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks.” arXiv, 2025. https://arxiv.org/abs/2507.02735
Debenedetti, Edoardo; Zhang, Jie; Balunović, Mislav; Beurer-Kellner, Luca; Fischer, Marc; Tramèr, Florian. “AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents.” NeurIPS Datasets and Benchmarks, 2024. https://arxiv.org/abs/2406.13352
Debenedetti, Edoardo; Shumailov, Ilia; Fan, Tianqi; Hayes, Jamie; Carlini, Nicholas; Fabian, Daniel; Kern, Christoph; Shi, Chongyang; Terzis, Andreas; Tramèr, Florian. “Defeating Prompt Injections by Design.” arXiv, 2025. https://arxiv.org/abs/2503.18813
Greshake, Kai; Abdelnabi, Sahar; Mishra, Shailesh; Endres, Christoph; Holz, Thorsten; Fritz, Mario. “Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.” arXiv, 2023. https://arxiv.org/abs/2302.12173
Hines, Keegan; Lopez, Gary; Hall, Matthew; Zarfati, Federico; Zunger, Yonatan; Kiciman, Emre. “Defending Against Indirect Prompt Injection Attacks With Spotlighting.” arXiv / Microsoft Research, 2024. https://arxiv.org/abs/2403.14720
Microsoft Security Response Center. “How Microsoft Defends Against Indirect Prompt Injection Attacks.” MSRC Blog, 2025. https://www.microsoft.com/en-us/msrc/blog/2025/07/how-microsoft-defends-against-indirect-prompt-injection-attacks/
Nasr, Milad; Carlini, Nicholas; Sitawarin, Chawin; Schulhoff, Sander V.; Hayes, Jamie; Ilie, Michael; Pluto, Juliette; Song, Shuang; Chaudhari, Harsh; Shumailov, Ilia; Thakurta, Abhradeep; Xiao, Kai Yuanqing; Terzis, Andreas; Tramèr, Florian. “The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections.” arXiv, 2025. https://arxiv.org/abs/2510.09023
OWASP GenAI Security Project. “LLM01:2025 Prompt Injection.” OWASP Top 10 for LLM Applications, 2025. https://genai.owasp.org/llmrisk/llm01-prompt-injection/
Piet, Julien; Sitawarin, Chawin; Chen, Sizhe; Wagner, David. “SecAlign: Defending Against Prompt Injection with Preference Optimization.” arXiv, 2024. https://arxiv.org/abs/2410.05451
Reddy, Pavan; Gujral, Aditya Sanjay. “EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit in a Production LLM System.” AAAI Fall Symposium Series, 2025. https://arxiv.org/abs/2509.10540
Shi, Jiawen; Yuan, Zenghui; Tie, Guiyao; Zhou, Pan; Gong, Neil Zhenqiang; Sun, Lichao. “Prompt Injection Attack to Tool Selection in LLM Agents.” arXiv / NDSS, 2025/2026. https://arxiv.org/abs/2504.19793
Wallace, Eric; Xiao, Kai; Leike, Reimar; Weng, Lilian; Heidecke, Johannes; Beutel, Alex. “The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions.” arXiv, 2024. https://arxiv.org/abs/2404.13208
Wang, Zhiqiang; Gao, Yichao; Wang, Yanting; Liu, Suyuan; Sun, Haifeng; Cheng, Haoran; Shi, Guanquan; Du, Haohua; Li, Xiangyang. “MCPTox: A Benchmark for Tool Poisoning Attack on Real-World MCP Servers.” arXiv, 2025. https://arxiv.org/abs/2508.14925
Zhan, Qiusi; Liang, Zhixiang; Ying, Zifan; Kang, Daniel. “InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents.” Findings of ACL, 2024. https://aclanthology.org/2024.findings-acl.624/
Zhan, Qiusi; Liang, Zhixiang; Ying, Zifan; Kang, Daniel. “Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents.” NAACL Findings, 2025. https://aclanthology.org/2025.naacl-findings.395/
Zhang, Hanrong; Huang, Jingyuan; Mei, Kai; Yao, Yifei; Wang, Zhenting; Zhan, Chenlu; Wang, Hongwei; Zhang, Yongfeng. “Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-Based Agents.” ICLR, 2025. https://arxiv.org/abs/2410.02644
Zou, Andy; Wang, Zifan; Carlini, Nicholas; Nasr, Milad; Kolter, J. Zico; Fredrikson, Matt. “PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models.” USENIX Security, 2025. https://www.usenix.org/conference/usenixsecurity25/presentation/zou-poisonedrag