Prompt Injection, Tool Hijacking, and Data Exfiltration Defenses in RAG/Agent Systems
Model Version
Securing the Reasoning Layer: Prompt Injection, Tool Hijacking, and Data Exfiltration in RAG/Agent Systems
Large language models integrated with retrieval systems, tool-use frameworks, and autonomous agent architectures have introduced a fundamentally new class of security vulnerabilities that no existing defense fully solves. Prompt injection—ranked the #1 threat in the OWASP Top 10 for LLM Applications 2025—remains an open problem analogous to the early days of SQL injection, but without the equivalent of parameterized queries. The attack surface has expanded dramatically since 2023: indirect prompt injection via retrieved documents achieves 90% attack success rates with just five poisoned texts among millions (PoisonedRAG, USENIX Security 2025); tool hijacking through the Model Context Protocol reaches >72% success against 20 LLM agents (MCPTox, 2025); and side-channel exfiltration techniques have yielded critical CVEs against Microsoft 365 Copilot, GitHub Copilot, and Google Gemini throughout 2024–2025. This article provides a comprehensive technical analysis of these threat classes, evaluates the current state of defenses, and proposes a defense-in-depth architecture grounded in the latest academic research and real-world incident data.
The threat taxonomy spans three interconnected attack surfaces
The attack landscape against RAG and agentic AI systems can be decomposed into three primary threat classes, each with distinct attack vectors, prerequisites, and impact profiles. Understanding their interactions is critical—real-world exploits frequently chain techniques across categories.
Simon Willison's "Lethal Trifecta" framework (June 2025) captures the fundamental vulnerability condition: any system combining (1) access to private data, (2) exposure to untrusted content, and (3) an exfiltration vector is vulnerable to data theft. Meta extended this with the "Agents Rule of Two" (October 2024), adding state-changing tool access as a critical risk property. These frameworks map directly to the OWASP LLM Top 10 2025 categories, where prompt injection (LLM01), sensitive information disclosure (LLM02), excessive agency (LLM06), system prompt leakage (LLM07), and vector/embedding weaknesses (LLM08) represent the five most relevant entries. MITRE ATLAS catalogs these under techniques AML.T0051 (prompt injection), AML.T0053 (plugin compromise), AML.T0024 (exfiltration via inference API), and AML.T0057 (LLM data leakage), with 14 new agent-focused techniques added in October 2025.
NIST AI 600-1 (July 2024), the Generative AI profile of the AI Risk Management Framework, identifies information security as one of 12 risk categories, explicitly calling out data poisoning and prompt injection. It recommends red-teaming, content provenance tracking, and pre-deployment testing as core mitigations—guidance that, while directionally correct, remains at a higher level of abstraction than the specific technical controls this article examines.
Prompt injection exploits the fundamental inability of LLMs to separate code from data
Prompt injection is the foundational vulnerability from which most other attacks derive. It exists because LLMs process instructions and data in the same channel—there is no hardware-enforced separation between "code" (system prompts, instructions) and "data" (user input, retrieved content, tool outputs). Every token in the context window competes for the model's attention on equal footing.
Direct injection: the attacker is the user
Direct prompt injection occurs when a user crafts input to override system-level instructions. The HackAPrompt competition (Schulhoff et al., EMNLP 2023, Best Theme Paper) documented 29 distinct attack techniques across 600,000+ adversarial prompts, establishing a comprehensive taxonomy. The "DAN" (Do Anything Now) jailbreak family against ChatGPT demonstrated that role-play framing consistently bypasses content filters. Tensor Trust (Toyer et al., NeurIPS 2023) contributed 563,000 attacks and 118,000 defenses from an online game, providing the first dataset with both human-generated attacks and defenses.
More sophisticated direct attacks include context overflow (padding input to push system prompts out of effective context), instruction hierarchy impersonation (prefixing injections with [SYSTEM] or <Admin> tags), and multi-turn escalation (gradually shifting model behavior across conversation turns). The Chevrolet dealership chatbot incident (December 2023) illustrated the real-world impact: a two-step injection caused the ChatGPT-powered bot to agree to sell a $76,000 Tahoe for $1, declaring it "a legally binding offer—no takesies backsies."
Indirect injection: the data layer becomes the attack surface
Indirect prompt injection, formalized by Greshake, Abdelnabi, Mishra, Endres, Holz, and Fritz (ACM AISec 2023, Black Hat US 2023), represents the more consequential threat class. Adversaries inject malicious instructions into content that an LLM will process: web pages, emails, documents in RAG knowledge bases, calendar invitations, code comments, and tool outputs. The LLM cannot distinguish these embedded instructions from legitimate ones.
Real-world exploitation has been extensive. The Bing Chat sidebar attack (2023) used CSS-hidden text in web pages to manipulate search results and extract data through Edge's AI sidebar. The Google Gemini Gmail attack (2025, discovered by Marco Figueroa of 0DIN) embedded white-on-white HTML text with <Admin> tags in emails; when users clicked "Summarize this email," Gemini generated fake security warnings with phishing phone numbers. GeminiJack (Noma Security, August 2025) demonstrated zero-click exfiltration across Gmail, Calendar, and Docs via a single malicious shared document, extracting years of email correspondence. Google confirmed the vulnerability and deployed mitigations by November 2025.
Payload obfuscation techniques continue to evolve. A 2025 study (arXiv:2504.11168) tested character injection, Unicode tag smuggling, zero-width characters, and homoglyph substitution against six commercial guardrails, achieving up to 100% evasion—including 71.98% bypass of Azure Prompt Shield and 70.44% bypass of Meta Prompt Guard. The PAPILLON fuzzer (USENIX Security 2025, Gong et al.) generates semantically coherent jailbreak prompts maintaining low perplexity, causing less than 10% attack success rate drop even with perplexity filtering active. AutoDAN uses genetic algorithms to produce stealthy prompts that bypass perplexity-based defenses entirely.
The ChatInject attack (2025) exploits LLM chat template parsing, improving indirect injection success from 5.18% to 32.05% on AgentDojo and from 15.13% to 45.90% on InjecAgent. PoisonedRAG (Zou et al., USENIX Security 2025) achieves 90–99% attack success by injecting just 5 malicious texts per target question into a knowledge database with millions of entries, formulating corpus poisoning as an optimization problem with retrieval and generation conditions.
The defense landscape: probabilistic mitigations without complete solutions
No defense provides complete protection against prompt injection. The UK National Cyber Security Centre has stated this "may simply be an inherent issue with LLM technology." Defenses fall into several categories, each with documented limitations.
Instruction hierarchy enforcement. Wallace, Xiao, Leike, Weng, Heidecke, and Beutel (OpenAI, April 2024, arXiv:2404.13208) proposed training LLMs to enforce explicit priority ordering: System Message > User Message > Third-Party Content. Applied to GPT-3.5, this achieved a 63% improvement in defense against system prompt extraction and generalized to unseen attack types. However, security researcher Johann Rehberger demonstrated bypasses on GPT-4o-mini within days of its launch (July 2024). The Control Illusion paper (AAAI 2026, arXiv:2502.15851) further showed that models exhibit greater sensitivity to implicit societal hierarchies (CEO vs. intern) than to explicit system/user role designations, revealing fundamental limitations. Instructional Segment Embeddings (ISE, ICLR 2025) provides architecture-level modifications using separate positional embeddings for different instruction segments, achieving up to 15.75% robustness improvement.
Spotlighting and datamarking. Hines, Lopez, Hall, Zarfati, Zunger, and Kıcıman (Microsoft Research, 2024, arXiv:2403.14720, CAMLIS 2024) developed three modes of marking untrusted content: delimiting (randomized text delimiters), datamarking (inserting special tokens between every word), and encoding (Base64/ROT13 transformation). Results were striking: attack success rates dropped from >50% to below 2%, with encoding achieving 0.0% ASR on summarization and Q&A tasks, and negligible degradation on SQuAD, IMDB, and SuperGLUE benchmarks. This remains one of the most effective single-layer defenses available.
LLM-based guardrails. Llama Guard (Meta) achieves approximately 67% detection on custom prompt injection categories but is vulnerable to mutated prompts. NeMo Guardrails (NVIDIA) provides programmable conversation flow control via the Colang DSL with 89% accuracy, but its jailbreak detection component showed 65.22% vulnerability to evasion. Constitutional Classifiers (Anthropic, Sharma and Perez et al., January 2025, arXiv:2501.18837) reduced jailbreak success from 86% to 4.4% with a 0.38% false positive rate, though at 23.7% compute overhead. The follow-up Constitutional Classifiers++ (CC++, January 2026, arXiv:2601.04603) introduced a two-stage architecture cutting compute overhead to ~1% while maintaining effectiveness across 1,700+ hours of red teaming.
SecAlign (Chen et al., ACM CCS 2025) represents a training-time defense using preference optimization, becoming the first method to reduce prompt injection success rates to <10% even against attacks more sophisticated than the training data. However, "The Attacker Moves Second" (2025) demonstrated that 12 published defenses can be bypassed with >90% attack success using adaptive attacks, with human red-teaming scoring 100% bypass rates.
PromptArmor (Shi et al., July 2025) achieves near-zero ASR (0–0.47%) on AgentDojo using an off-the-shelf LLM to detect and remove injections before processing, with utility up to 76.35%. A multi-layered defense framework (Balaji et al., November 2025, arXiv:2511.15759) combining multiple techniques reduced attack success from 73.2% to 8.7% across an 847-case benchmark—demonstrating that layered composition of imperfect defenses provides substantial protection.
Tool hijacking turns agent capabilities into attack vectors
The emergence of tool-using LLM agents—accelerated by Anthropic's Model Context Protocol (MCP, November 2024)—has created attack surfaces that extend well beyond text manipulation. When an attacker can influence which tools an agent calls, what arguments it passes, or how it interprets tool outputs, they effectively control the agent's real-world actions.
MCP server poisoning and supply chain attacks
The MCP ecosystem has already experienced significant real-world compromises. Trend Micro discovered an unpatched SQL injection vulnerability in Anthropic's reference SQLite MCP server—forked 5,000+ times before archival on May 29, 2025. The vulnerability allowed attackers to seed malicious prompts via SQLi, which were then executed by agents retrieving data. A supply chain attack via a malicious "Postmark MCP Server" package silently BCC'd all email communications to an attacker's address. An Asana MCP server bug enabled cross-tenant data leakage between organizations.
CVE-2025-6514 (CVSS 9.6), discovered by JFrog, affected mcp-remote versions 0.0.5–0.1.15 (437,000+ downloads, featured in integration guides from Cloudflare, Hugging Face, and Auth0). Malicious MCP servers could send crafted authorization_endpoint URLs triggering arbitrary OS command execution on client machines. CVE-2025-49596 (CVSS 9.4) in MCP Inspector enabled unauthenticated RCE through a CSRF + 0.0.0.0 browser vulnerability in the inspector-proxy architecture.
Tool description injection: invisible instructions in metadata
Invariant Labs (April 2025) first identified Tool Poisoning Attacks (TPAs), where malicious instructions are embedded in MCP tool descriptions visible to the LLM but hidden from users in simplified UIs. A demonstration on Cursor showed a seemingly innocent add(a, b) tool whose docstring contained hidden instructions to read ~/.cursor/mcp.json (containing credentials), read ~/.ssh/id_rsa (SSH private keys), transmit data via a hidden parameter, and mask exfiltration behind mathematical explanations.
The MCPTox benchmark (2025, arXiv:2508.14925)—the first systematic evaluation of MCP tool poisoning—tested 45+ real-world MCP servers across 8 domains with 1,312 malicious test cases. Across 20 LLM agents, the highest attack success rate exceeded 72%, while the maximum safety refusal rate was below 3%, demonstrating that current alignment training is ineffective against tool poisoning. ToolHijacker (arXiv:2504.19793, 2025) showed that injecting a single malicious tool document into ToolBench's 9,650 benign tools achieved 96.1% hijack rate with gradient-free attacks and 97.8% with gradient-based attacks.
Cross-server shadowing and confused deputy attacks
Cross-server shadowing, documented by Invariant Labs, is particularly insidious: when multiple MCP servers connect to the same client, a malicious server's tool description can override behavior of tools from other trusted servers. A bogus add tool's description redirected all emails sent via a separate, trusted email server to an attacker's address. The agent complied even when the user explicitly specified a different recipient. The malicious tool never needs to be executed—it only needs to be loaded into the agent's context.
The GitHub MCP server attack (May 2025) demonstrated a confused deputy: a prompt injection payload hidden in a public GitHub issue hijacked an AI assistant connected via the official GitHub MCP server, causing it to pull data from private repositories and leak it to a public repo via pull request. The root cause was an over-privileged Personal Access Token combined with the LLM's inability to distinguish attacker instructions from legitimate task context.
Rug pull attacks exploit temporal trust: a malicious server initially provides legitimate functionality, builds user trust, then silently modifies tool definitions. Most MCP clients treat tool definitions as static after initial load and do not re-notify users of changes.
Defenses for tool hijacking
The MCP specification's security model has been criticized for its "fundamentally optimistic trust model" that equates syntactic correctness with semantic safety. Key limitations include no built-in authentication, no cryptographic tool definition signing, no change detection, and session IDs placed in URL query strings.
ETDI (Enhanced Tool Definition Interface) (Bhatt, Narajala, and Habler, 2025, arXiv:2506.01333) proposes cryptographic identity verification, immutable versioned tool definitions, and OAuth 2.0-enhanced permission management. A PR was submitted to the MCP Python SDK but was redirected to specification-level discussion. MCP-Scan (Invariant Labs) provides open-source scanning for tool poisoning, rug pulls, and cross-origin escalations. MindGuard (Wang et al., 2025, arXiv:2508.20412) uses LLM attention mechanisms to track tool invocation decisions via a "Decision Dependence Graph," achieving 94–99% precision in detecting poisoned invocations with sub-second processing and no additional token cost.
The OWASP Top 10 for Agentic Applications (released December 10, 2025, developed by 100+ researchers) codifies these risks across 10 categories: Agent Goal Hijack (ASI01), Tool Misuse (ASI02), Excessive Agency (ASI03), Delegated Trust (ASI04), cascading failures, human-agent trust exploitation, rogue agents, memory/context poisoning, supply chain compromise, and observability gaps. Practical controls include least-privilege tool access with scoped ephemeral credentials, human-in-the-loop with full argument visibility for sensitive operations, sandboxed execution environments, and runtime tool definition pinning via cryptographic hashes.
Data exfiltration transforms AI assistants into surveillance tools
Data exfiltration attacks exploit the information asymmetry between what an LLM can access and what it should reveal. They transform AI assistants into unwitting exfiltration agents, extracting system prompts, training data, retrieval corpus contents, and user-specific data through side channels invisible to the end user.
Markdown image rendering: zero-click exfiltration
The most consistently exploited exfiltration channel is markdown image rendering. When an LLM outputs  and the client renders it, the browser issues a GET request to the attacker's server carrying the encoded payload—no user interaction required. Johann Rehberger (Embrace The Red) has documented this class across virtually every major LLM product since mid-2023: ChatGPT, Google Bard/Gemini, Google AI Studio, Microsoft Copilot, Slack AI, GitLab Duo, GitHub Copilot Chat, xAI Grok, Claude iOS, and ChatGPT Operator.
CVE-2025-32711 ("EchoLeak"), discovered by Aim Security, demonstrated zero-click exfiltration from Microsoft 365 Copilot. External emails with hidden instructions caused Copilot to embed sensitive data in markdown image URLs auto-fetched by the client, bypassing the XPIA classifier, external-link redaction, and Content Security Policy. CVE-2025-53773 (CVSS 9.6) enabled remote code execution in GitHub Copilot through prompt injection in code comments, leading GitHub to completely disable image rendering in Copilot Chat (August 14, 2025).
ASCII smuggling: invisible data in clickable links
ASCII smuggling uses Unicode characters from the Tags block (U+E0000–U+E007F) that mirror ASCII but are invisible in user interfaces. Rehberger disclosed the full exploit chain against Microsoft 365 Copilot in early 2024, demonstrating how injected prompts in emails caused Copilot to encode sensitive data (including MFA codes) into benign-looking hyperlinks using invisible characters. Microsoft patched by disabling link rendering. FireTail (2025–2026) reported that Gemini, DeepSeek, and Grok remain susceptible to ASCII smuggling, with Google reportedly dismissing the threat as "social engineering."
Training data extraction at scale
Carlini et al. established the foundational work on training data extraction (USENIX Security 2021), demonstrating verbatim extraction of PII, code, and 128-bit UUIDs from GPT-2. The follow-up by Nasr, Carlini et al. (2023, ICLR 2025) introduced two devastating attacks against production models. The divergence attack caused ChatGPT to emit training data at 150× the normal rate by prompting it to repeat a single word indefinitely, with approximately 3% of divergent output being verbatim copies from the internet. The fine-tuning attack used OpenAI's fine-tuning API to "undo" alignment. Together, these extracted >10,000 unique verbatim training examples from gpt-3.5-turbo using only $200 USD in API costs, including PII of dozens of individuals, NSFW content, Bitcoin addresses, and code. Current alignment techniques do not eliminate memorization.
Embedding inversion recovers original text from vectors
Morris et al. (EMNLP 2023, "Text Embeddings Reveal (Almost) As Much As Text") developed Vec2Text, recovering 92% of 32-token text inputs exactly from GTR-base embeddings and successfully extracting full patient names from clinical notes in the MIMIC-III dataset. This directly threatens RAG systems storing document embeddings in vector databases like Pinecone, Weaviate, or Milvus. Follow-up work established transferable black-box attacks (ACL 2024) and multilingual inversion (ACL 2024). A 2025 reproducibility study (ACM RecSys 2025) confirmed the results and proposed quantization-based defense as a lightweight, hyperparameter-free alternative to Gaussian noise addition, preserving retrieval quality (nDCG@10 stable) while significantly degrading inversion capability.
System prompt extraction: near-perfect against short prompts
PLeak (Hui et al., ACM CCS 2024) optimizes adversarial queries using shadow LLMs and gradient-based methods, successfully transferring attacks to production platforms including apps on Poe. Raccoon (Wang et al., ACL Findings 2024) benchmarks 14 attack categories—summarizer attacks, context resets, role-play exploitation, translation-based extraction, encoded output, and multi-part attacks—finding GPT-4 extraction rates of approximately 83.5%. SPE-LLM (Das et al., 2025) confirmed that shorter system prompts are more vulnerable, with near-perfect extraction achievable. ProxyPrompt (2025, arXiv:2505.11459) proposes replacing real system prompts with obfuscated proxies that preserve utility but resist extraction.
Defenses against data exfiltration
Effective exfiltration defense requires multiple complementary layers. Output filtering should strip or neutralize markdown images, HTML, and URLs from LLM outputs—or proxy all images server-side, breaking the direct browser-to-attacker link. URL allowlisting restricts rendered content to verified domains. GitHub's decision to completely disable image rendering in Copilot Chat represents the most aggressive approach.
Retrieval access control enforces document-level permissions in RAG systems. Pre-filter approaches query authorization systems (e.g., SpiceDB, used by OpenAI for 37 billion documents and 5 million users in ChatGPT Connectors) before retrieval; post-filter approaches verify permissions after retrieval but before LLM processing. Metadata-based RBAC tags each document chunk with access roles during indexing. The fundamental challenge is the "isolation paradox"—aggressive partitioning for security degrades cross-domain semantic search value.
Differential privacy for embeddings adds calibrated noise: ϕ_noisy(x) = ϕ(x) + λ·ε, ε~N(0,1). Quantization-based approaches offer a hyperparameter-free alternative. PII detection and redaction should operate at both ingestion and output stages, using BERT-based NER models (as in LLM Guard) or Microsoft Purview sensitivity labels. The Samsung ChatGPT leak (March–April 2023)—where three semiconductor engineers leaked source code and meeting transcripts, prompting a company-wide ban on generative AI tools—illustrates the consequences when output-stage controls are absent.
Defense-in-depth requires nine distinct security layers
No single defense is sufficient. The research consistently shows that individual techniques can be bypassed, but layered composition of imperfect defenses provides substantial protection—reducing attack success from 73.2% to 8.7% in controlled studies. The following architecture implements defense at every stage of the RAG/agent pipeline.
Architectural patterns that provide structural security
The most promising architectural defenses shift security guarantees from probabilistic LLM behavior to deterministic system design. CaMeL (CApabilities for MachinE Learning) (Debenedetti et al., Google DeepMind/ETH Zurich, March 2025, arXiv:2503.18813) represents the most significant advance: it wraps the LLM in a capability-based security layer that tracks data provenance using a custom Python interpreter, ensuring untrusted data can never influence control flow. CaMeL solved 77% of AgentDojo tasks with provable security (versus 84% undefended)—a meaningful gap, but one that buys deterministic guarantees rather than probabilistic hopes. Both Simon Willison and Bruce Schneier endorsed the approach.
Type-Directed Privilege Separation (Jacob et al., Google DeepMind, 2025, arXiv:2509.25926) refines the Dual LLM pattern by allowing data to flow from quarantined to privileged LLMs only through restricted types (integers, booleans, enums) that cannot carry prompt injection, achieving zero attack success across three case studies. Prompt Flow Integrity (PFI, 2025, arXiv:2503.15547) provides mandatory access control preventing privilege escalation analogous to the confused deputy problem. Debenedetti et al. formalized six design patterns in "Design Patterns for Securing LLM Agents" (June 2025, arXiv:2506.08837): action-selector, plan-then-execute, LLM map-reduce, dual LLM, code-then-execute, and context-minimization—each offering different security-flexibility tradeoffs.
Monitoring, anomaly detection, and incident response
Real-time monitoring must track tool call sequences, data flow graphs, and behavioral drift. MELON (OpenReview, 2025) detects attacks by re-executing agent tasks with a masked user prompt and comparing actions—divergence indicates injection. DataSentinel (Liu et al., IEEE S&P 2025) applies game-theoretic modeling to prompt injection detection. SEAgent (2025, arXiv:2601.11893) implements mandatory access control using attribute-based policies, monitoring agent-tool interactions via an information flow graph.
Rate limiting should operate at multiple granularities: per-user API rate limits, token-based limits to prevent many-shot jailbreaking, and graduated friction for suspicious patterns. Audit logs must capture every prompt, response, tool call, and retrieved document with sensitivity tags and access metadata. Auto-expiring sensitive context after a configurable number of turns limits the window for multi-turn attacks. Operation Bizarre Bazaar (Pillar Security, December 2025–January 2026)—35,000 attack sessions targeting exposed LLM/MCP endpoints, the first documented systematic campaign with commercial monetization—demonstrates that production monitoring and rapid incident response are not optional.
What remains unsolved: the frontier of AI security research
Formal verification faces fundamental barriers
Traditional formal verification is currently infeasible for LLMs due to their probabilistic, non-deterministic nature. The most promising workarounds shift the verification target. CaMeL verifies the abstraction boundary between LLMs and tools, making the LLM's internal behavior irrelevant to security guarantees. Certified Robustness for RAG (Xiang, Wu, Zhong, Wagner, Chen, and Mittal, 2024, arXiv:2405.15556) provides provable bounds on attacker influence under limited document poisoning. Authenticated Prompts (arXiv:2602.10481, February 2026) proposes cryptographic primitives embedding lineage in prompts with tamper-evident chains and formal policy algebra, claiming 100% detection with zero false positives under its threat model. SmoothLLM (Robey et al., 2023) provides provable guarantees through random perturbation and aggregation, but only against certain attack classes.
The arms race is structurally asymmetric
"The Attacker Moves Second" (2025) formalized the core challenge: defensive evaluations that appear strong in static benchmarks collapse under adaptive attacks. Twelve published defenses were bypassed at >90% rates. Human red-teaming achieved 100%. This asymmetry—where defenders must protect against all possible attacks while adversaries need only find one bypass—is structural and unlikely to be resolved by incremental improvements to detection. The most promising research direction is architectural: making security guarantees independent of LLM behavior, as CaMeL, type-directed privilege separation, and authenticated prompts attempt to do.
Multi-agent systems introduce qualitatively new risks
Multi-agent architectures create attack surfaces that single-agent defenses cannot address. Agent Smith (Gu et al., NeurIPS 2024) demonstrated that a single malicious image can jailbreak one million multimodal agents exponentially fast through inter-agent propagation. Hammond et al. (2025, arXiv:2502.14143) catalog risks from inter-agent communication, shared knowledge bases, and emergent coordination protocols. G-Safeguard (Wang et al., 2025, arXiv:2502.11127) applies topology-guided analysis to multi-agent security. No formal trust frameworks exist for multi-agent delegation—authorization boundaries, escalation control, and failure containment remain open problems.
Automated red teaming is advancing faster than automated defense
Perez, Huang, Song et al. (2022, arXiv:2202.03286) pioneered automated red teaming using LLMs to generate adversarial test cases. Anthropic's follow-up (Ganguli et al., 2022, arXiv:2209.07858) released 38,961 red team attacks and showed RLHF models become increasingly difficult to red-team at scale. OpenAI's "Diverse and Effective Red Teaming with Auto-generated Rewards" (2024) factorizes the problem into diverse goal generation plus RL training. The gap between automated attack generation capability and automated defense capability continues to widen, particularly for multimodal attacks (FlipAttack demonstrated prompt injection through images against GPT-4V, Claude 3, and Gemini in August 2024).
Critical open problems at a glance
- Prompt injection lacks a parameterized-query equivalent. The fundamental instruction-data conflation in LLMs has no known general solution.
- Security-utility tradeoffs have no consensus. CaMeL's 77% vs. 84% task completion gap is non-trivial; the field needs formal frameworks for reasoning about acceptable security costs.
- Multimodal injection is under-defended. Text-only defenses do not transfer to continuous input domains (images, audio), which are often easier to attack.
- MCP and agent protocol security is immature. ETDI, MCP-Scan, and Trail of Bits' mcp-context-protector represent early efforts, but cryptographic tool attestation, change detection, and cross-server isolation remain undeployed at scale.
- Instruction hierarchies are unreliable. Control Illusion (AAAI 2026) demonstrates that trained hierarchies fail in practice, with models prioritizing implicit societal hierarchies over explicit role designations.
- Monitoring for slow, multi-turn, cross-session attacks is nascent. The persistent memory poisoning attack against ChatGPT (Rehberger, 2024) showed that exfiltration can persist across sessions—detection requires longitudinal behavioral analysis that few production systems implement.
Conclusion: deterministic architecture, not probabilistic detection
The central lesson from three years of attacks on RAG and agent systems is that probabilistic defenses fail against adaptive adversaries. Every guardrail, classifier, and detection model tested in adversarial conditions has been bypassed at rates unacceptable for security-critical applications. The path forward lies in deterministic architectural controls—capability-based security (CaMeL), type-directed privilege separation, authenticated prompts, and mandatory access control (SEAgent)—that make security guarantees independent of LLM behavior. These approaches accept a meaningful utility cost (7–23% task completion reduction in current implementations) in exchange for provable properties.
In the interim, defense-in-depth remains essential: spotlighting reduces attack success rates by 50× at the data boundary; constitutional classifiers cut jailbreak success to 4.4% at the model boundary; output sanitization eliminates the most dangerous exfiltration channels; and tool pinning with least-privilege access limits blast radius. No layer is sufficient alone. Composition is mandatory.
The threat landscape will intensify as agent autonomy increases, tool ecosystems expand via MCP and similar protocols, and multimodal inputs create new injection surfaces. Organizations deploying RAG/agent systems should treat the OWASP LLM Top 10 and Agentic AI Top 10 as minimum baselines, implement monitoring for the specific attack patterns documented in this article, and invest in architectural patterns that provide structural rather than statistical security. The field's trajectory points toward a future where the LLM is treated as an untrusted component within a verified execution environment—much as user input has been treated in web security for the past two decades. The sooner that architectural paradigm shift occurs, the sooner the industry can move beyond the current arms race.