Back to Explainers
AI SecurityExplainerApril 29, 2026Yellow — detail controls

What Is RAG Data Exfiltration? Retrieval-Channel Attacks on LLM Apps

Quick Answer

RAG data exfiltration is a class of attack where a retrieval-augmented generation system leaks sensitive data because of what it retrieved. An attacker plants instructions or formatting payloads in the corpus; a legitimate user query pulls them into the model's context; the model echoes attacker text, emits markdown that smuggles data out, or repeats the system prompt. The vector store becomes an instruction surface.

What Is RAG Data Exfiltration? Retrieval-Channel Attacks on LLM Apps

Retrieval-augmented generation (RAG) is the dominant pattern for grounding LLM applications in private data, and it has a structural weakness: the retrieval channel is an instruction channel. An attacker who can influence what ends up in the corpus can influence what the model does on a future query, for a future user. This page is about that attack class, what it costs to defend against, and what the realistic controls look like.

What is RAG data exfiltration?

A RAG system fetches documents from a vector store at query time and inserts them into the model's context window so the model can answer with grounded source material. RAG data exfiltration is the class of attack where the model leaks data — secrets, private documents, prior-user content, or its own system prompt — because of what was retrieved. Either the retrieved chunk carries an instruction the model follows, or it steers the model into formatting its answer so the answer itself becomes the exfiltration channel.

The cleanest framing: the vector store is an instruction surface, not a knowledge surface. Anything that lands in a chunk that lands in the context window can change what the model does next.

This is a specific shape of indirect prompt injection; the carrier is a retrieved chunk, the trigger is a legitimate user query that happens to match it, and the exfiltration target is whatever else the model can see in that turn.

How does it work?

A representative attack proceeds in two phases. The attacker may control either.

  1. Ingest-time corpus poisoning. The attacker plants content the indexer will process. Channels include any document an employee can drop into a crawled shared drive, any web page the scraper fetches, any support ticket or comment the application stores in its knowledge base, and any partner-supplied feed the pipeline consumes. Bulk write access is not required. Published research shows that a small number of carefully crafted texts — single-digit chunks against millions of benign chunks — is enough to dominate retrieval for a target query.
  2. Query-time firing. A legitimate user asks a question whose semantic neighborhood overlaps the poisoned chunk. The retriever returns it alongside genuine sources. Nothing in the retrieval pipeline changes the privilege of the chunk's tokens, so the model reads its instructions as instructions.

Once the poisoned chunk is in context, the model has three ways to leak:

  • Direct echo. The answer includes attacker-controlled content because the chunk told it to. The user reads attacker text presented as authoritative source material.
  • Side-channel formatting. The model emits a markdown image or rich-text element whose URL embeds data the model can see — the user's prompt, other retrieved chunks, the system prompt, prior conversation. When the user's client renders the markdown, the browser fetches the URL and the embedded data leaves the network. Image fetches do not require a click; this is the canonical zero-click pattern.
  • Behavioral steering. The model takes an action it would not otherwise have taken, and the action's side effects move data outside the trust boundary. This shades into tool-call abuse and is treated as a separate attack class.

Two adjacent mechanisms are worth naming. Embedding inversion sits underneath the corpus: published work shows that for short documents, the original text can be reconstructed from the stored embedding with high fidelity. A vector store holding embeddings of sensitive documents is, under a threat model where the vectors are readable, holding the documents themselves. Retrieval-rank manipulation targets the retriever rather than the model; attackers craft chunks whose embeddings are tuned to dominate dense-retrieval similarity for a target query class, displacing legitimate sources so the model "grounds" its answer in attacker text.

Three properties separate retrieval-channel attacks from interactive prompt injection. Persistence: a poisoned chunk fires on every matching query, for every user, until something removes it. Provenance laundering: by the time the model sees the chunk, the retrieval pipeline has blessed it; the model has no signal that it came from a less-trusted source. Cross-tenant blast radius: a single poisoned chunk in a multi-tenant corpus affects every tenant whose query touches it, unless retrieval is partitioned per tenant — and partitioning fights the cross-domain semantic search the corpus exists to enable.

Why does it matter?

The blast radius is shaped like the corpus and the model's other context. A typical enterprise RAG application gives an attacker who lands a poisoned chunk:

  • Data exfiltration via formatting. Anything in the model's context — the user's question, other retrieved chunks, system prompt, prior turns — can be encoded into an outbound URL the user's client fetches.
  • Misinformation under authority. The system presents attacker text as if it came from the official knowledge base. Customers, employees, or downstream automation consume it as fact.
  • System prompt and configuration leakage. A poisoned chunk can ask the model to repeat its instructions; if those instructions contain credentials, internal endpoints, or policy logic, those leave with the answer.
  • Cross-user data leaks. In multi-tenant systems where retrieval is shared, one tenant's poisoned chunk can steer answers given to other tenants, and one tenant's documents can be retrieved into another tenant's context if access controls are not enforced at the retrieval layer.
  • Persistent compromise. Unlike a one-off jailbreak, a poisoned chunk fires every time the corpus is queried. The compromise outlives any individual session.

The audit trail is usually thin. Most RAG systems log what the model answered, sometimes which chunks were retrieved, and rarely which chunk drove which token. Operators typically learn about a poisoning incident from a downstream symptom — a strange customer-facing answer, an outbound request to an unfamiliar domain in egress logs — rather than from the retrieval layer itself.

This explainer covers class behavior. Working retrieval-injection payload strings, vendor-specific bypasses tied to named vector databases or RAG products, and exploit chains that pair retrieval poisoning with credential theft are withheld pending vendor coordination on related disclosures.

How do you defend against it?

The structural fix is to stop treating retrieved text as content and start treating it as untrusted input — at every layer of the pipeline. In rough order of leverage:

  1. Treat retrieved chunks as data, not instructions. Render every retrieved chunk inside an explicit untrusted-content envelope in the prompt and instruct the model that text inside the envelope is reference material, not directive. Cost: prompt-engineering work, slightly more tokens. Does not cover: a sufficiently persuasive instruction that the model follows anyway. Pair with downstream controls.
  2. Sanitize model output, especially markdown and links. Markdown image rendering is the single highest-leverage exfiltration channel. Strip or proxy markdown images, allowlist URL domains in rendered output, and disable rich-text rendering for tokens that did not originate from a trusted source. Cost: product polish; users cannot inline arbitrary images. Does not cover: exfiltration through other channels such as tool calls.
  3. Enforce retrieval access control. Document-level permissions must be enforced at retrieval time, not just at storage time. Pre-filter the candidate set against the calling user's identity, or post-filter retrieved chunks against permission metadata before they enter the prompt. Cost: identity plumbing into the retrieval layer, possible relevance loss across partitions. Does not cover: attacks where the legitimate caller is the one being phished by their own retrieved content.
  4. Monitor corpus integrity and chunk provenance. Tag every chunk with its source, ingest timestamp, ingest principal, and content hash. Anomaly-score new chunks against the existing distribution; chunks whose embeddings sit in unusual regions of latent space deserve human review before they go live. Re-scan periodically; documents that are legitimate at ingest and modified later are a known pattern. Cost: indexing and storage overhead. Does not cover: attacker content statistically indistinguishable from legitimate content.
  5. Constrain ingest sources. Inventory every ingest path. A public GitHub issue, a partner-supplied feed, a customer support transcript, and an internal wiki page have different trust properties even when the indexer treats them identically. Per-source policies, such as not indexing user-generated content into the same partition as official documentation, cut the blast radius. Cost: operational discipline, sometimes loss of useful coverage. Does not cover: poisoned content arriving through a source you do trust.
  6. Protect embeddings as if they were the documents. If embedding inversion is in scope, treat vectors as sensitive material. Apply quantization, differential-privacy noise calibrated against retrieval quality, or per-tenant encryption at rest. Cost: retrieval quality tradeoff, key-management complexity. Does not cover: attackers who can also query the model and read its answers.
  7. Log retrieval and output together. Capture the retrieval set, chunk-to-output influence, and any URLs the model emitted side-by-side with the user's question and the final answer. Cost: storage, privacy review. Does not cover: prevention; this is for diagnosis after the fact.
  8. Apply a content security policy on the rendering client. The chat client is the last line of defense. CSP that restricts which domains can be fetched, server-side image proxying, and stripping outbound links from model output close the markdown-image side channel even when the model misbehaves. Cost: product friction on legitimate inline media. Does not cover: attackers who exfiltrate through allowed domains.

Classifier-based guardrails that try to detect prompt-injection content inside retrieved chunks help at the margin and reduce frequency. They do not change the structural property that the model treats retrieved tokens as instructions, and adaptive attacks routinely bypass them. Treat them as defense-in-depth, not as the primary control.

Related concepts and tools

FAQ

Is RAG poisoning the same as training-data poisoning?

No. Training-data poisoning corrupts model weights at training time and persists in the model itself. RAG poisoning corrupts the retrieval corpus at runtime; the model is untouched, and the poisoned context is read fresh on every query. The two attack classes share a name but have different blast radii, different remediation paths, and different defenders.

Can I prevent this by trusting only my own documents?

Helps, but it is not a control on its own. Most enterprise RAG corpora ingest user-generated content, support tickets, scraped pages, partner feeds, or anything an employee can drop into a shared drive the indexer crawls. Internal-only is a perimeter, not a guarantee. Per-source partitioning and ingest review still apply.

How do I detect retrieval-channel attacks?

Detection sits at three places. At ingest, anomaly-score new chunks against the existing corpus distribution and tag provenance. At retrieval, log which chunks were returned for which query and which user. At output, sanitize markdown images and outbound links before the user's client renders them, and capture any URLs the model emitted alongside the answer.

Derived From

Related Work

External References

FAQ

Is RAG poisoning the same as training-data poisoning?

No. Training-data poisoning corrupts model weights at training time and persists in the model itself. RAG poisoning corrupts the retrieval corpus at runtime; the model is untouched, and the poisoned context is read fresh on every query. The two attack classes share a name but have different blast radii, different remediation paths, and different defenders.

Can I prevent this by trusting only my own documents?

Helps, but it is not a control on its own. Most enterprise RAG corpora ingest user-generated content, support tickets, scraped pages, partner feeds, or anything an employee can drop into a shared drive the indexer crawls. Internal-only is a perimeter, not a guarantee. Per-source partitioning and ingest review still apply.

How do I detect retrieval-channel attacks?

Detection sits at three places. At ingest, anomaly-score new chunks against the existing corpus distribution and tag provenance. At retrieval, log which chunks were returned for which query and which user. At output, sanitize markdown images and outbound links before the user's client renders them, and capture any URLs the model emitted alongside the answer.