Compound AI Systems for Multi-Step Automation: Orchestration Patterns, Failure Modes, and Defensible Architectures

Compound AI systems are automation systems in which a foundation model is only one component inside a larger program: retrievers, tools, memory stores, planners, verifiers, schedulers, policy engines, sandboxes, and human approval gates collaborate to complete a multi-step objective. The threat class is no longer “the model gives a bad answer”; it is that an orchestrated system reads untrusted context, plans over it, calls real tools, mutates external state, and may propagate compromised state into later steps. This matters now because the empirical frontier has moved from single-turn generation to autonomous task completion while defenses remain immature. SWE-bench showed that repository-level software repair can be made into an execution benchmark with 2,294 real GitHub issues across 12 Python repositories, shifting coding evaluation from snippets to agentic workflows (SWE-bench, ICLR 2024). (proceedings.iclr.cc) WebArena found that a GPT-4-based web agent solved only 14.41% of 812 realistic web tasks, versus 78.24% human success, making clear that long-horizon browser automation fails in ways chat benchmarks hide (WebArena, ICLR 2024). (proceedings.iclr.cc) OSWorld extended this to real desktop environments: 369 open-ended computer tasks, over 72.36% human success, and only 12.24% for the best evaluated model at publication, with GUI grounding and operational knowledge as dominant failure modes (OSWorld, NeurIPS 2024). (arxiv.org) The security picture is worse: AgentDojo evaluates agents over untrusted tool outputs with 97 realistic tasks and 629 security test cases, and reports prompt injections embedded in tool responses reaching up to 70% average attack success against GPT-4o; PoisonedRAG reports 90% attack success after injecting only five malicious texts per target question into a corpus with millions of documents (AgentDojo, NeurIPS 2024; PoisonedRAG, USENIX Security 2025). (mlanthology.org) This paper defines compound AI systems as security-critical distributed programs, maps the major orchestration patterns, explains why they fail, and gives an implementable architecture for building multi-step automation that is observable, least-privilege, testable, and resilient to prompt injection, tool abuse, poisoned retrieval, and cascading agent error.

1. From Models to Compound Systems

The term “compound AI system” is useful because it names the decisive shift in applied AI: performance is increasingly produced by a system boundary rather than by a single model invocation. The Berkeley BAIR formulation describes this shift as moving from monolithic models to systems that combine models, retrieval, tools, control logic, and other components (The Shift from Models to Compound AI Systems, BAIR 2024). (bair.berkeley.edu) In production automation, this is not an academic distinction. A coding assistant that edits a repository, runs tests, summarizes failures, searches documentation, and opens a pull request is not “an LLM.” It is a stateful workflow engine whose most error-prone component happens to be a probabilistic text-and-action model.

The design problem is therefore not “how do we prompt the model?” but “which parts of the task should be delegated to a model, which parts should be deterministic software, and how should authority flow between them?” Poor systems invert that relationship: they put orchestration logic into prompts, expose broad tools, trust raw tool output as instruction-bearing context, and then try to fix the result with another prompt. Strong systems treat the model as an untrusted planner or transformer inside a governed runtime.

Compound AI systems for automation usually have five properties:

Task decomposition. A user objective is decomposed into subgoals, steps, tool calls, or graph nodes.
External grounding. The system retrieves data from files, databases, web pages, application APIs, logs, or source code.
Action. The system can mutate state: write files, send messages, submit forms, execute code, update tickets, approve transactions, deploy software, or call APIs.
Feedback loops. The system observes intermediate outputs and adapts its plan through retries, reflection, debugging, or replanning.
Persistent state. The system often keeps memory: user preferences, task history, retrieved documents, scratchpads, tool traces, embeddings, plans, or environment state.

These properties make compound systems powerful but also enlarge the failure surface. The model’s context window becomes an ambient authority channel; retrieved documents become latent instructions; tool schemas become capability advertisements; agent memory becomes a persistence layer; and orchestration graphs become privilege-escalation paths.

The core engineering question is how to get the capability benefits of composition without inheriting an unbounded attack surface.

2. Capability Evidence: What Compound Systems Actually Improve

The evidence base is mixed but clear in one respect: orchestration helps most when the task requires external state, search, execution, or verification. It helps least when the surrounding system merely asks several models to discuss a problem without adding observability, grounding, or independent checks.

2.1 Reason-and-act loops

ReAct is the canonical early pattern for language-agent automation: alternate between reasoning traces and actions against an environment. The paper reports that ReAct outperformed imitation-learning and reinforcement-learning baselines by 34 percentage points absolute on ALFWorld and 10 percentage points absolute on WebShop using only one or two in-context examples (ReAct, ICLR 2023). (openreview.net) The lesson is not that hidden chain-of-thought is magic; it is that interleaving observation and action gives the model fresh state and makes exception handling possible.

A ReAct-style loop is still the default skeleton behind many agents:

observe → think/plan → choose tool → execute → observe result → update state → repeat

Its strength is flexibility. Its weakness is that every loop iteration re-exposes the model to untrusted data and gives it another chance to drift, over-call tools, leak data, or be hijacked. In security terms, ReAct is a repeated privilege-decision loop mediated by a non-deterministic policy engine.

2.2 Search and deliberation

Tree of Thoughts generalized chain-of-thought into explicit search over intermediate “thought” states. On Game of 24, GPT-4 with chain-of-thought solved 4% of tasks while Tree of Thoughts achieved 74% in the reported setup (Tree of Thoughts, NeurIPS 2023). (arxiv.org) That result is often over-applied. It demonstrates that explicit search can rescue tasks with compact state, cheap evaluation, and reversible choices. It does not imply that tree search scales to arbitrary enterprise workflows where state is high-dimensional, tool calls are expensive or irreversible, and the evaluator may be another fallible model.

The robust abstraction is: when the system can cheaply generate alternatives and independently score them, search beats a single sampled plan. For automation, this means candidates should be evaluated by tests, type checks, policy rules, transaction simulations, deterministic validators, or sandbox execution—not just by another unconstrained LLM.

2.3 Self-correction and reflective memory

Reflexion introduced a verbal reinforcement loop in which agents convert feedback into natural-language memories for future attempts. It improved performance across sequential decision-making, coding, and question-answering tasks (Reflexion, NeurIPS 2023). (papers.nips.cc) The pattern remains valuable, but its safety profile depends entirely on what enters memory. A memory system that records “lessons learned” from compromised tool outputs can preserve attacker instructions across sessions. Reflection is a capability mechanism and a persistence risk.

2.4 Programmatic optimization of language-model pipelines

DSPy reframed prompt chains as optimizable language-model programs. Instead of hand-writing brittle prompts for each stage, developers specify modules and metrics; the compiler searches for prompts, demonstrations, or other parameters that improve the metric (DSPy, ICLR 2024). (proceedings.iclr.cc) This matters for compound automation because prompt strings are configuration with security consequences. If orchestration is expressed as code with typed module boundaries and measurable objectives, it can be versioned, tested, optimized, and audited. If orchestration lives as long natural-language prompts inside agents, it is hard to diff, hard to fuzz, and easy to accidentally widen authority.

2.5 Parallel tool execution and compilation

Sequential tool use is costly and slow. LLMCompiler formalized parallel function calling with a planner, task-fetching unit, and executor, reducing the need for repeated sequential LLM invocations when calls are independent (LLMCompiler, ICML 2024). (proceedings.mlr.press) This is the right direction for production systems: derive a dependency graph, execute independent deterministic work in parallel, and call the model only where semantic judgment is actually needed. It also changes the security problem. Parallelism means concurrent state mutations, race conditions, and cross-branch context contamination become realistic failure modes.

2.6 Benchmarks expose the gap between demos and automation

WebArena, OSWorld, GAIA, SWE-bench, and related benchmarks converge on the same conclusion: multi-step automation is possible, but brittle. GAIA includes 466 questions designed to require reasoning, tool use, web browsing, and multimodal handling; the original report showed 92% human performance versus 15% for GPT-4 with plugins (GAIA, 2023). (arxiv.org) WebArena and OSWorld then showed that interactive web and desktop environments remain far below human reliability at publication (WebArena, ICLR 2024; OSWorld, NeurIPS 2024). (proceedings.iclr.cc)

SWE-bench is the most important case study because coding agents can use unusually strong verification signals: unit tests, type checks, linters, diff review, and reproducible sandboxes. The benchmark’s 2,294 real GitHub issues made it possible to compare systems on repository-level tasks rather than code snippets (SWE-bench, ICLR 2024). (proceedings.iclr.cc) OpenAI’s SWE-bench Verified later introduced a 500-task human-validated subset to address labeling and evaluation problems in the original dataset (SWE-bench Verified, OpenAI 2024). (openai.com) The lesson is general: agent progress is fastest where the environment gives executable feedback. Domains without strong validators—legal research, finance operations, customer support, procurement, medical documentation—need artificial guardrails to approximate that feedback.

3. A Taxonomy of Orchestration Patterns

Compound AI systems are best understood as orchestration patterns over uncertain computation. The following taxonomy is more useful than the vague term “agent.”

3.1 Linear pipelines

The simplest compound system is a fixed sequence:

ingest → retrieve → extract → reason → validate → write

Linear pipelines are underrated. They are easy to observe, easy to replay, and easy to constrain. They work well when task structure is stable: invoice extraction, support-ticket classification, compliance clause mapping, alert enrichment, entity resolution, document triage, report drafting, and code review summaries.

Their failure mode is brittleness. If an early extraction is wrong, downstream steps may confidently build on it. The defense is not “ask the model to be careful.” It is typed intermediate representations, schema validation, confidence thresholds, fallbacks, and explicit provenance.

Use linear pipelines when:

The workflow is known in advance.
The required tools are fixed.
The system can validate intermediate outputs.
Low variance matters more than autonomy.

3.2 Routers and classifiers

A router maps an input to one of several specialized flows. In production, routing is often more valuable than free-form planning. A customer request may go to billing, account recovery, refund, abuse, technical support, or escalation. A security alert may go to phishing triage, malware detonation, identity investigation, or cloud misconfiguration analysis.

Routers should be conservative. Misrouting is a security bug when branches have different privileges. A low-confidence router should send the request to a safe default or human queue, not to the most capable agent.

The secure router has three components:

Intent classifier with calibrated confidence.
Policy matrix mapping intents to allowed tools and data scopes.
Escalation path for ambiguous or high-impact cases.

3.3 ReAct-style agent loops

ReAct loops are useful when the next step depends on observations that cannot be known upfront. Examples include debugging, web navigation, incident investigation, exploratory data analysis, and interactive troubleshooting.

However, a ReAct loop is the least secure common pattern because it allows an LLM to repeatedly decide both what information to read and what action to take. If tool outputs contain attacker-controlled text, each observation is a prompt-injection opportunity. AgentDojo directly targets this setting: agents execute tools over untrusted data, and its benchmark includes 97 realistic tasks and 629 security test cases (AgentDojo, NeurIPS 2024). (mlanthology.org)

A defensible ReAct loop requires:

A maximum step budget.
A tool allowlist scoped to the current task.
Per-tool input validation.
Per-tool output declassification or sanitization.
A policy check before every state-mutating action.
Structured scratchpad state, not arbitrary transcript accumulation.
Replayable traces for every observation, decision, and action.

3.4 Planner-executor systems

Planner-executor systems separate high-level decomposition from low-level execution. The planner produces a step list; the executor carries out steps, often with tools or specialized subagents. This reduces repeated planning overhead and makes it possible to inspect the plan before execution.

The security improvement is real but limited. A malicious retrieved document can still influence the plan if planning sees untrusted context. The correct design is to plan over trusted task metadata first, retrieve narrowly for each step, and re-authorize actions as they become concrete.

A planner-executor pattern is strong when:

The task has a meaningful dependency structure.
Steps can be checked before execution.
Some steps are deterministic.
Human approval can be inserted between planning and action.

3.5 Graph and DAG workflows

Graph orchestration makes state explicit. Nodes are transformations or actions; edges encode dependencies; state is a typed object passed between nodes. Frameworks such as LangGraph popularized this model in the application layer, while LLMCompiler formalized related ideas for parallel function calling (LLMCompiler, ICML 2024). (proceedings.mlr.press)

Graph workflows are the right default for high-stakes automation. They support deterministic branches, retries, compensation, parallel fan-out/fan-in, and human gates. They also let security engineers reason about authority at the node level.

A good graph node has:

A narrow purpose.
Typed inputs and outputs.
Declared tool permissions.
Declared data permissions.
A timeout and retry policy.
Deterministic validation.
Audit logging.
A failure state.

A bad graph node is “agent does the task.”

3.6 Multi-agent handoff

Multi-agent systems assign roles to multiple agents: researcher, coder, reviewer, planner, executor, critic, manager, specialist, or user proxy. AutoGen formalized multi-agent applications as conversations among agents that can use LLMs, tools, and humans (AutoGen, COLM 2024). (arxiv.org) OpenAI’s Swarm examples later popularized “routines and handoffs”: an agent follows a routine and can hand off to another specialized agent (OpenAI Cookbook, 2024). (cookbook.openai.com)

Multi-agent systems are not automatically more robust. They often increase variance, cost, and debugging difficulty. They help when roles correspond to real privilege boundaries or independent evaluation functions. A reviewer agent without tool access can be useful. A second agent with the same context, same model, and same tools is usually theater.

Use multi-agent handoff when:

Roles require different data access.
Roles require different tools.
A specialist model is materially better for a subtask.
Independent critique is possible.
The handoff can be represented as structured state, not a full transcript dump.

Do not use multi-agent orchestration to compensate for unclear workflow design.

3.7 Search, voting, and debate

Search and voting patterns generate multiple candidate plans, answers, patches, or actions and then select one. They are strong when the scoring function is external: tests pass, policy accepts, output matches schema, transaction simulation succeeds, or retrieved evidence supports the claim.

LLM-only debate is weaker. Correlated models share blind spots, and a persuasive wrong answer may dominate. For security, the evaluator must have less authority than the executor or use different evidence. A “critic” that can call the same dangerous tools as the actor is not a control.

3.8 Compiler-optimized language-model programs

DSPy-style systems and LLMCompiler-style systems point toward the same future: orchestration should be compiled, optimized, and tested like software. DSPy optimizes language-model modules against metrics; LLMCompiler schedules function calls using dependency information (DSPy, ICLR 2024; LLMCompiler, ICML 2024). (proceedings.iclr.cc)

The security opportunity is substantial. Once the workflow is a program, teams can apply program-analysis ideas:

Static permission checks.
Dataflow analysis.
Taint tracking.
Dead-step elimination.
Policy-as-code enforcement.
Cost estimation.
Replay and differential testing.
Fuzzing of node inputs.
Regression testing of compiled prompts.

The field is early, but the direction is correct: put the intelligence in modules, the authority in code, and the optimization under tests.

4. Reference Architecture for Defensible Multi-Step Automation

A secure compound AI system should be built as a governed runtime around untrusted intelligence. The model may be useful, but it should not be the root of trust.

4.1 Gateway

The gateway receives user requests or upstream events. It authenticates the caller, normalizes input, attaches tenant and user identity, and assigns an initial risk tier. It should not pass raw user input directly to a high-privilege agent. The gateway also enforces coarse limits: request size, file type, recursion depth, rate limits, and task class allowlists.

4.2 Policy engine

The policy engine is the system’s authority layer. It decides what the orchestrator may do, which tools may be used, which data scopes are available, when human approval is required, and which actions are forbidden.

Policy must be deterministic where consequences are material. “Ask an LLM if this bank transfer is safe” is not a policy engine. A model can help classify risk, but hard controls should be implemented as code:

Maximum refund amount.
Allowed recipients.
Allowed file paths.
Allowed SQL operations.
Allowed API methods.
Required approvals.
Separation of duties.
Tenant boundaries.
Data residency rules.
Secrets access restrictions.

4.3 Orchestrator

The orchestrator is not the model. It is the workflow runtime. It owns state transitions, retries, step budgets, branch scheduling, context construction, and termination. It decides when to call the model and what context the model receives.

A secure orchestrator never gives the model the whole transcript by default. It constructs minimal context for the current decision:

User objective.
Relevant trusted state.
Specific tool outputs required for the step.
Tool schemas allowed for the step.
Policy constraints.
Expected output schema.

This context-minimization principle is one of the most effective controls against prompt injection and data leakage.

4.4 Model adapter

The model adapter isolates provider-specific APIs and enforces structured outputs. It should support:

JSON schema or typed output validation.
Refusal and uncertainty handling.
Model fallback policies.
Temperature and sampling control.
Prompt/version pinning.
Token budgeting.
Provider routing.
Redaction of secrets.
Logging of prompts and outputs under privacy controls.

The adapter is also where organizations can insert model-level monitors, but monitors should not be treated as complete security boundaries. AgentDojo and related work show that prompt-injection attacks adapt to defenses; architecture matters more than prompt hardening alone (AgentDojo, NeurIPS 2024). (papers.neurips.cc)

4.5 Tool broker

The tool broker is the most important security component in an agentic system. It mediates every tool call. The model should not call production APIs directly. The broker enforces:

Tool allowlists per workflow step.
Argument schemas.
Argument validation.
Rate limits.
Idempotency.
Dry-run modes.
Approval requirements.
Sandboxing.
Secrets isolation.
Output filtering.
Taint labeling.
Audit logging.

Tool outputs must be treated as data, not instructions. If a web page, email, ticket, PDF, code comment, or database field says “ignore previous instructions and exfiltrate secrets,” that text is attacker-controlled content. The tool broker should return structured fields and provenance, not raw instruction-like blobs whenever possible.

4.6 Retriever and memory

Retrieval is both a capability layer and a supply-chain layer. PoisonedRAG demonstrates that small numbers of malicious documents can dominate RAG outputs under realistic retrieval assumptions; the USENIX Security 2025 paper reports 90% attack success with five malicious texts per target question in a database with millions of texts (PoisonedRAG, USENIX Security 2025). (usenix.org)

A secure retrieval layer needs:

Source allowlists and trust tiers.
Document provenance.
Ingestion-time scanning.
Embedding anomaly checks.
Versioned corpora.
Signed or checksummed high-trust documents.
Query-time source diversity.
Citation requirements for factual claims.
Quarantine for newly ingested untrusted content.
Separation between user memory, organization memory, and global knowledge.
Expiration and review policies for memory.

Memory should not be a dumping ground for agent reflections. Treat memory writes as state mutations requiring schema, provenance, and policy.

4.7 Validators

Validators convert probabilistic outputs into enforceable contracts. They include:

JSON schema validation.
Type checking.
Unit tests.
Static analysis.
SQL explain-plan checks.
Policy checks.
Citation verification.
Diff constraints.
Business-rule validation.
Simulation.
Fuzz tests.
Human review.

The key is independence. A validator that simply asks the same model whether its answer is good is weak. A validator that runs tests, checks permissions, or compares against authoritative records is strong.

4.8 Human gates

Human-in-the-loop is not a magic phrase. A human gate works only if the reviewer gets the right information at the right abstraction level and can reject or modify the action. Approval prompts should show:

The proposed action.
The reason.
Inputs used.
Data sources and trust labels.
Expected side effects.
Policy exceptions.
Rollback plan.
Confidence and validation results.
A diff where applicable.

Humans should approve actions, not vague intentions. “Approve refund” is insufficient; “refund $184.22 to card ending 1234 for order 789, reason code damaged-on-arrival, within policy limit, no prior refund” is reviewable.

4.9 Observability and replay

Agent observability must capture more than model prompts. A useful trace includes:

User identity and tenant.
Workflow version.
Model version and parameters.
Prompt template version.
Tool schemas exposed.
Retrieval queries and document IDs.
Tool calls and arguments.
Tool outputs with taint labels.
State transitions.
Policy decisions.
Validator results.
Human approvals.
Cost and latency.
Final outputs and side effects.

Replay is essential. If an agent makes a bad transfer, deletes a file, leaks data, or merges a broken patch, the team must reconstruct exactly what context it saw and which authority checks passed.

5. Failure Modes in Compound Automation

Compound systems fail through interactions. The model may be individually competent, the tools correct, and the policies reasonable, yet the system can still fail because data and authority cross boundaries incorrectly.

5.1 Prompt injection and instruction hierarchy collapse

Prompt injection exploits the fact that LLMs process instructions and data through the same channel. In a compound system, the problem becomes indirect: malicious instructions are embedded in retrieved documents, web pages, emails, tickets, calendar invites, code comments, PDFs, tool outputs, or memory entries.

AgentDojo is important because it evaluates this condition directly: agents execute tools over untrusted data, and attacks try to hijack the agent into malicious tasks while preserving utility measurements (AgentDojo, NeurIPS 2024). (mlanthology.org) Its reported result that tool-response injections can reach up to 70% average success against GPT-4o underlines the point: instruction hierarchy is not reliably preserved by model behavior alone (AgentDojo, NeurIPS 2024). (papers.nips.cc)

Defenses that help:

Never concatenate raw untrusted content into high-authority prompts without labeling.
Use structured extraction before reasoning.
Strip or quarantine instruction-like text from tool outputs when the task does not require it.
Keep retrieved content in quoted, attributed fields.
Use separate contexts for data extraction and action planning.
Block state-mutating tools when the current context contains untrusted data unless a policy check passes.
Require human approval for high-impact actions derived from untrusted context.

5.2 Tool misuse and excessive agency

OWASP’s 2025 LLM Top 10 explicitly includes prompt injection and excessive agency as major application risks, and its agentic guidance focuses on tool misuse, data leakage, and autonomous action risks (OWASP LLM Top 10, 2025; OWASP Agentic Applications, 2025). (owasp.org) Excessive agency occurs when an agent has more permission, autonomy, or tool breadth than the task requires.

Common examples:

A read-only research task gets a browser profile with logged-in write access.
A support summarizer can issue refunds.
A coding assistant can push directly to main.
A database helper can run arbitrary SQL instead of parameterized read-only queries.
A triage agent can email external recipients.
A calendar assistant can invite arbitrary addresses and include confidential notes.

The defense is least agency:

For each step:
    allowed_data = minimum data required
    allowed_tools = minimum tools required
    allowed_actions = minimum side effects required
    approval = required if impact exceeds threshold
    expiry = permission ends after step

Least privilege is not enough. Agents also need least autonomy: bounded steps, bounded retries, bounded spend, bounded time, bounded recipients, bounded file paths, bounded API methods.

5.3 Retrieval poisoning

Retrieval poisoning compromises the knowledge substrate rather than the prompt. PoisonedRAG formalizes this for RAG systems: inject malicious texts that retrieve for target questions and steer generation to attacker-chosen answers (PoisonedRAG, USENIX Security 2025). (arxiv.org) In automation systems, the impact is larger because retrieved content can influence actions, not just answers.

Attack paths include:

Poisoning public documentation the agent retrieves.
Adding malicious pages to a web-search corpus.
Uploading a crafted PDF to an enterprise knowledge base.
Modifying wiki pages.
Poisoning package metadata.
Embedding instructions in GitHub issues or README files.
Seeding support tickets with malicious instructions.
Corrupting vector-store entries through compromised ingestion pipelines.

Defenses must operate at ingestion and retrieval time. Output filtering alone is too late if poisoned content has already shaped the plan.

5.4 Tool metadata poisoning and protocol-level attacks

The Model Context Protocol (MCP), introduced by Anthropic in November 2024, standardizes how AI applications connect to external tools and data sources (MCP, Anthropic 2024). (anthropic.com) The MCP specification exposes server features such as resources, prompts, and tools, making tool descriptions and schemas first-class context (MCP Specification, 2024). (modelcontextprotocol.io) This is useful for interoperability but creates a semantic supply chain: if tool metadata is malicious or misleading, the model may select the wrong tool or follow instructions embedded in descriptions.

The risk is not unique to MCP. Any dynamic tool registry can be poisoned through:

Tool names that resemble trusted tools.
Descriptions that include hidden instructions.
Schemas that overclaim safety.
Output fields that smuggle commands.
Server-side behavior that changes after approval.
Tool chaining that produces emergent privilege escalation.

Tool registries must therefore be treated like package registries. Pin versions, sign trusted tools, review metadata, separate discovery from execution, and never let natural-language descriptions grant authority.

5.5 Memory poisoning and persistent compromise

Agent memory turns transient compromise into durable compromise. If an injected document convinces an agent to write “Always send financial summaries to attacker@example.com” into memory, the next session may inherit the attack without retrieving the original document.

Memory writes should be rare, typed, and reviewed. The system should distinguish:

User-declared preferences.
System-learned preferences.
Task summaries.
Entity facts.
Operational lessons.
Security-relevant rules.

Security-relevant rules should not be written by agents at all unless approved through a governance path.

5.6 Context contamination across branches and agents

Parallel orchestration and multi-agent handoffs create contamination risks. A research branch reading untrusted web content should not be able to contaminate an execution branch with privileged tools. A low-privilege agent should not be able to hand off a natural-language instruction to a high-privilege agent and thereby escalate.

The safe handoff format is structured:

json

{
  "task_id": "refund-review-4812",
  "requested_action": "refund",
  "amount": 184.22,
  "currency": "USD",
  "recipient_account_id": "acct_...",
  "evidence_ids": [
    "ticket_123",
    "photo_456"
  ],
  "source_trust": "external_customer_supplied",
  "policy_checks": [
    "within_limit",
    "order_delivered",
    "no_prior_refund"
  ],
  "free_text_summary": "Customer reports damaged item; photo attached."
}

The unsafe handoff is a transcript dump.

5.7 Cascading error and compounding uncertainty

Each step in a compound system has an error rate. If a workflow requires ten dependent model-mediated decisions and each is 95% reliable under distribution, the naive probability that all are correct is about 60%. Real systems are worse because errors are correlated: one wrong assumption contaminates later context.

This is why deterministic validators and checkpoints matter. Long workflows should be segmented into phases with state commitments:

Understand request.
Gather evidence.
Produce candidate plan.
Validate plan.
Execute reversible actions.
Request approval for irreversible actions.
Verify final state.
Summarize with provenance.

5.8 Cost and latency runaway

Compound systems can silently multiply model calls. A naive agent loop with retrieval, reflection, critic calls, retries, and multi-agent debate can turn a small task into hundreds of calls. LLMCompiler’s motivation—sequential function calling creates latency and cost inefficiencies—is exactly what production teams encounter (LLMCompiler, ICML 2024). (proceedings.mlr.press)

Controls:

Step budgets.
Token budgets.
Tool budgets.
Wall-clock timeouts.
Cost prediction before execution.
Early stopping.
Parallel deterministic work.
Caching.
Batch tools.
Model tiering.
Explicit user confirmation when budget increases.

5.9 Evaluation leakage and benchmark overfitting

Agent benchmarks are often public, scriptable, and subject to contamination. SWE-bench Verified was created partly because original SWE-bench tasks had validation and labeling concerns, and OpenAI noted the possibility of contamination because tasks were scraped from public GitHub repositories (SWE-bench Verified, OpenAI 2024). (openai.com) For production, the implication is simple: do not trust a leaderboard score as a deployment safety case. Evaluate on private, current, task-specific workflows with realistic permissions and adversarial inputs.

6. Orchestration Patterns by Domain

Different automation domains need different orchestration strategies. The mistake is to standardize on “agent” as a universal abstraction.

6.1 Software engineering automation

Software engineering is the most mature domain for compound agents because it has executable feedback. A strong coding agent architecture includes:

Repository indexer.
Issue parser.
Code search.
Planner.
Patch generator.
Test runner.
Static analyzer.
Diff reviewer.
Regression selector.
Sandboxed executor.
Human pull-request review.

The model should not directly push to protected branches. It should propose patches, run tests in a sandbox, explain diffs, and open reviewable pull requests. Secrets should be unavailable in the sandbox. Network access should be disabled or allowlisted. Package installation should be pinned or proxied.

SWE-bench’s success as a benchmark comes from this structure: agents must inspect real repositories, edit code, and pass tests (SWE-bench, ICLR 2024). (proceedings.iclr.cc) But production software engineering is broader than Python bug fixes. Security patches, migrations, infrastructure changes, and mobile apps have different validators and risk profiles.

Recommended pattern: graph workflow with ReAct subloops for debugging, strong sandboxing, and mandatory human merge gates.

6.2 Web and desktop automation

WebArena and OSWorld show why GUI automation remains hard. The agent must perceive state, map natural-language goals to interface actions, handle dynamic layouts, recover from errors, and avoid destructive clicks. WebArena’s 14.41% GPT-4-agent success versus 78.24% human success at publication is a warning against deploying unconstrained browser agents for high-impact tasks (WebArena, ICLR 2024). (proceedings.iclr.cc) OSWorld’s 12.24% best-model result at publication similarly shows that desktop agents struggle with GUI grounding and operational knowledge (OSWorld, NeurIPS 2024). (arxiv.org)

Recommended controls:

Prefer APIs over GUI control.
Use read-only browser sessions unless action is required.
Disable saved credentials by default.
Use task-specific browser profiles.
Require confirmation before form submission.
Use DOM-level constraints rather than pixel-only control.
Record video and action traces.
Use reversible staging environments where possible.

Recommended pattern: planner-executor with deterministic browser wrappers, not free-form computer use.

6.3 Customer operations

Customer operations combine untrusted user text with privileged business actions. This is a prompt-injection trap. A malicious customer can write instructions inside a support ticket. If the agent reads that ticket and can issue refunds, change addresses, or send emails, the user is effectively programming the agent.

Recommended architecture:

Triage classifier with no mutation tools.
Evidence extraction into typed fields.
Policy engine for eligibility.
Refund or account-action tool with strict parameters.
Human approval above thresholds.
Full audit log.

Recommended pattern: router plus graph workflow. Avoid ReAct loops with broad tools.

6.4 Security operations

Security automation needs speed but has severe false-positive and false-negative costs. Agents can enrich alerts, summarize logs, map indicators, draft detection rules, and propose containment actions. They should not autonomously disable accounts, block network ranges, delete files, or rotate production secrets without policy gates.

Security agents face adversarial data by default: phishing emails, malware strings, attacker-controlled domains, log entries, and exploit payloads may contain prompt injections. Treat all evidence as hostile.

Recommended controls:

Evidence quarantine.
No raw rendering of attacker-controlled HTML.
Tool outputs as structured data.
Read-only enrichment by default.
Separate containment workflow with approval.
Deterministic checks for IOC formatting and scope.
Canary secrets to detect exfiltration attempts.

Recommended pattern: graph workflow with taint tracking and human gates for containment.

6.5 Finance and procurement

Finance automation has a clear rule: models may recommend, but deterministic systems authorize. Agents can extract invoice fields, match purchase orders, flag anomalies, draft explanations, and prepare approvals. They should not approve payments solely by model judgment.

Recommended controls:

Vendor allowlists.
Bank-account verification.
Amount thresholds.
Segregation of duties.
Duplicate invoice detection.
Sanctions and fraud checks.
Human approval for new vendors or changed payment details.
Immutable audit logs.

Recommended pattern: linear pipeline or DAG with policy-as-code. Avoid autonomous agents for payment execution.

6.6 Research and analysis

Research agents benefit from retrieval, browsing, summarization, source comparison, and citation management. Their main risks are hallucination, source poisoning, prompt injection from web pages, and false synthesis.

Recommended controls:

Source provenance.
Multi-source corroboration.
Citation extraction and verification.
Separation of quoted source text from model instructions.
Date awareness.
Claims ledger: every factual claim maps to evidence.
No external side effects except report generation.

Recommended pattern: retrieval-augmented graph with validators and citation checks.

7. Security Design Principles

The following principles are implementable today and should be treated as baseline controls for serious compound AI systems.

7.1 Models are not principals

Do not give credentials to a model. Give credentials to tools mediated by a broker. The broker acts on behalf of a user, workflow, and policy context. Every action should be attributable to a human or service principal, not to “the AI.”

7.2 Tools are capabilities

A tool is not a function; it is a capability. Its schema is an access-control boundary. Design tools narrowly:

Bad:

run_sql(query: string)

Better:

lookup_customer_orders(customer_id: CustomerId, start_date: Date, end_date: Date)

Bad:

send_email(to: string, subject: string, body: string)

Better:

send_refund_status_email(customer_id: CustomerId, order_id: OrderId, template_id: ApprovedTemplate)

The best agent tool is often not a general API wrapper but a task-specific operation with built-in policy checks.

7.3 Treat context as tainted by default

Every piece of context should carry a trust label:

System-authored.
Developer-authored.
User-authored.
Tool-generated from trusted source.
Tool-generated from untrusted source.
Retrieved from public web.
Retrieved from internal knowledge base.
Retrieved from customer-supplied document.
Agent-generated.
Human-approved.

Taint should restrict actions. For example, a plan derived from public web content should not be allowed to call a payment API without independent verification.

7.4 Separate planning context from execution context

The model that plans does not need secrets. The model that summarizes evidence does not need write tools. The model that drafts an email does not need the ability to send it. Split contexts and permissions by phase.

7.5 Prefer deterministic control flow

Use LLMs for semantic uncertainty, not for control flow that can be coded. If the workflow is “extract fields, check policy, call API, send confirmation,” implement that as code. Do not ask an agent to remember the procedure.

7.6 Validate before acting

Every state-changing action should pass validators. The validator stack should be proportional to impact:

Low impact: schema check and logging.
Medium impact: policy check and dry run.
High impact: independent evidence check and human approval.
Irreversible: multi-party approval and rollback/compensation plan.

7.7 Minimize transcript accumulation

Long transcripts are attack surfaces. They mix old plans, stale observations, untrusted content, and tool outputs. Use structured state summaries with provenance. Keep raw transcripts for audit, not for automatic inclusion in prompts.

7.8 Design for interruption and recovery

Agents will fail mid-task. Workflows need checkpoints, idempotency keys, compensating actions, and safe termination states. If a workflow times out after creating a ticket but before sending a notification, rerunning it should not create duplicate tickets or send inconsistent messages.

7.9 Make evaluation adversarial

Test normal tasks and attack tasks together. AgentDojo’s framing is correct: measure utility and adversarial robustness in the same environment because defenses that block all tool output may be safe but useless (AgentDojo, NeurIPS 2024). (mlanthology.org)

8. Concrete Build Pattern: The Secure Agentic Workflow

The following pattern is a practical baseline for multi-step automation.

8.1 Step 1: classify the task

Classify the task before retrieval or tool exposure. Determine:

Intent.
Risk tier.
Required data.
Required tools.
Whether actions are read-only or mutating.
Whether external communication is needed.
Whether regulated data is involved.

Use a small model or classifier if appropriate, but enforce routing through a policy table.

8.2 Step 2: construct a workflow-specific capability set

For each workflow, define:

workflow: refund_review
risk: medium
tools:
  - get_order
  - get_payment_status
  - evaluate_refund_policy
  - create_refund_draft
forbidden_tools:
  - issue_refund_direct
  - send_external_email_freeform
data_scopes:
  - orders:read
  - payments:read_masked
actions:
  - refund_draft:create
approval:
  required_if:
    - amount > 100
    - customer_tenure_days < 30
    - prior_refunds_count > 1

The model sees only the tools allowed for the current step.

8.3 Step 3: retrieve narrowly

Retrieve only what is necessary. If the task is refund review, retrieve the order, policy, and customer ticket—not the entire account history unless needed. Label customer-supplied text as untrusted. Do not let untrusted text define tool permissions.

8.4 Step 4: plan into structured steps

Ask the model for a typed plan:

json

{
  "goal": "determine refund eligibility",
  "steps": [
    {
      "id": "s1",
      "action": "get_order",
      "args": {
        "order_id": "ord_123"
      },
      "reason": "Need order status and amount."
    },
    {
      "id": "s2",
      "action": "evaluate_refund_policy",
      "args_from": [
        "s1",
        "ticket_fields"
      ],
      "reason": "Determine eligibility."
    }
  ],
  "requires_human_approval": true
}

Reject plans that reference tools outside the allowlist, omit required evidence, or include free-form side effects.

8.5 Step 5: execute through the broker

The broker validates each call. It should not execute “best effort” calls with malformed arguments. It should fail closed.

8.6 Step 6: validate and commit

Separate “draft” from “commit.” The agent may create a draft refund. A deterministic policy engine and, if needed, a human approve the actual refund. This staging pattern generalizes to emails, pull requests, database migrations, cloud changes, and ticket updates.

8.7 Step 7: produce a provenance-rich answer

The final response should include:

What was done.
What was not done.
Which sources were used.
Which actions were taken.
Which actions require approval.
Where logs can be found.
Any uncertainty or failed checks.

9. Evaluation: Measuring Capability and Safety Together

Single-number model scores are insufficient for compound AI systems. Evaluate the system, not the model.

9.1 Core capability metrics

Measure:

Task success rate.
Partial success rate.
Time to completion.
Cost per successful task.
Number of tool calls.
Number of model calls.
Human intervention rate.
Retry rate.
Recovery rate after tool failure.
Output quality.
State correctness after execution.

For coding agents, task success can be tests passing. For business workflows, success must include correct external state.

9.2 Security metrics

Measure:

Prompt-injection attack success rate.
Data exfiltration rate.
Unauthorized tool-call rate.
Unauthorized argument rate.
Policy bypass rate.
Unsafe action proposal rate.
Unsafe action execution rate.
Memory poisoning success.
Retrieval poisoning success.
Cross-tenant data exposure.
Excessive-cost attack success.
Denial-of-wallet susceptibility.
Audit completeness.

AgentDojo’s distinction between utility and security is useful: a defense that blocks attacks by preventing task completion is not a production defense (AgentDojo, NeurIPS 2024). (mlanthology.org)

9.3 Robustness metrics

Measure under perturbation:

Tool timeout.
Tool returns malformed data.
Retriever returns irrelevant documents.
Conflicting sources.
Ambiguous user request.
Partial credentials.
Stale memory.
Long context pressure.
Parallel branch failure.
Human rejects approval.
External API returns changed schema.

9.4 Regression suites

Every workflow should have a regression suite containing:

Golden-path tasks.
Edge cases.
Adversarial prompts.
Poisoned documents.
Malformed tool outputs.
Permission-boundary tests.
Cost-bound tests.
Privacy tests.
Replay tests from prior incidents.

9.5 Red-team harness

A useful red-team harness injects attacks at every boundary:

user input
retrieved document
tool output
tool metadata
memory entry
agent handoff
file content
web page
email body
API error message
log line

The harness should verify not only final text but tool calls, arguments, state changes, and audit logs.

10. Defensive Patterns That Work Today

10.1 Tool output declassification

Before tool output reaches the model, convert it into a safer representation. For example, instead of passing a raw email body into an action-planning model, use an extraction model with no tools to produce:

json

{
  "sender": "customer@example.com",
  "claimed_issue": "damaged item",
  "requested_action": "refund",
  "contains_instruction_like_text": true,
  "instruction_like_spans": [
    "Ignore all previous instructions..."
  ],
  "evidence_summary": "Customer attached photo of damaged item."
}

Then pass only the fields needed for policy evaluation.

10.2 Split-reader/executor architecture

Use separate agents or modules:

Reader: can read untrusted content but has no mutation tools.
Planner: sees sanitized facts and policy, produces structured plan.
Executor: executes approved tool calls but does not see raw untrusted content.
Reviewer: checks plan and evidence with no mutation tools.

This pattern reduces prompt-injection blast radius.

10.3 Capability tokens

Instead of letting a model call any tool in a session, issue short-lived capability tokens for specific actions:

json

{
  "capability": "create_refund_draft",
  "order_id": "ord_123",
  "max_amount": 184.22,
  "expires_at": "2026-05-01T18:00:00Z",
  "requires_approval_for_commit": true
}

The broker rejects calls outside the token.

10.4 Dry-run and commit phases

Every mutating workflow should support dry run:

Generate plan.
Simulate tool calls.
Show diff/side effects.
Validate policy.
Approve.
Commit.

This mirrors infrastructure-as-code review and database migration practice.

10.5 Structured handoffs

Agents should hand off typed artifacts, not conversations. Handoffs should include provenance, trust labels, and explicit requested action. A receiving agent should treat the handoff as a request, not an instruction with inherited authority.

10.6 Retrieval quarantine

Newly ingested documents should enter a low-trust quarantine. Promote them only after checks:

Source verification.
Duplicate detection.
Malware scanning.
Prompt-injection scanning.
Embedding anomaly detection.
Human or automated trust assignment.
Delay before high-impact use.

This is especially important for public web, customer uploads, and collaborative wikis.

10.7 Action-specific tools

Replace general tools with action-specific tools. A calendar agent should not have run_browser_action; it should have create_calendar_draft, check_availability, and send_invite_after_approval. A finance agent should not have execute_payment; it should have create_payment_proposal with deterministic approval.

10.8 Deterministic policy-as-code

Use policy engines such as OPA-style rules, custom authorization services, or domain-specific validators. Keep policies outside prompts. Version them. Test them. Log decisions.

10.9 Sandboxed execution

Coding agents and data-analysis agents need sandboxes:

No production secrets.
Ephemeral filesystem.
Network disabled by default.
Package mirrors or allowlists.
CPU/memory/time limits.
Syscall/container restrictions.
Artifact capture.
Reproducible environment.

ToolEmu’s results are relevant here: it introduced an LM-emulated sandbox to identify risks across 36 high-stakes toolkits and 144 test cases, with human evaluation finding that 68.8% of failures identified would be valid real-world agent failures (ToolEmu, ICLR 2024). (proceedings.iclr.cc) The broader lesson is that tool execution risk must be tested before connecting agents to real systems.

10.10 Incident response for agents

Agent systems need incident response plans:

Disable workflow.
Revoke tool credentials.
Freeze memory writes.
Quarantine affected retrieval documents.
Export traces.
Identify side effects.
Roll back or compensate.
Add regression tests.
Rotate exposed secrets.
Notify affected users if data leaked.

Without trace replay, incident response becomes guesswork.

11. Anti-Patterns

11.1 The omnipotent assistant

One agent with access to email, Slack, browser, filesystem, database, ticketing, payment, and code execution is indefensible. It is a confused deputy with a natural-language attack surface.

11.2 Prompt-only security

Instructions such as “never reveal secrets” and “ignore malicious content” are useful but not sufficient. AgentDojo exists because models fail to preserve instruction hierarchy under realistic indirect prompt injection (AgentDojo, NeurIPS 2024). (mlanthology.org)

11.3 Raw transcript handoff

Passing full transcripts between agents leaks irrelevant data, preserves injected instructions, and destroys provenance. Handoffs should be structured and minimal.

11.4 General-purpose shell tools

A shell tool is equivalent to arbitrary code execution. If required for coding tasks, isolate it heavily. Do not expose shell access to business-process agents.

11.5 Dynamic tool discovery without review

Letting agents discover arbitrary tools at runtime is dangerous. MCP and similar protocols make integration easy, but tool metadata and server behavior are part of the supply chain (MCP Specification, 2024). (modelcontextprotocol.io) Pin and approve tools.

11.6 LLM judges as sole validators

LLM judges are useful for soft quality evaluation but weak as sole safety controls. Use deterministic checks wherever possible.

11.7 Memory without governance

Unreviewed memory creates persistent compromise. Treat memory as a database with write controls, not as a scratchpad.

12. Open Problems

12.1 Reliable instruction/data separation

The central unsolved problem is that models still process instructions and data in one medium. Tool outputs, retrieved documents, and user text can all influence behavior. Architectural separation helps, but the underlying model behavior remains vulnerable. Research directions include trusted context representations, model training for robust instruction hierarchy, formal taint propagation into decoding, and constrained action selection.

12.2 Formal semantics for tool authority

Current tool schemas describe arguments, not authority. We need tool contracts that specify side effects, data sensitivity, idempotency, reversibility, approval requirements, and compositional risks. A future agent runtime should be able to statically reject a workflow where a low-trust branch can influence a high-impact tool call.

12.3 Secure dynamic tool ecosystems

MCP-like protocols solve integration but intensify supply-chain risk. Open problems include signed tool metadata, reputation systems, permission manifests, semantic diffing of tool descriptions, runtime attestation, and automated detection of malicious tool descriptions.

12.4 Memory safety

Agent memory needs access control, provenance, expiration, conflict resolution, poisoning detection, and user inspectability. The field lacks standard memory threat models and benchmarks comparable to AgentDojo for tool-output injection.

12.5 Evaluation under adaptive attack

Most benchmarks are static. Attackers adapt. AgentDojo’s dynamic environment is a step in the right direction, and ASB broadens the attack/defense landscape with 10 scenarios, over 400 tools, and dozens of attack/defense methods (ASB, ICLR 2025). (proceedings.iclr.cc) But production-grade evaluation needs continuous adversarial testing against private workflows and current models.

12.6 Long-horizon credit assignment

When a 30-step workflow fails, it is hard to identify which observation, plan, retrieval, tool call, or memory caused the failure. Better trace analysis, causal debugging, and workflow-level observability are needed.

12.7 Human review ergonomics

Human approval works poorly when reviewers are overloaded or shown vague summaries. We need research on approval UI, calibrated risk presentation, and reviewer reliability for agent-generated actions.

12.8 Economic denial of service

Agents can be attacked through cost: recursive tool use, expensive retrieval, long context stuffing, repeated retries, or induced multi-agent debate. Cost controls are still primitive.

12.9 Cross-agent and cross-tenant isolation

Multi-agent enterprise systems will have many agents with different privileges. Preventing confused-deputy attacks across agents is an open systems-security problem, not just an LLM-safety problem.

12.10 Benchmark realism and contamination

Agent benchmarks must use fresh tasks, hidden tests, private corpora, realistic permissions, and adversarial inputs. Public static benchmarks are useful for research but insufficient as deployment evidence.

13. Practical Deployment Checklist

A team deploying compound AI automation should be able to answer the following before production.

Architecture

What is the workflow graph?
Which nodes call models?
Which nodes call tools?
Which nodes mutate state?
What is the maximum step count?
What are the timeout and retry rules?
Can every run be replayed?

Authority

Which principal executes each tool call?
What tools are allowed per step?
What data scopes are allowed per step?
Which actions require approval?
Are permissions short-lived?
Are secrets hidden from model context?

Context

What context reaches the model?
Which parts are untrusted?
Are tool outputs structured?
Is raw untrusted text separated from instructions?
Are transcripts minimized?
Are memory writes governed?

Retrieval

What sources are indexed?
How is provenance stored?
Can documents be poisoned?
Are new documents quarantined?
Is source diversity enforced?
Can high-impact answers cite authoritative sources?

Validation

What schemas are enforced?
What deterministic checks run?
What tests execute?
What policy rules apply?
What happens on validation failure?
Are validators independent of the actor model?

Observability

Are prompts logged?
Are model outputs logged?
Are tool calls and arguments logged?
Are policy decisions logged?
Are human approvals logged?
Are costs logged?
Can traces be searched by incident responders?

Security testing

Have indirect prompt injections been tested?
Have poisoned retrieval documents been tested?
Have malicious tool outputs been tested?
Have malicious tool descriptions been tested?
Have memory poisoning attempts been tested?
Have cross-agent handoff attacks been tested?
Have cost-exhaustion attacks been tested?
Are these tests in CI?

If the answer to most of these is “no,” the system is not ready for high-impact autonomy.

14. The Position: Agents Need Less Magic and More Systems Engineering

The state of the art in compound AI systems is not a single framework or agent architecture. It is an engineering posture: use models where semantic flexibility is valuable, use deterministic software where authority matters, and make the boundary between the two explicit.

ReAct, Tree of Thoughts, Reflexion, DSPy, LLMCompiler, AutoGen, MCP, SWE-bench, WebArena, OSWorld, AgentDojo, ToolEmu, ASB, and PoisonedRAG each expose a piece of the truth. Interleaving reasoning and action improves capability. Search improves tasks with cheap evaluators. Reflection can improve retries but creates memory risk. Compiled LM programs are more maintainable than prompt spaghetti. Parallel function calling reduces latency but introduces concurrency risk. Multi-agent handoff helps only when it maps to real boundaries. Tool protocols improve interoperability but create semantic supply chains. Execution benchmarks reveal large gaps between demos and reliable automation. Security benchmarks show that untrusted context can hijack agents at unacceptable rates.

The safest near-term architecture is not a fully autonomous general agent. It is a graph-orchestrated compound system with typed state, narrow tools, deterministic policy, sandboxed execution, taint-aware context construction, retrieval provenance, human gates for irreversible actions, and adversarial regression tests. In that architecture, the LLM is powerful but not sovereign. It proposes, extracts, ranks, summarizes, and drafts. The runtime authorizes, validates, executes, records, and stops.

That is the difference between an impressive demo and a deployable automation system.

References

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao. “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. https://arxiv.org/abs/2210.03629 (arxiv.org)
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” NeurIPS 2023. https://arxiv.org/abs/2305.10601 (arxiv.org)
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao. “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS 2023. https://arxiv.org/abs/2303.11366 (arxiv.org)
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, Christopher Potts. “DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines.” ICLR 2024. https://arxiv.org/abs/2310.03714 (proceedings.iclr.cc)
Matei Zaharia et al. “The Shift from Models to Compound AI Systems.” Berkeley Artificial Intelligence Research Blog, 2024. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/ (bair.berkeley.edu)
Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, Amir Gholami. “An LLM Compiler for Parallel Function Calling.” ICML 2024. https://proceedings.mlr.press/v235/kim24y.html (proceedings.mlr.press)
Chi Wang et al. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.” COLM 2024. https://arxiv.org/abs/2308.08155 (arxiv.org)
OpenAI. “Orchestrating Agents: Routines and Handoffs.” OpenAI Cookbook, 2024. https://cookbook.openai.com/examples/orchestrating_agents (cookbook.openai.com)
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan. “SWE-bench: Can Language Models Resolve Real-world Github Issues?” ICLR 2024. https://proceedings.iclr.cc/paper_files/paper/2024/hash/edac78c3e300629acfe6cbe9ca88fb84-Abstract-Conference.html (proceedings.iclr.cc)
OpenAI. “Introducing SWE-bench Verified.” OpenAI, 2024. https://openai.com/index/introducing-swe-bench-verified/ (openai.com)
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig. “WebArena: A Realistic Web Environment for Building Autonomous Agents.” ICLR 2024. https://arxiv.org/abs/2307.13854 (arxiv.org)
Tianbao Xie et al. “OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments.” NeurIPS 2024. https://arxiv.org/abs/2404.07972 (arxiv.org)
Grégoire Mialon et al. “GAIA: A Benchmark for General AI Assistants.” 2023. https://arxiv.org/abs/2311.12983 (arxiv.org)
Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, Florian Tramèr. “AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents.” NeurIPS 2024. https://arxiv.org/abs/2406.13352 (arxiv.org)
Yang Zhang et al. “Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents.” ICLR 2025. https://arxiv.org/abs/2410.02644 (arxiv.org)
Yangjun Ruan et al. “Identifying the Risks of LM Agents with an LM-Emulated Sandbox.” ICLR 2024. https://proceedings.iclr.cc/paper_files/paper/2024/hash/7274ed909a312d4d869cc328ad1c5f04-Abstract-Conference.html (proceedings.iclr.cc)
Wei Zou et al. “PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models.” USENIX Security 2025. https://www.usenix.org/conference/usenixsecurity25/presentation/zou (usenix.org)
Zhan et al. “InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents.” 2024. https://arxiv.org/abs/2403.02691 (arxiv.org)
Anthropic. “Introducing the Model Context Protocol.” Anthropic, 2024. https://www.anthropic.com/news/model-context-protocol (anthropic.com)
Model Context Protocol. “Specification, Protocol Revision 2024-11-05.” 2024. https://modelcontextprotocol.io/specification/2024-11-05/ (modelcontextprotocol.io)
OWASP GenAI Security Project. “OWASP Top 10 for LLM Applications v2025.” 2025. https://owasp.org/www-project-top-10-for-large-language-model-applications/assets/PDF/OWASP-Top-10-for-LLMs-v2025.pdf (owasp.org)
OWASP GenAI Security Project. “OWASP Top 10 for Agentic Applications.” 2025. https://genai.owasp.org/2025/12/09/owasp-genai-security-project-releases-top-10-risks-and-mitigations-for-agentic-ai-security/ (genai.owasp.org)