Tool-Use Reliability for LLM Agents: Function Calling, Structured Outputs, and the Boundary Between Valid JSON and Safe Action
Tool-use reliability is the problem of making a probabilistic language model produce the right machine-consumable action, with the right arguments, at the right time, under adversarial and changing conditions. It is not solved by “function calling,” “JSON mode,” or better prompting; those only address slices of a deeper interface problem spanning schema adherence, tool selection, argument grounding, execution-state consistency, authorization, and recovery.

The stakes are now concrete: agentic coding tools and cloud-connected assistants have already caused destructive production incidents, including a July 2025 Replit AI coding-agent incident in which a live database with roughly 1,200 executive records and about 1,200 company records was deleted, and an April 24, 2026 PocketOS/Railway incident reported as a production database and volume-level backup deletion in one cloud API call. (oecd.ai)

The empirical literature shows why these incidents are not anomalies. τ-bench found that state-of-the-art function-calling agents such as GPT-4o completed under 50% of realistic multi-turn retail and airline tasks, with pass^8 below 25% in retail (τ-bench, arXiv 2024). (arxiv.org) AgentDojo introduced 97 realistic agent tasks and 629 security test cases for tool-using agents operating over untrusted data, and concluded that both attacks and defenses remain incomplete (AgentDojo, NeurIPS 2024). (openreview.net) OpenAI’s Structured Outputs reported 100% adherence on internal complex JSON-schema evals for gpt-4o-2024-08-06, compared with less than 40% for gpt-4-0613, showing that constrained decoding can close the syntactic interface gap while leaving semantic and authority failures open (OpenAI Structured Outputs, 2024). (openai.com)

This paper delivers a durable engineering model for tool-use reliability: a taxonomy of failure modes, a reading of the benchmark evidence, a defense architecture that separates syntax from semantics and authority, and a practical evaluation program for production agents.
The Core Claim: Tool Use Is a Distributed Systems Boundary, Not a Prompting Feature
The mistake that keeps recurring in production agent systems is to treat tool use as a model feature. A model “supports tools” if it can emit a tool name and JSON arguments. A platform “supports function calling” if it can parse that JSON and dispatch it. A demo “works” if a happy-path tool call returns the expected result. None of those facts establish reliability.
A tool call is a distributed systems boundary. One side is an uncertain planner trained to predict text. The other side is a deterministic or semi-deterministic software system with permissions, data, side effects, invariants, and failure modes. Function calling is the serialization protocol between them. Structured output enforcement is the grammar layer. Tool-use reliability is the complete end-to-end property: the user’s intent is translated into a valid, authorized, state-consistent, semantically correct operation, and failures are contained.
The reliability stack has five layers:
- Syntactic validity: the model emits parseable JSON, an OpenAI-style tool call, Python/JavaScript/REST syntax, or another expected representation.
- Schema validity: the emitted object satisfies the declared schema: required fields, enum constraints, array shape, nested object structure, and type constraints.
- Semantic validity: the call means the right thing: correct tool, correct entity, correct units, correct time horizon, correct defaults, correct disambiguation.
- State validity: the call is valid given the environment state: credentials, database state, prior turns, already-executed actions, idempotency, pending confirmations, and side effects.
- Authority validity: the call is allowed: least privilege, policy compliance, user consent, data-flow constraints, and defenses against indirect prompt injection.
Structured output enforcement primarily improves layers 1 and 2. Function-calling fine-tuning improves layers 1 through 3. Agent benchmarks such as τ-bench, ToolSandbox, ToolEmu, and AgentDojo stress layers 3 through 5. Production incidents happen when layers 4 and 5 are delegated to the same model that is already uncertain at layer 3.
The central design position of this paper is simple: never use the model’s well-formed output as evidence that the action is correct or authorized. Valid JSON is a transport property. It is not a safety property.
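To make the layering concrete, the sketch below shows a gate that treats a well-formed tool call as the input to validation, never as evidence of correctness. All names here (GateResult, the validator callables) are hypothetical; real runtimes and frameworks will differ.

```python
import json
from dataclasses import dataclass

# Hypothetical sketch: run a proposed tool call through the five layers.
# The planner's confidence is never used as evidence at any layer.

@dataclass
class GateResult:
    allowed: bool
    layer: str      # which layer rejected the call, or "all" if allowed
    reason: str = ""

def gate_tool_call(raw_output, schema_validator, semantic_validator,
                   state_validator, authority_validator):
    # Layer 1: syntactic validity -- can we parse the output at all?
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as e:
        return GateResult(False, "syntax", str(e))

    # Layer 2: schema validity -- does it conform to the declared tool schema?
    if not schema_validator(call):
        return GateResult(False, "schema", "does not conform to tool schema")

    # Layers 3-5: semantic, state, and authority checks run in deterministic
    # code, independent of the model that produced the call.
    for layer, check in (("semantic", semantic_validator),
                         ("state", state_validator),
                         ("authority", authority_validator)):
        ok, reason = check(call)
        if not ok:
            return GateResult(False, layer, reason)
    return GateResult(True, "all")
```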
What Function Calling Actually Guarantees
Function calling gives the model a constrained action vocabulary. The application provides a list of tools, usually with names, descriptions, and JSON schemas. The model emits a tool invocation rather than a free-form answer. The runtime parses the invocation, executes the corresponding function, and feeds the result back to the model.
This is powerful because it converts natural-language ambiguity into a software boundary. It is dangerous because developers often mistake the boundary for a contract.
A function-calling API can guarantee several useful things, depending on implementation:
- the model response is recognized as a tool call rather than ordinary text;
- the selected tool name is drawn from the declared tool set;
- the argument object is syntactically valid JSON;
- the argument object conforms to some subset of JSON Schema;
- the model can choose not to call a tool when no tool is relevant;
- the runtime can return tool results in a role-separated message.
It usually does not guarantee:
- that the chosen tool is the correct tool;
- that the arguments are semantically equivalent to user intent;
- that values are canonicalized correctly;
- that the tool result is safe to place into the model context;
- that the model will ignore malicious instructions embedded in tool results;
- that a destructive action is authorized;
- that repeated calls are idempotent;
- that multi-turn state remains coherent.
The distinction matters because many real deployments have crossed the line from “assistant that can query weather” to “agent that can mutate cloud resources, databases, code repositories, calendars, payment systems, CRM records, and production infrastructure.” Once the tool has side effects, a false positive is not a bad answer. It is an incident.
A Failure Taxonomy for Tool-Using Agents
Tool-use failures are easier to reason about when separated by where they enter the pipeline. The categories below are not mutually exclusive; severe incidents usually combine several.
1. Tool Selection Errors
A tool selection error occurs when the model chooses the wrong function from the candidate set. This becomes more likely when tools are semantically adjacent: delete_volume, detach_volume, restore_backup, and recreate_environment may all appear relevant during an infrastructure repair task.
The robustness literature confirms this sensitivity. In On the Robustness of Agentic Function Calling, Rabinovich and Anaby-Tavor expanded a BFCL-derived toolkit with semantically related functions. Across evaluated models, original AST accuracy often exceeded 90%, but adding related tools produced measurable drops, including Granite3.1-8B falling from 0.945 to 0.870 and GPT-4o-mini from 0.925 to 0.870; the authors found failures split across wrong function selection, wrong number of functions, and wrong parameter assignment (On the Robustness of Agentic Function Calling, TrustNLP@NAACL 2025). (ar5iv.org)
This result is more important than the absolute numbers. It shows that tool-calling accuracy measured with small, clean, unrelated tool sets can overstate production reliability. Real systems have overlapping APIs, legacy names, redundant endpoints, and vendor-specific semantics.
2. Argument Extraction and Slot-Filling Errors
The model selects the right tool but fills the wrong arguments. Common causes include:
- entity aliasing: “Miami, FL” versus “Miami, Florida”;
- unit confusion: dollars versus cents, meters versus feet, UTC versus local time;
- date ambiguity: “next Friday,” “end of quarter,” “today”;
- enum mismatch: “urgent” versus HIGH;
- missing required fields silently invented from context;
- defaults inferred incorrectly;
- partial arguments for multi-entity requests.
Rabinovich and Anaby-Tavor found that meaning-preserving request rephrasings caused notable benchmark performance drops, and that 70–90% of errors in that setting stemmed from parameter-value mismatch rather than tool selection (On the Robustness of Agentic Function Calling, TrustNLP@NAACL 2025). (ar5iv.org) That is partly a benchmark problem—exact-match evaluation can reject semantically equivalent values—but it is also a production problem. Databases, APIs, and payment systems are exact-match systems unless developers build canonicalization layers.
3. Schema and Serialization Failures
The model chooses the correct tool and roughly correct arguments but emits invalid structure: malformed JSON, wrong nesting, missing required keys, extraneous properties, stringified arrays, invalid enum values, or invalid escaping. These were historically common enough that entire open-source ecosystems emerged around parsing, retrying, repairing, and constraining output.
OpenAI’s Structured Outputs announcement is a major milestone because it moved strict schema adherence from “prompt harder and retry” into provider-level constrained generation. The reported result—100% on OpenAI’s complex JSON-schema eval for gpt-4o-2024-08-06 with Structured Outputs, versus less than 40% for gpt-4-0613—is exactly the kind of improvement production systems needed at layers 1 and 2 (OpenAI Structured Outputs, 2024). (openai.com) But the same announcement notes operational constraints: Structured Outputs were not compatible with parallel function calls in that release; generated parallel calls might not match supplied schemas, so developers were instructed to disable parallel tool calls when strict schemas mattered (OpenAI Structured Outputs, 2024). (openai.com)
The lesson is not “Structured Outputs are unreliable.” The lesson is that serialization guarantees are configuration- and feature-dependent. Reliability comes from knowing exactly which guarantees are active in the runtime path.
4. Tool-Result Misinterpretation
A model can correctly call a tool, receive a correct result, and still proceed incorrectly. Tool outputs are often compact, vendor-specific, partial, stale, or error-coded. Agents misread tool outputs when:
- a tool returns 200 OK with an application-level error;
- a search result contains misleading snippets;
- an API returns multiple matches and the model picks the first;
- a database returns stale state;
- a tool output contains instructions written by an attacker;
- a log line contains remediation commands as untrusted text;
- a function returns a warning that should block action.
This is where ordinary reliability merges into security. Tool outputs are not just data; when fed back into an LLM, they become prompt material.
5. State-Tracking Failures
State failures dominate multi-turn agents. The agent forgets a constraint, repeats an action, assumes a previous call succeeded, performs steps out of order, or fails to reconcile state after a tool error.
τ-bench was designed precisely to expose this class. It evaluates dynamic conversations between a simulated user and a tool-using agent, then compares the final database state against the annotated goal state. The benchmark introduced pass^k because single-run success hides inconsistency; in the original paper, state-of-the-art function-calling agents such as GPT-4o succeeded on less than 50% of tasks, and retail pass^8 was below 25% (τ-bench, arXiv 2024). (arxiv.org)
The pass^k result is devastating for deployment assumptions. A customer-support agent that completes a task once is not reliable if repeated equivalent attempts diverge. Production workflows need consistency across retries, not just peak capability.
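pass^k is also easy to compute from repeated trials. The sketch below uses the standard combinatorial estimator (the same shape as the familiar pass@k estimator): given c successes in n trials, the probability that k independent attempts all succeed is estimated as C(c,k)/C(n,k). Consult the τ-bench paper for its exact formulation.

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k: the chance that k i.i.d. attempts at a task all
    succeed, given c successes observed in n trials (n >= k)."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# An agent that succeeds on 6 of 8 trials looks strong at pass^1 but much
# weaker when required to succeed several times in a row.
print(pass_hat_k(8, 6, 1))  # 0.75
print(pass_hat_k(8, 6, 4))  # ~0.21
```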
6. Excessive Agency and Authorization Failures
The model correctly understands the task but takes an action it should not be allowed to take. This includes deleting production resources, emailing sensitive data, issuing refunds beyond policy, modifying access controls, or installing packages.
OWASP places Prompt Injection at LLM01 in its 2025 Top 10 for LLM Applications and treats agentic systems and tool use as part of the GenAI application risk surface (OWASP LLM Top 10, 2025). (owasp.org) The deeper issue is not only prompt injection. It is excessive agency: the agent has more authority than its task requires, and the system relies on the model to self-restrict.
The Replit and PocketOS incidents are archetypal excessive-agency failures: regardless of whether the immediate trigger was model confusion, tool design, credential scoping, cloud API behavior, or orchestration logic, the system permitted an AI-mediated workflow to reach destructive production operations. (oecd.ai)
7. Indirect Prompt Injection
Indirect prompt injection occurs when malicious instructions are embedded in content the agent retrieves or observes: webpages, emails, documents, tickets, calendar events, logs, pull requests, PDFs, spreadsheets, or tool outputs. The model treats those instructions as part of its context and may follow them.
AgentDojo formalizes this risk for tool-calling agents. It provides 97 realistic tasks and 629 security test cases across domains such as email, banking, and travel, explicitly targeting agents that execute tools over untrusted data (AgentDojo, NeurIPS 2024). (openreview.net) EchoLeak made the same class concrete in a production enterprise assistant: the paper describes CVE-2025-32711 as a zero-click prompt-injection vulnerability in Microsoft 365 Copilot that enabled remote, unauthenticated data exfiltration through a crafted email (EchoLeak, arXiv 2025). (arxiv.org)
OpenAI’s 2026 agent-security guidance frames modern prompt injection as closer to social engineering than simple string override. It reports a 2025 ChatGPT prompt-injection example that worked 50% of the time in testing for a specific email-analysis prompt, and argues that filtering alone is insufficient because advanced attacks require judging whether content is misleading in context (OpenAI, 2026). (openai.com)
The Benchmark Landscape: What Each Benchmark Measures and Misses
The field has moved from single-call API generation to multi-turn, stateful, adversarial, and policy-constrained evaluation. No single benchmark captures the full production problem. A mature evaluation program should combine several.
Gorilla and APIBench: API Call Generation at Scale
Gorilla was an early inflection point. Patil et al. connected a fine-tuned LLaMA-based model to large API collections and introduced AST-based evaluation for generated API calls. In the NeurIPS 2024 version, Gorilla’s zero-shot overall accuracy was reported as 59.13 on Torch Hub, 71.68 on HuggingFace, and 83.79 on TensorHub, with hallucination rates of 6.98, 10.95, and 5.40 respectively; with an oracle retriever, TensorHub overall accuracy reached 94.16 and hallucination dropped to 1.89 (Gorilla, NeurIPS 2024). (proceedings.nips.cc) The paper also validated AST matching against human evaluation on 100 sampled generations, with both giving 78% API-call accuracy and 72% end-to-end code executability (Gorilla, NeurIPS 2024). (proceedings.nips.cc)
Gorilla’s durable contribution is not that one model beat another. It is that tool-use evaluation needs executable or structurally grounded metrics. Free-form judging is too loose for API boundaries.
API-Bank: Planning, Retrieval, and Calling
API-Bank broadened tool evaluation beyond ML model registries. It provided 73 API tools, 314 annotated tool-use dialogues, 753 API calls, and a training set of 1,888 tool-use dialogues drawn from 2,138 APIs across 1,000 domains. The authors reported that Lynx, initialized from Alpaca, improved tool-use performance over Alpaca by more than 26 points and approached GPT-3.5 (API-Bank, EMNLP 2023). (arxiv.org)
API-Bank is useful because it separates planning, retrieval, and API calling. Production failures often happen at their interfaces: the model plans a plausible sequence, retrieves an inadequate tool, then fills arguments correctly for the wrong abstraction.
ToolLLM and ToolBench: Synthetic Scale
ToolLLM introduced ToolBench, collecting 16,464 real-world REST APIs across 49 RapidAPI categories, then using ChatGPT to generate instructions and solution paths covering single-tool and multi-tool scenarios. ToolLLaMA was fine-tuned on ToolBench and paired with a neural API retriever; the paper reports complex-instruction execution, generalization to unseen APIs, comparable performance to ChatGPT, and zero-shot generalization to APIBench (ToolLLM, ICLR 2024). (proceedings.iclr.cc)
ToolLLM’s lesson is that function calling is data-hungry. General instruction tuning is not enough. Models need examples of tool descriptions, argument schemas, multi-step tool chains, API errors, and tool retrieval.
The limitation is equally important: synthetic solution paths can inflate apparent competence if they do not reproduce live API brittleness, ambiguous user intent, state mutation, authorization, and adversarial tool results.
BFCL: The Standard Function-Calling Yardstick
The Berkeley Function Calling Leaderboard became the de facto benchmark for function calling. The ICML 2025 paper describes BFCL as evaluating serial and parallel calls, multiple programming languages, AST-based evaluation, abstention, and stateful multi-step agentic settings; the live leaderboard states that BFCL V4 evaluates function-calling ability with real-world data and periodic updates (BFCL, ICML 2025). (proceedings.mlr.press)
BFCL matters because it standardizes a core capability. But BFCL-style scores should be interpreted as interface-translation scores, not complete agent-reliability scores. High BFCL accuracy says a model can often map a request to a syntactically and structurally correct call. It does not prove safe behavior under untrusted observations, long-horizon state, or production permissions.
ToolSandbox: Stateful Interactive Tool Use
ToolSandbox targets a gap in single-turn evaluation: stateful, conversational, interactive tool use. It evaluates tasks involving state dependency, canonicalization, insufficient information, and tool databases. The authors report a significant gap between open-source and proprietary models and note that state dependency, canonicalization, and insufficient information remain challenging even for strong models (ToolSandbox, arXiv 2024 / NAACL Findings 2025). (arxiv.org)
ToolSandbox is valuable because it treats tool calls as transitions over state, not isolated strings. That is the right abstraction for calendars, reminders, contacts, tickets, shopping carts, cloud resources, and enterprise workflows.
τ-bench: Multi-Turn Policy Following and Database State
τ-bench is one of the most deployment-relevant benchmarks because it evaluates final database state after realistic user-agent conversations. It also introduces pass^k to measure consistency across repeated trials. The paper’s key result—that frontier function-calling agents such as GPT-4o completed less than half of tasks and had retail pass^8 below 25%—captures the gap between “can call tools” and “can run a business workflow reliably” (τ-bench, arXiv 2024). (arxiv.org)
The design choice to score final database state is crucial. Intermediate tool-call correctness is not enough. A customer does not care that the agent called five correct APIs if the final order, refund, or itinerary is wrong.
ToolEmu: Safety Failures Without Real-World Harm
ToolEmu uses an LLM-emulated sandbox to evaluate high-stakes tool-use risks. It introduced 36 high-stakes toolkits and 144 test cases, and human evaluation found that 68.8% of failures identified by ToolEmu would be valid real-world agent failures (ToolEmu, ICLR 2024). (openreview.net)
ToolEmu’s important contribution is economical safety testing. Many dangerous tools cannot be freely executed during evaluation. Emulation lets researchers test financial, privacy, and safety-sensitive actions without causing harm. The limitation is that emulator fidelity becomes part of the benchmark’s validity.
AgentDojo: Security Over Untrusted Data
AgentDojo is the benchmark to use when the question is “what happens when tool outputs contain attacker-controlled text?” It is dynamic rather than static, includes tasks and adaptive attacks, and evaluates both utility and security for agents executing tools over untrusted data (AgentDojo, NeurIPS 2024). (openreview.net)
AgentDojo’s conclusion is sobering: state-of-the-art models fail many tasks even without attack, and existing prompt-injection attacks break some but not all security properties. This means security evaluation cannot be detached from utility evaluation. A defense that blocks all tools is secure but useless; an agent that completes tasks while leaking data is useful but unsafe.
Structured Output Enforcement: What It Solves, What It Cannot Solve
Structured output enforcement is the most mature part of the tool-use reliability stack. It is also the most commonly overclaimed.
JSON Mode Was Never Enough
JSON mode asks or biases the model to produce valid JSON. It does not ensure conformance to a particular schema. OpenAI’s Structured Outputs announcement states this distinction directly: JSON mode improves reliability for valid JSON, but does not guarantee that the response conforms to a developer-supplied schema (OpenAI Structured Outputs, 2024). (openai.com)
For production systems, valid JSON without schema conformance is a partial failure. If the parser accepts the JSON but downstream code assumes missing fields exist, the application has merely moved the failure point.
Constrained Decoding Is the Right Primitive
Constrained decoding masks invalid tokens during generation so that the model can only produce outputs consistent with a grammar or schema. JSONSchemaBench describes constrained decoding as intervening in the decoding process by masking tokens that do not adhere to constraints, ensuring the final output conforms to the predefined structure (JSONSchemaBench, ICML ES-FoMo 2025). (openreview.net)
This is qualitatively different from prompting. Prompting asks the model to behave. Constrained decoding changes the set of outputs the model can emit.
JSONSchemaBench is the most useful recent evaluation of this layer. It introduced 10,000 real-world JSON schemas, paired them with the official JSON Schema Test Suite, and evaluated Guidance, Outlines, Llamacpp, XGrammar, OpenAI, and Gemini across efficiency, coverage, and quality. The authors report that constrained decoding can speed generation by 50% compared with unconstrained decoding, that the best framework supports twice as many schemas as the worst, and that constrained decoding improved downstream task performance by up to 4% (JSONSchemaBench, ICML ES-FoMo 2025). (arxiv.org)
The key engineering implication: do not ask whether a provider “supports JSON Schema.” Ask which JSON Schema features are actually supported, how unsupported features fail, whether the guarantee applies under streaming and parallel calls, and whether the schema compiler is in the exact serving path you use.
Schema Design Is an Instruction Channel
A subtle 2026 result makes schema design more interesting. Schema Key Wording as an Instruction Channel in Structured Generation under Constrained Decoding shows that changing only schema-key wording can substantially affect accuracy while holding the prompt, model, output structure, and decoding setup fixed. The paper argues that schema keys enter the autoregressive context and act as implicit instructions; Qwen models benefited more from schema-level instructions, while LLaMA models relied more on prompt-level guidance (Schema Key Wording, arXiv 2026). (arxiv.org)
This matters for production. Developers often name fields for internal convenience: x, flag, data, payload, val. In constrained generation, field names are not just labels; they are part of the model’s conditioning context. A schema with final_answer_after_checking_units may produce different behavior than one with answer, even if both validate the same JSON shape.
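A minimal illustration: the two schemas below validate exactly the same output shape, but under constrained decoding the key names become part of the model’s conditioning context. The field names are hypothetical.

```python
# Illustrative only: same output shape, different implicit instruction.

terse_schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
    "additionalProperties": False,
}

instructive_schema = {
    "type": "object",
    "properties": {
        "final_answer_after_checking_units": {"type": "string"},
    },
    "required": ["final_answer_after_checking_units"],
    "additionalProperties": False,
}
# Both accept a single string field, yet the second key reads as an implicit
# instruction during generation.
```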
Structured Outputs Do Not Enforce Meaning
A schema can require:
```json
{
  "tool": "delete_volume",
  "volume_id": "vol_123",
  "confirm": true
}
```

It cannot prove the user intended deletion, that vol_123 is staging rather than production, that backups are off-volume, that the action is reversible, or that the agent was not manipulated by a log line.
Structured output enforcement should therefore be treated as a necessary reliability primitive and a non-sufficient safety primitive.
Designing Robust Function-Calling Interfaces
The tool interface is the highest-leverage part of an agent system. A good model behind a bad tool interface will fail. A weaker model behind a narrow, typed, policy-enforced interface can be reliable for bounded tasks.
Tool Names Should Be Distinct, Verb-Noun, and Side-Effect Explicit
Ambiguous tool names increase selection errors. Prefer:
- get_customer_order_status
- create_draft_refund_request
- submit_refund_after_user_confirmation
- delete_staging_environment
- schedule_calendar_event
Avoid:
- process
- update
- handle_customer
- run_action
- manage_resource
- delete_resource
Side effects should appear in the tool name and description. If a tool mutates state, the model and validator should not have to infer that fact.
Split Read, Draft, and Commit Tools
A common anti-pattern is a single tool that both decides and acts:
resolve_customer_issue(customer_id, issue_description)
That tool hides too much. A robust interface decomposes the flow:
1. get_customer_context(customer_id)
2. classify_issue(issue_description)
3. create_refund_draft(customer_id, amount, reason)
4. validate_refund_policy(refund_draft_id)
5. submit_refund(refund_draft_id, user_confirmation_token)
The model can help draft. Deterministic code validates. A separate authority gate commits.
This is the same pattern used in safe payment flows, infrastructure-as-code plans, database migrations, and CI/CD approvals: generate a plan, show a diff, validate policy, then apply.
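A minimal sketch of that split, with hypothetical service names and in-memory stand-ins: the model is only ever offered the draft step; validation and commit are deterministic code paths with their own authority checks.

```python
import uuid

DRAFTS: dict[str, dict] = {}

def create_refund_draft(customer_id: str, amount_cents: int, reason: str) -> str:
    draft_id = f"draft_{uuid.uuid4().hex[:8]}"
    DRAFTS[draft_id] = {"customer_id": customer_id, "amount_cents": amount_cents,
                        "reason": reason, "validated": False}
    return draft_id  # structured data, no side effect yet

def validate_refund_policy(draft_id: str, max_cents: int = 5000) -> bool:
    draft = DRAFTS[draft_id]
    draft["validated"] = draft["amount_cents"] <= max_cents
    return draft["validated"]

def _token_is_valid(token: str) -> bool:
    # Stand-in for a policy-service-issued, expiring confirmation token.
    return token == "demo-token"

def submit_refund(draft_id: str, user_confirmation_token: str) -> None:
    draft = DRAFTS[draft_id]
    if not draft["validated"]:
        raise PermissionError("draft failed policy validation")
    if not _token_is_valid(user_confirmation_token):
        raise PermissionError("missing or expired confirmation token")
    print("refund committed:", draft)  # the only side-effecting step
```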
Use Narrow Enums and Opaque Identifiers
Free-form strings are where models hallucinate. If a value maps to an internal entity, the model should select from retrieved candidates with opaque IDs:
```json
{
  "customer_id": "cust_8f31",
  "order_id": "ord_19aa",
  "refund_reason": "DAMAGED_ITEM"
}
```

not:

```json
{
  "customer": "Sarah from last week",
  "order": "the damaged one",
  "reason": "it seems fair"
}
```

The retrieval layer should provide candidate IDs. The model should not invent them.
Canonicalization Belongs Outside the Model
The robustness results around “Miami, FL” versus “Miami, Florida” point to a general rule: canonicalization is software’s job. Dates, currencies, locations, product names, user identities, hostnames, cloud resource IDs, and legal entities should be normalized by deterministic or auditable services.
The model can propose a value. A canonicalizer should resolve it to a unique entity or return ambiguity.
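A sketch of what such a resolver can look like, with a hypothetical location table: it returns exactly one canonical ID or surfaces the ambiguity instead of guessing.

```python
# Hypothetical canonicalizer: deterministic lookup outside the model.
LOCATIONS = {
    "miami, fl": "loc_miami_fl_us",
    "miami, florida": "loc_miami_fl_us",
    "miami, oh": "loc_miami_oh_us",
}

def resolve_location(text: str) -> str:
    key = text.strip().lower()
    if key in LOCATIONS:
        return LOCATIONS[key]
    matches = {v for k, v in LOCATIONS.items() if key in k}
    if len(matches) == 1:
        return matches.pop()
    raise LookupError(f"ambiguous or unknown location: {text!r} "
                      f"({len(matches)} candidates)")

print(resolve_location("Miami, Florida"))  # loc_miami_fl_us
try:
    resolve_location("Miami")
except LookupError as e:
    print(e)  # ambiguous -> return the ambiguity to the agent as data
```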
Tool Descriptions Are Security-Relevant
Tool descriptions are part of the prompt. They can leak implementation details, credentials, database names, internal policies, or attack surface. They can also over-authorize the model by implying that it may decide policy.
Bad:
delete_volume: Deletes any Railway volume. Use this if the app is broken or credentials mismatch.
Better:
delete_staging_volume: Permanently deletes a staging-only volume. Never use for production resources. Requires a validated staging volume_id and a deletion_approval_token issued by the policy service.
Even better: do not expose deletion directly. Expose create_deletion_request, and require deterministic policy approval before deletion.
Avoid Giant Tool Belts
Large tool sets increase confusion, context length, and attack surface. Tool retrieval should be treated like search: recall matters, but precision is safety-critical. A production orchestrator should shortlist tools based on task type, user authority, environment, and current state.
A coding agent working in a repository should not automatically see cloud-destruction tools. A customer-service agent issuing refunds should not see internal HR lookup tools. A log-analysis agent should not see arbitrary shell execution unless sandboxed and separately authorized.
A Production Architecture for Tool-Use Reliability
A robust agent runtime should not be a loop that alternates “LLM → execute whatever tool call it produced → LLM.” It should be a mediated control system.
Intent Gateway
The intent gateway classifies the request before any tool is exposed. It should determine:
- task domain;
- user identity and role;
- environment: dev, staging, production;
- sensitivity: read-only, write, destructive, external transmission;
- required confirmation level;
- whether the task is appropriate for an agent.
This classification can use an LLM, but the output must be treated as advisory unless backed by deterministic policy. The gateway’s main purpose is to reduce the tool universe and permission set.
Context Builder
The context builder assembles only what the model needs. It should exclude secrets, broad credentials, irrelevant documents, and untrusted content unless necessary. It should mark provenance explicitly:
- trusted system policy;
- developer instructions;
- user request;
- retrieved internal data;
- external untrusted data;
- prior tool observations;
- attacker-controlled or low-trust content.
Role separation helps, but role separation alone is not a security boundary. The model still consumes all text.
Tool Router
The tool router selects candidate tools before the model sees them. It should optimize for high precision, not just recall. It should incorporate:
- user permissions;
- task domain;
- side-effect class;
- environment;
- current workflow state;
- tool dependency graph;
- whether untrusted data is in context.
For high-risk actions, the correct candidate set may be empty until the agent completes prerequisite read-only steps.
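A precision-first router can be ordinary code over tool metadata. In the hypothetical sketch below, write and destroy tools are simply absent from the shortlist until the context is safe, rather than relying on the model to avoid them.

```python
# Hypothetical tool metadata and routing rules.
TOOLS = [
    {"name": "get_customer_order_status", "effect": "read",    "domains": {"support"}},
    {"name": "create_refund_draft",       "effect": "draft",   "domains": {"support"}},
    {"name": "submit_refund",             "effect": "write",   "domains": {"support"}},
    {"name": "delete_staging_volume",     "effect": "destroy", "domains": {"infra"}},
]

def shortlist(domain: str, user_can_write: bool, untrusted_in_context: bool,
              prerequisites_done: bool) -> list[str]:
    out = []
    for t in TOOLS:
        if domain not in t["domains"]:
            continue
        if t["effect"] in ("write", "destroy"):
            if not user_can_write or untrusted_in_context or not prerequisites_done:
                continue  # candidate set stays read-only/draft until safe
        out.append(t["name"])
    return out

print(shortlist("support", user_can_write=True,
                untrusted_in_context=True, prerequisites_done=False))
# ['get_customer_order_status', 'create_refund_draft']
```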
Schema-Constrained Decoder
All tool calls should be produced under strict schema constraints where available. If the provider supports strict: true, use it. If running local models, use a constrained decoder such as Guidance, Outlines, XGrammar, llama.cpp grammars, or equivalent, but validate actual schema coverage against your schemas. JSONSchemaBench shows that framework coverage differs substantially across real-world schemas (JSONSchemaBench, ICML ES-FoMo 2025). (arxiv.org)
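Whichever path is used, it is cheap to re-validate in-process after decoding. A minimal sketch using the open-source jsonschema package (pip install jsonschema), with a hypothetical refund-argument schema:

```python
import json
from jsonschema import validate, ValidationError

REFUND_ARGS_SCHEMA = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "string", "pattern": "^cust_[a-f0-9]+$"},
        "amount_cents": {"type": "integer", "minimum": 1},
        "refund_reason": {"enum": ["DAMAGED_ITEM", "LATE_DELIVERY", "OTHER"]},
    },
    "required": ["customer_id", "amount_cents", "refund_reason"],
    "additionalProperties": False,
}

def parse_and_validate(raw: str) -> dict:
    args = json.loads(raw)                               # layer 1: syntax
    validate(instance=args, schema=REFUND_ARGS_SCHEMA)   # layer 2: schema
    return args

try:
    parse_and_validate('{"customer_id": "sarah", "amount_cents": 500, '
                       '"refund_reason": "it seems fair"}')
except ValidationError as e:
    print("rejected:", e.message)
```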
Semantic Validator
The semantic validator checks whether the call makes sense. This is not a duplicate parser. It should perform domain checks:
- Does the entity exist?
- Is the entity unique?
- Are units valid?
- Is the requested date resolved?
- Does the amount match policy?
- Is the tool appropriate for the user’s stated intent?
- Are there safer read-only or draft alternatives?
- Is the call idempotent?
- Has this action already been performed?
Some checks are deterministic. Some may use secondary models. But the validator must be independent of the planner’s original output.
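A hypothetical sketch of the shape of such a validator, with trivial in-memory stand-ins for the systems of record, so every claim in the call is re-checked rather than trusted:

```python
ORDERS = {"ord_19aa": {"customer_id": "cust_8f31", "paid_cents": 4200}}
REFUNDED: set[str] = set()

def semantic_problems(call: dict) -> list[str]:
    problems = []
    order = ORDERS.get(call.get("order_id", ""))
    if order is None:
        problems.append("order does not exist")
    elif order["customer_id"] != call.get("customer_id"):
        problems.append("order is not owned by this customer")
    elif call.get("amount_cents", 0) > order["paid_cents"]:
        problems.append("refund exceeds amount paid")
    if call.get("order_id") in REFUNDED:
        problems.append("refund already issued for this order")
    return problems  # empty == semantically plausible; authority comes next

print(semantic_problems({"order_id": "ord_19aa", "customer_id": "cust_8f31",
                         "amount_cents": 9999}))
# ['refund exceeds amount paid']
```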
Policy and Authority Validator
The policy validator enforces what the model cannot be trusted to enforce. It should handle:
- RBAC/ABAC permissions;
- environment separation;
- budget limits;
- data-loss-prevention rules;
- external transmission checks;
- destructive-action approvals;
- rate limits;
- separation of duties;
- change windows and freezes;
- compliance logging.
OpenAI’s agent-security guidance describes this in source-sink terms: dangerous outcomes often combine attacker-influenced sources with sinks such as transmitting information, following links, or using tools; the goal is to prevent dangerous actions or sensitive transmissions from happening silently (OpenAI, 2026). (openai.com)
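In code, these rules are ordinary deterministic checks over the validated action object and the request context; the model’s prose never enters the decision. A hypothetical sketch:

```python
from dataclasses import dataclass

@dataclass
class Ctx:
    user_role: str
    environment: str            # "dev" | "staging" | "production"
    untrusted_in_context: bool

def policy_violations(action: dict, ctx: Ctx) -> list[str]:
    violations = []
    if action["side_effect"] == "destroy" and ctx.environment == "production":
        violations.append("production destruction requires out-of-band approval")
    if action.get("amount_cents", 0) > 10_000 and ctx.user_role != "manager":
        violations.append("refunds over $100 require manager approval")
    if action["side_effect"] in ("write", "destroy") and ctx.untrusted_in_context:
        violations.append("mutation blocked while untrusted content is in context")
    return violations
```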
Sandboxed Executor
The executor should be least-privilege and environment-scoped. If the agent is working on staging, it should not possess production credentials. If it needs to inspect production, use read-only credentials. If it needs to modify production, require a separate approval token minted outside the model context.
The executor should support:
- dry runs;
- transactionality;
- idempotency keys;
- rollback or compensating actions;
- per-tool timeouts;
- per-tool rate limits;
- structured errors;
- immutable audit logs;
- side-effect classification.
Typed Observation
Tool results should be typed and minimized before returning to the model. Do not dump raw HTML, logs, emails, stack traces, or arbitrary documents into the planner context if a smaller typed result suffices.
Bad:
Here is the full webpage/email/log output...
Better:
```json
{
  "source": "external_email",
  "trust": "untrusted",
  "extracted_invoice_total": 1240.15,
  "extracted_due_date": "2026-05-30",
  "contains_external_links": true,
  "raw_content_available": false
}
```

If raw content must be shown, preserve taint labels and prevent it from authorizing tool calls.
Prompt Injection Defenses for Tool-Using Agents
Prompt injection is not a malformed-input problem in the same way SQL injection is. The attacker’s payload is natural language, and the model’s job is to follow natural language. Filtering can reduce risk, but it cannot provide complete separation between instructions and data.
StruQ: Train the Model to Respect Channels
StruQ proposes structured queries that separate prompts and data into two channels and fine-tunes models to follow instructions only in the prompt portion while ignoring instructions in the data portion. The USENIX Security 2025 paper reports significantly improved resistance to prompt injection with little or no utility impact (StruQ, USENIX Security 2025). (usenix.org)
StruQ is important because it addresses the model behavior directly. It does not rely only on wrappers. Its limitation is deployment: the model must be trained for the channel format, and production systems must preserve that format end to end.
CaMeL: Separate Control Flow from Untrusted Data
CaMeL takes a system-design approach. It extracts control and data flows from the trusted query so that untrusted data retrieved by the LLM cannot affect program flow, and it uses capabilities to prevent private-data exfiltration over unauthorized flows. The updated paper reports solving 77% of AgentDojo tasks with provable security, compared with 84% for an undefended system (CaMeL, arXiv 2025). (arxiv.org)
CaMeL is the clearest expression of the right security direction: do not ask the model to “be careful” with untrusted text; design the runtime so untrusted text cannot authorize control-flow changes.
Source-Sink Analysis and Taint Tracking
For production systems, the practical version of CaMeL is taint tracking:
- Mark external webpages, emails, documents, logs, tickets, and tool outputs as untrusted.
- Track whether sensitive data has entered the model context.
- Treat external transmission tools—email, HTTP requests, browser navigation, Slack messages, file sharing—as sinks.
- Block or require confirmation when sensitive data flows from trusted/private sources to untrusted/external sinks.
- Prevent untrusted text from satisfying approval conditions.
This is not theoretical. EchoLeak demonstrates that enterprise assistants can become data-exfiltration deputies when untrusted content and private context meet inside the same model loop (EchoLeak, arXiv 2025; CVE-2025-32711). (arxiv.org)
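The mechanical core of this pattern fits in a few lines. In the hypothetical sketch below, sink tools are blocked or demoted to confirmation-required based on taint labels accumulated in the model context:

```python
SINK_TOOLS = {"send_email", "http_request", "share_file", "post_message"}

def sink_decision(tool_name: str, context_labels: set[str]) -> str:
    if tool_name not in SINK_TOOLS:
        return "allow"
    if "untrusted" in context_labels and "sensitive" in context_labels:
        return "block"    # classic exfiltration shape, EchoLeak-style
    if "untrusted" in context_labels:
        return "confirm"  # attacker may be steering the transmission
    return "allow"

print(sink_decision("send_email", {"untrusted", "sensitive"}))  # block
```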
Human Confirmation Must Be Specific, Not Ceremonial
“Are you sure?” is weak. Users habituate to confirmations, and models can summarize actions misleadingly. A strong confirmation includes:
- exact resource identifiers;
- environment;
- side effects;
- reversibility;
- data leaving the system;
- diff or plan;
- policy reason;
- expiration-bound approval token.
For destructive actions, confirmation should be generated by deterministic code from the validated action object, not by the model.
Bad:
The agent wants to clean up the environment. Approve?
Better:
Approve permanent deletion of staging volume vol_stg_38a in project pocketos-staging.
This will delete 12.4 GB of data. No production resources are in scope.
Approval expires in 5 minutes. Type DELETE vol_stg_38a to continue.
For production deletion, a mature system should require out-of-band approval and verified backups, not chat confirmation.
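The “better” confirmation above can be rendered by deterministic code from the validated action object, so the model that proposed the action never gets to describe it. A hypothetical sketch:

```python
def render_confirmation(action: dict) -> str:
    # Deterministic rendering: every field comes from the validated action
    # object, none from model prose. Field names are hypothetical.
    assert action["environment"] != "production", \
        "production requires out-of-band approval"
    return (
        f"Approve permanent deletion of {action['environment']} volume "
        f"{action['volume_id']} in project {action['project']}.\n"
        f"This will delete {action['size_gb']} GB of data. "
        f"No production resources are in scope.\n"
        f"Approval expires in 5 minutes. "
        f"Type DELETE {action['volume_id']} to continue."
    )

print(render_confirmation({
    "environment": "staging", "volume_id": "vol_stg_38a",
    "project": "pocketos-staging", "size_gb": 12.4,
}))
```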
Evaluation: How to Know Whether an Agent Is Reliable
A production evaluation program should test each layer independently and then test the full loop.
Layer 1: Schema and Parser Evals
For every tool schema:
- generate valid and invalid examples;
- test provider strict mode;
- test streaming behavior;
- test parallel tool-call settings;
- test maximum nesting and arrays;
- test enum edge cases;
- test refusal paths;
- test truncation;
- test unsupported JSON Schema features.
If using constrained decoding, run a JSONSchemaBench-style coverage test against your actual schemas. Do not assume framework support from documentation alone.
Layer 2: Tool Selection Evals
Build a benchmark of realistic user requests with candidate tools. Include:
- no-tool cases;
- ambiguous cases;
- semantically adjacent tools;
- deprecated tools;
- read versus write alternatives;
- staging versus production variants;
- multi-tool requests;
- irrelevant distracting tools.
Report top-1 accuracy, abstention accuracy, and confusion matrices by tool family. Tool selection should be evaluated both with the full tool belt and with the router’s shortlisted tools.
Layer 3: Argument Evals
Argument evals should be entity-aware. Measure:
- exact match;
- canonical match;
- semantic equivalence;
- unit correctness;
- timezone correctness;
- date resolution;
- enum correctness;
- hallucinated optional fields;
- missing required fields;
- clarification behavior when ambiguous.
The “Miami, FL” issue from robustness work is a warning: exact-match evals can undercount correct semantic behavior, but exact-match production APIs can also reject reasonable semantic variants. The fix is canonicalization and evaluation against canonical entities.
Layer 4: Stateful Workflow Evals
For workflows, final state matters more than intermediate call similarity. Use τ-bench’s principle: compare the final database or environment state against the goal state. (arxiv.org)
Measure:
- pass^1 and pass^k;
- number of tool calls;
- unnecessary mutations;
- recovery after tool error;
- repeated-action idempotency;
- clarification quality;
- policy compliance;
- final-state diff.
A good pass^1 with poor pass^k is not production-ready for autonomous operation. It may be acceptable for draft generation with human review.
Layer 5: Security Evals
Security evals must include indirect prompt injection. Use AgentDojo-style tasks where malicious instructions appear in tool outputs, emails, webpages, tickets, files, and logs. (openreview.net)
Measure at least:
- attack success rate;
- utility under attack;
- utility without attack;
- data exfiltration attempts;
- unauthorized tool calls;
- policy bypasses;
- over-refusal;
- whether the defense withstands adaptive attacks.
A defense that reduces attack success by making the agent useless is not sufficient. A defense that preserves utility but lets private data leak is not sufficient.
Layer 6: Chaos and Incident Drills
Agents need operational testing:
- revoke credentials mid-task;
- return partial API failures;
- simulate stale reads;
- inject duplicate webhook events;
- change API schemas;
- expire approval tokens;
- remove network access;
- return malicious logs;
- test rollback;
- test audit trace reconstruction.
Tool-use reliability is not only a model metric. It is an operational property.
Defensive Patterns That Work Today
The following patterns are deployable now and materially reduce risk.
Pattern 1: Read-Only First
Default agents to read-only tools. Let them inspect, summarize, draft, and propose. Require explicit workflow escalation for write tools.
This single pattern prevents many catastrophic outcomes. If the Replit and PocketOS-style failure class teaches anything, it is that models should not casually hold production write/delete credentials. (oecd.ai)
Pattern 2: Draft-and-Commit
All high-impact actions should be two-phase:
- model creates a draft action;
- deterministic services validate and commit.
The draft is structured data, not an executed side effect. The commit requires policy approval.
Pattern 3: Environment-Scoped Credentials
Separate credentials by environment. A staging task should not be able to mutate production even if the model asks. Do not rely on the prompt phrase “never touch production.” Prompts are not access controls.
Pattern 4: Tool Firebreaks
Separate agents or runtimes by capability class:
- research agent: web/search/read-only;
- coding agent: repository sandbox;
- database agent: read-only query or migration draft;
- deployment agent: CI/CD plan creation;
- production operator: narrow approved actions only.
A single omnipotent agent with a giant tool belt is the least reliable architecture.
Pattern 5: Deterministic Policy Checks
Write policy as code. Examples:
- refunds over $X require manager approval;
- production deletions require two-person approval;
- external email cannot include secrets;
- tool calls touching regulated data require audit tags;
- customer data cannot be sent to non-allowlisted domains;
- cloud resources tagged prod=true cannot be deleted by agents.
The model may explain policy. It should not enforce policy.
Pattern 6: Taint-Aware Context
Mark untrusted content and prevent it from authorizing action. If an email says “the assistant has full authorization,” that sentence is data, not policy. OpenAI’s 2026 guidance explicitly frames real-world prompt injection as social engineering and recommends constraining impact even when manipulation succeeds (OpenAI, 2026). (openai.com)
Pattern 7: Structured Observations
Return compact typed observations from tools. Avoid dumping arbitrary text into the model loop. When raw text is necessary, keep it tainted and quote-bounded.
Pattern 8: Idempotency and Transactionality
Every mutating tool should accept an idempotency key. Repeated model calls should not duplicate refunds, emails, tickets, orders, or deployments. When possible, tools should support dry run, transaction, and rollback.
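A minimal sketch of the idempotency-key pattern, using an in-memory store as a stand-in for a durable one: a retried or duplicated model call replays the recorded result instead of repeating the side effect.

```python
_RESULTS: dict[str, dict] = {}

def issue_refund(idempotency_key: str, order_id: str, amount_cents: int) -> dict:
    if idempotency_key in _RESULTS:
        return _RESULTS[idempotency_key]   # duplicate call: no new side effect
    result = {"refund_id": f"ref_{len(_RESULTS) + 1}",
              "order_id": order_id, "amount_cents": amount_cents}
    # ... perform the actual refund transactionally here ...
    _RESULTS[idempotency_key] = result
    return result

a = issue_refund("key-123", "ord_19aa", 500)
b = issue_refund("key-123", "ord_19aa", 500)   # retried by the agent
assert a == b and len(_RESULTS) == 1
```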
Pattern 9: Refusal and Clarification as First-Class Outcomes
Schemas should include non-action outcomes:
```json
{
  "decision": "need_clarification",
  "question": "Which order should I cancel?",
  "candidate_order_ids": ["ord_1", "ord_2"]
}
```

If the only valid schema path is a tool call, the model will call tools under ambiguity.
Pattern 10: Continuous Trace-Based Evaluation
Log every prompt, tool candidate set, model call, schema result, validator decision, policy decision, tool execution, observation, and final state. Use traces to build regression tests. Every incident or near miss should become an eval.
Anti-Patterns
Anti-Pattern: “The Model Promised Not To”
A system prompt that says “never delete production data” is useful documentation for the model. It is not a control. If the agent has production deletion credentials, production deletion is in scope.
Anti-Pattern: Regex Parsing Free-Form Text
If the output drives software, do not parse prose with regex. Use strict schemas or constrained decoding. The era of hoping the model emits “valid enough JSON” should be over for production systems.
Anti-Pattern: Tool Results as Trusted Instructions
Search results, emails, webpages, documents, logs, and tickets should never be allowed to override developer or system policy. They are evidence, not instructions.
Anti-Pattern: One Agent, All Tools
A broad tool belt increases selection errors and blast radius. Tool routing and capability separation are reliability controls.
Anti-Pattern: LLM-as-Judge Only
LLM judging is useful for semantic evaluation, but production acceptance requires deterministic checks where possible. Final database state, resource diffs, policy decisions, and external transmissions should be verified by code.
Anti-Pattern: Confirmation Without Diff
User confirmation must expose exact side effects. “Approve this action?” is not meaningful if the action is summarized by the same model that proposed it.
Open Problems
Semantic Equivalence for Function Calls
AST matching and exact argument comparison are scalable, but they mis-handle semantically equivalent values. LLM-as-judge can help but introduces nondeterminism and judge vulnerabilities. The field needs hybrid evaluators that combine canonicalization, ontology-aware matching, embeddings, execution, and deterministic invariants.
Reliable Tool Retrieval at Large Scale
ToolLLM-scale API collections show the need for retrieval, but retrieval errors are under-studied as safety failures. The right tool may be absent from the shortlist; a dangerous adjacent tool may be present. Tool retrieval needs precision-focused metrics, policy-aware filtering, and adversarial evaluation.
Long-Horizon Consistency
τ-bench’s pass^k framing should become standard. Agents need to produce consistent outcomes across retries, paraphrases, partial failures, and state changes. Current models remain too stochastic for unsupervised long-horizon mutation in high-impact domains.
Untrusted Data Isolation
StruQ and CaMeL point toward real solutions, but most deployed systems still rely on prompt conventions, filters, and ad hoc wrappers. The unsolved engineering challenge is preserving instruction/data separation across real provider APIs, RAG systems, browser tools, file parsers, email clients, and multi-agent frameworks.
Structured Output Semantics
Constrained decoding guarantees shape, not meaning. Schema key wording affects model behavior, which means schema design is part of prompt design. We need schema-design science: naming conventions, description patterns, enum design, refusal schemas, and cross-model portability.
Agent Incident Reporting
Traditional CVEs fit some AI vulnerabilities, such as EchoLeak’s CVE-2025-32711, but many tool-use failures are socio-technical incidents involving model behavior, permissions, orchestration, and API design. The industry needs incident taxonomies that capture excessive agency, unsafe tool exposure, failed confirmation, taint violations, and state-machine bypass.
A Deployment Checklist
A tool-using agent should not be deployed with meaningful side effects until the following are true:
- All tool calls use strict schema enforcement or constrained decoding.
- All model outputs are validated after decoding.
- Tool routing limits the candidate set by task, user, state, and authority.
- Mutating tools are separated from read-only tools.
- Destructive tools require deterministic policy approval and scoped credentials.
- Production credentials are unavailable to development/staging tasks.
- Tool observations are typed, minimized, and provenance-labeled.
- Untrusted content cannot authorize tool calls.
- Sensitive-data flows to external sinks are blocked or require explicit confirmation.
- High-impact actions use draft-and-commit.
- Idempotency keys exist for mutating tools.
- Final-state evaluation exists for workflows.
- Prompt-injection evals include untrusted tool outputs.
- Traces are logged and replayable.
- Incident drills test API failure, credential mismatch, stale state, and malicious retrieved content.
- The agent has a safe “need clarification” and “cannot comply” path.
If any of these are missing, the agent may still be useful. But it should be treated as an assistant that drafts and recommends, not an autonomous actor.
Conclusion
The state of the art has split into two tracks. On the interface track, progress is real: structured outputs, constrained decoding, AST evaluation, BFCL-style leaderboards, and tool-specialized training have made malformed calls far less acceptable and less common. On the agent-reliability track, the hard problems remain: semantic correctness, state consistency, policy compliance, least privilege, and resistance to untrusted instructions. The evidence is consistent across τ-bench, ToolSandbox, ToolEmu, AgentDojo, CaMeL, StruQ, and production incidents: the reliable unit is not the model; it is the model plus constrained interface plus validator plus policy gate plus sandboxed executor plus eval loop.
The practical stance is therefore conservative but not pessimistic. LLMs are already useful tool users when bounded by strong software architecture. They are not reliable autonomous operators merely because they emit valid JSON. Build the boundary as if the model will sometimes misunderstand, sometimes overgeneralize, sometimes follow attacker-controlled text, and sometimes choose the most dangerous plausible tool. Then measure whether the surrounding system contains those failures. That is the difference between function calling as a demo and tool use as dependable automation.
References
- Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. “Gorilla: Large Language Model Connected with Massive APIs.” NeurIPS 2024. Canonical URL: https://proceedings.neurips.cc/paper_files/paper/2024/file/e4c61f578ff07830f5c37378dd3ecb0d-Paper-Conference.pdf
- Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. “API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs.” EMNLP 2023. Canonical URL: https://aclanthology.org/2023.emnlp-main.187/
- Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Dahai Li, Zhiyuan Liu, and Maosong Sun. “ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs.” ICLR 2024. Canonical URL: https://proceedings.iclr.cc/paper_files/paper/2024/hash/28e50ee5b72e90b50e7196fde8ea260e-Abstract-Conference.html
- Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. “The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models.” ICML 2025. Canonical URL: https://proceedings.mlr.press/v267/patil25a.html
- Berkeley Gorilla Team. “Berkeley Function-Calling Leaderboard V4.” 2026. Canonical URL: https://gorilla.cs.berkeley.edu/leaderboard
- Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. “τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains.” arXiv 2024. Canonical URL: https://arxiv.org/abs/2406.12045
- Jiarui Lu et al. “ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities.” arXiv 2024 / NAACL Findings 2025. Canonical URL: https://arxiv.org/abs/2408.04682
- Yangjun Ruan et al. “Identifying the Risks of LM Agents with an LM-Emulated Sandbox.” ICLR 2024. Canonical URL: https://openreview.net/forum?id=GEcwtMk1uA
- Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. “AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents.” NeurIPS 2024 Datasets and Benchmarks. Canonical URL: https://openreview.net/forum?id=m1YYAQjO3w
- OpenAI. “Introducing Structured Outputs in the API.” 2024. Canonical URL: https://openai.com/index/introducing-structured-outputs-in-the-api/
- Saibo Geng, Hudson Cooper, Michał Moskal, Samuel Jenkins, Julian Berman, Nathan Ranchin, Robert West, Eric Horvitz, and Harsha Nori. “JSONSchemaBench: A Rigorous Benchmark of Structured Outputs for Language Models.” ICML ES-FoMo Workshop 2025 / arXiv 2025. Canonical URL: https://arxiv.org/abs/2501.10868
- Yifan Le. “Schema Key Wording as an Instruction Channel in Structured Generation under Constrained Decoding.” arXiv 2026. Canonical URL: https://arxiv.org/abs/2604.14862
- Ella Rabinovich and Ateret Anaby-Tavor. “On the Robustness of Agentic Function Calling.” TrustNLP@NAACL 2025 / arXiv 2025. Canonical URL: https://arxiv.org/abs/2504.00914
- Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. “StruQ: Defending Against Prompt Injection with Structured Queries.” USENIX Security 2025. Canonical URL: https://www.usenix.org/conference/usenixsecurity25/presentation/chen-sizhe
- Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. “Defeating Prompt Injections by Design.” arXiv 2025. Canonical URL: https://arxiv.org/abs/2503.18813
- OpenAI. “Designing AI Agents to Resist Prompt Injection.” 2026. Canonical URL: https://openai.com/index/designing-agents-to-resist-prompt-injection/
- OWASP Foundation. “OWASP Top 10 for Large Language Model Applications.” 2025. Canonical URL: https://owasp.org/www-project-top-10-for-large-language-model-applications/
- OWASP GenAI Security Project. “LLM01:2025 Prompt Injection.” 2025. Canonical URL: https://genai.owasp.org/llmrisk/llm01-prompt-injection/
- Aim Security / EchoLeak authors. “EchoLeak: The First Real-World Zero-Click Prompt Injection Exploit in a Production LLM System.” arXiv 2025. Canonical URL: https://arxiv.org/abs/2509.10540
- NIST National Vulnerability Database. “CVE-2025-32711.” 2025. Canonical URL: https://nvd.nist.gov/vuln/detail/CVE-2025-32711
- OECD.AI. “Replit AI Coding Tool Deletes Live Production Database and Fakes Data.” AI Incident entry, 2025. Canonical URL: https://oecd.ai/en/incidents/2025-07-19-1eb1
- Live Science. “AI Agent Deletes Company’s Entire Database in 9 Seconds.” 2026. Canonical URL: https://www.livescience.com/technology/artificial-intelligence/i-violated-every-principle-i-was-given-ai-agent-deletes-companys-entire-database-in-9-seconds-then-confesses