Tool-Use Reliability
Quick Answer
Tool-use reliability is the end-to-end property of an LLM agent that every tool call it emits is syntactically well-formed, schema-valid, semantically correct, state-consistent, and authorized. It spans five layers: syntax, schema, semantics, state, and authority. Structured outputs cover only the lowest two, and function-calling fine-tuning helps through the third. Production incidents typically occur at the upper two layers, state and authority, where the model is implicitly trusted to self-restrict.
Tool-Use Reliability
Tool-use reliability is the end-to-end property of a tool-using LLM agent that every action it emits is syntactically well-formed, schema-valid, semantically correct, state-consistent, and authorized. It is a distributed-systems boundary property, not a model feature. Function calling is the serialization protocol and structured outputs enforce the grammar layer; tool-use reliability covers the whole stack, including the layers above both.
The source paper decomposes the property into five layers: syntactic, schema, semantic, state, and authority validity. Structured output enforcement addresses layers one and two; function-calling fine-tuning helps through layer three. Most public production incidents — destructive operations on databases, cloud resources, or code repositories — occur at layers four and five, where the planner is implicitly trusted to self-restrict rather than gated by an external policy boundary.
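The five layers can be read as an ordered series of checks applied outside the model before a call is executed. The sketch below is a minimal illustration under stated assumptions, not the source paper's implementation: the tool registry `TOOLS`, the `Context` fields, the `validate_call` function, and the `{"name": ..., "args": ...}` call shape are all hypothetical names chosen for the example, and the semantic check (the table argument must look like a plain identifier) is only a crude stand-in for real intent validation.

```python
import json
from dataclasses import dataclass

# Hypothetical tool registry: tool name -> expected argument names and types.
TOOLS = {
    "read_table": {"table": str},
    "drop_table": {"table": str},
}

@dataclass
class Context:
    existing_tables: set  # assumed snapshot of current system state
    granted: set          # tool names the caller is explicitly allowed to use

def validate_call(raw: str, ctx: Context) -> tuple:
    """Return (ok, layer): layer is the first layer that failed, or "ok"."""
    # Layer 1, syntactic: the payload must parse as JSON.
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "syntax"

    # Layer 2, schema: known tool, exact argument names, correct types.
    if not isinstance(call, dict):
        return False, "schema"
    name, args = call.get("name"), call.get("args")
    if not isinstance(name, str) or name not in TOOLS or not isinstance(args, dict):
        return False, "schema"
    spec = TOOLS[name]
    if set(args) != set(spec):
        return False, "schema"
    if any(not isinstance(args[key], typ) for key, typ in spec.items()):
        return False, "schema"

    # Layer 3, semantic (stand-in): the argument must be a plain identifier.
    if not args["table"].isidentifier():
        return False, "semantic"

    # Layer 4, state: the target must exist in the current state snapshot.
    if args["table"] not in ctx.existing_tables:
        return False, "state"

    # Layer 5, authority: an external grant decides, not the model.
    if name not in ctx.granted:
        return False, "authority"

    return True, "ok"

ctx = Context(existing_tables={"users"}, granted={"read_table"})
print(validate_call('{"name": "read_table", "args": {"table": "users"}}', ctx))
print(validate_call('{"name": "drop_table", "args": {"table": "users"}}', ctx))
```

In this sketch the second call is well-formed and schema-valid, yet it is rejected at the authority layer because `drop_table` was never granted. That is the kind of gap that structured outputs alone do not close.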
See also
- Tool hijacking — nearest-neighbor failure at the semantic/selection layer
- Excessive agency — failure at the authority layer this term subsumes
- Ambient authority — related concept for the authority-validity layer
- Compound AI system — the architectural unit on which reliability is measured