Constrained Decoding

Constrained decoding is a sampling-time technique that restricts a language model's token-by-token output to tokens that keep the running prefix consistent with a target grammar — most often a JSON Schema, regex, or context-free grammar. At each decoding step, the runtime masks tokens that would make the prefix unextendable under the grammar and samples only from the remainder, so the final output is structurally guaranteed to conform. It is qualitatively different from prompting: prompting asks the model to behave; constrained decoding changes the set of outputs the model can emit at all. It guarantees shape, not meaning — a tool call can be schema-valid and still target the wrong tool with the wrong arguments. It sits at the grammar layer of the tool-use reliability stack.
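
The masking step described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy vocabulary `VOCAB`, the two-string target language `ACCEPTED`, and the random `fake_logits` that stand in for a real model. A production engine would compile a JSON Schema, regex, or context-free grammar into an automaton rather than enumerate accepted strings.

```python
import math
import random

EOS = "<eos>"

# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["{", "}", '"tool"', ":", '"search"', '"calc"', "hello", " ", EOS]

# Hypothetical target language: every string the "grammar" accepts.
ACCEPTED = {'{"tool":"search"}', '{"tool":"calc"}'}


def is_viable(prefix: str) -> bool:
    """True if the prefix can still be extended to an accepted string."""
    return any(s.startswith(prefix) for s in ACCEPTED)


def allowed_tokens(prefix: str) -> list:
    """Indices of tokens that keep the running prefix extendable."""
    allowed = [
        i for i, tok in enumerate(VOCAB)
        if tok != EOS and is_viable(prefix + tok)
    ]
    if prefix in ACCEPTED:
        # Stopping is only legal once the output is complete.
        allowed.append(VOCAB.index(EOS))
    return allowed


def fake_logits(rng: random.Random) -> list:
    """Stand-in for a model forward pass: one random score per token."""
    return [rng.gauss(0.0, 1.0) for _ in VOCAB]


def sample_masked(logits: list, allowed: list, rng: random.Random) -> int:
    """Softmax-sample over the allowed tokens only; all others are masked out."""
    top = max(logits[i] for i in allowed)
    weights = [math.exp(logits[i] - top) for i in allowed]
    return rng.choices(allowed, weights=weights, k=1)[0]


def constrained_decode(seed: int = 0, max_steps: int = 16) -> str:
    rng = random.Random(seed)
    prefix = ""
    for _ in range(max_steps):
        allowed = allowed_tokens(prefix)
        if not allowed:
            raise RuntimeError("grammar dead end: no token keeps the prefix viable")
        token = VOCAB[sample_masked(fake_logits(rng), allowed, rng)]
        if token == EOS:
            return prefix
        prefix += token
    raise RuntimeError("max_steps reached before the output was complete")


if __name__ == "__main__":
    print(constrained_decode(seed=0))
```

Whatever the random logits are, the result is always a member of `ACCEPTED`, because tokens such as `hello` are never eligible. Which tool name appears still depends entirely on the scores, which is the shape-versus-meaning distinction in miniature: the mask rules out malformed output but cannot pick the right tool.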
