What Are Glitch Tokens? Under-Trained Tokens and the LLM Attack Surface
Quick Answer
Glitch tokens are entries in a language model's vocabulary that, when included in a prompt, disproportionately produce anomalous output: incoherence, refusals, truncations, loops, or silent data corruption. The usual root cause is under-training, where the tokenizer's vocabulary contains tokens the model rarely saw during pretraining. Their embeddings are unstable, so any prompt that uses them is unstable too. Glitch tokens are a reliability and security concern, not a single CVE-class bug.
Glitch tokens are vocabulary entries that cause a language model to behave erratically whenever they appear in a prompt. They are a side effect of how modern LLMs are built: the tokenizer is trained on one corpus, the model on another, and the seams between the two leak. For a security-aware practitioner, glitch tokens are a class of unreliability that widens the attack surface, confounds filtering and logging, and silently corrupts extraction pipelines.
What is a glitch token?
A glitch token is any token in a model's vocabulary that, when present in the input, has a disproportionate likelihood of producing anomalous output: incoherent text, unexplained refusals, truncations, infinite loops, or silent data corruption. The most common root cause is under-training — the term Land and Bartolo introduced in "Fishing for Magikarp" (EMNLP 2024) for tokens that exist in the tokenizer but appear rarely or never in the pretraining corpus.
The mental model is simple. Tokenization is part of the model. A token the model never properly learned is a hole in the model's input space. Prompts that step into that hole produce unreliable behavior, and that unreliability is a property of the model-and-tokenizer pair, not of any individual user.
How does it work?
Glitch tokens arise from the gap between vocabulary construction and pretraining:
- Vocabulary construction. A tokenizer (BPE, SentencePiece, or similar) is trained on some corpus to produce a fixed vocabulary, typically tens to hundreds of thousands of tokens. The vocabulary picks up byte-level fragments, subwords, scraping artifacts, leftover whitespace patterns, encoded control sequences, and code fragments.
- Pretraining mismatch. The model is then pretrained on a corpus that is not necessarily identical to the tokenizer's. Some vocabulary entries appear rarely or never, so their embedding vectors receive few meaningful gradient updates; a rough coverage check is sketched after this list.
- Unstable representations. An under-trained token's embedding sits in a poorly-conditioned region of representation space. Attention and downstream activations don't behave the way they do for well-trained tokens.
- Anomalous output. The model returns unrelated text, refuses without cause, repeats, truncates, or — more dangerously — silently misinterprets the input.
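To make the pretraining-mismatch point concrete, here is a rough vocabulary-coverage check: a minimal sketch, assuming a Hugging Face tokenizer and a local text file standing in for a slice of a pretraining-like corpus. The model name and filename are placeholders, and a small sample will over-count "never seen" tokens.

```python
# Rough vocabulary-coverage check: which token IDs never appear in a corpus
# sample? Under-trained candidates cluster among the never-seen IDs.
# Assumes the Hugging Face `transformers` package; names below are placeholders.
from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
counts = Counter()

with open("corpus_sample.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(tokenizer.encode(line, add_special_tokens=False))

vocab_size = len(tokenizer)
never_seen = sum(1 for tid in range(vocab_size) if counts[tid] == 0)
print(f"{never_seen}/{vocab_size} token IDs never appear in this sample")
```

The real pretraining distribution is usually unavailable, which is why practical detection relies on the indirect methods described next.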
The Glitcher toolkit distinguishes several behavioral flavors: hallucination and disruption; truncation and DoS-like behavior; bypass-like or injection-like shifts in instruction compliance; and extraction confusion. The last is worth singling out. On Llama 3.2 1B Instruct, the Glitcher v2.0 paper measured an 84% confusion rate across 1,800 test combinations when the model was asked to extract structured data from text containing representations of control characters: given an input like "Adminnn" followed by backspace representations, the model would silently "apply" the backspaces and return "Admin" as the extracted value. The input contains normal tokens, but the failure mode is the same class — model behavior on inputs near training-distribution edges is unreliable in ways that matter.
Detection itself is unsolved. Three method families coexist, each with known coverage gaps:
- Embedding-space methods flag tokens with anomalously low L2 norm or large distance from neighbors. Cheap, but the Glitcher audit found that 51.6% of glitch tokens have L2 norms within one standard deviation of the vocabulary mean, so norm thresholding misses over half of them. A minimal norm-scan sketch follows this list.
- Behavior-driven search uses gradient-guided discrete optimization to find tokens that maximize predictive entropy or another behavioral signal, then validates with multi-attempt probes. Better recall than norm methods, but biased toward whatever objective was chosen and toward Latin-script subword fragments.
- Brute-force vocabulary scans test every token. Highest coverage; the full-vocabulary census on Llama 3.2 1B Instruct found roughly ten times as many candidates as entropy mining alone. Compute-expensive but tractable on a consumer GPU for a 128K vocabulary.
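As referenced in the first bullet, here is a minimal sketch of the embedding-norm heuristic, assuming a Hugging Face causal LM loadable with transformers; the model name and the z-score threshold are placeholders, and the output is a candidate count only, since candidates still need behavioral validation and this page does not publish token strings.

```python
# Embedding-space heuristic: flag tokens whose input-embedding L2 norm is
# anomalously low relative to the rest of the vocabulary. Candidates only;
# they still need behavioral validation (see the ASR sketch below).
# Assumes `torch` and `transformers`; the model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
emb = model.get_input_embeddings().weight.detach()      # [vocab_size, hidden_dim]

norms = emb.norm(dim=-1)
z_scores = (norms - norms.mean()) / norms.std()
candidates = torch.nonzero(z_scores < -2.0).flatten()   # unusually low norm

print(f"{candidates.numel()} low-norm candidates out of {emb.shape[0]} tokens")
# Deliberately prints a count, not token strings or IDs.
```

Per the 51.6% figure above, treat this as a cheap pre-filter, not a detector.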
A reliable measurement requires stochastic multi-attempt generation. The Glitcher audit found that an earlier implementation used greedy decoding for every "attempt," making each run byte-identical and collapsing the Attack Success Rate (ASR) distribution into a strict 0%/100% split. With sampling at temperature 0.7, the true distribution is continuous: a meaningful fraction of glitch tokens are soft glitches that fail intermittently — exactly the failure mode hardest to debug in production.
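A minimal sketch of the corrected measurement, assuming a transformers model/tokenizer pair and a caller-supplied looks_anomalous predicate (a placeholder for whatever behavioral check you validate against); the attempt count, temperature, and generation length are illustrative defaults.

```python
# Multi-attempt stochastic ASR: sample several completions for a prompt that
# embeds the candidate token, score each, and report the fraction that fail.
# Greedy decoding would make every attempt identical and force a 0% or 100% verdict.
# `model` / `tokenizer` are a transformers AutoModelForCausalLM / AutoTokenizer
# pair; `looks_anomalous` is a placeholder predicate over generated text.
def measure_asr(model, tokenizer, prompt_with_token, looks_anomalous,
                attempts=10, temperature=0.7, max_new_tokens=64):
    inputs = tokenizer(prompt_with_token, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]
    failures = 0
    for _ in range(attempts):
        out = model.generate(
            **inputs,
            do_sample=True,                  # stochastic, not greedy
            temperature=temperature,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
        text = tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True)
        if looks_anomalous(text):
            failures += 1
    return failures / attempts               # fractional ASR: soft glitches land between 0 and 1
```

Reporting the fraction rather than a pass/fail verdict is what makes soft glitches visible.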
Why does it matter?
Glitch tokens are not exotic. The full-vocabulary census on Llama 3.2 1B Instruct identified 1,131 confirmed glitch tokens at ASR ≥ 0.3, with a stratified-sample extrapolation suggesting on the order of 18,000 across the full 128K vocabulary. That is a non-trivial fraction of the model's input space.
Concretely, glitch tokens cause:
- Reliability failures that look like outages. A production prompt that occasionally contains a glitch token — through a username, URL, filename, or upstream pipeline — produces flaky output. Soft glitches with intermediate ASR are the worst case: they don't reproduce reliably in tests but appear at scale in production.
- Filtering and logging confusion. Malformed, truncated, or nonsensical output evades content filters, breaks structured logging, and sometimes prevents alerts from firing.
- Stress on guardrails. Some glitch tokens correlate with bypass-like or injection-like behavior. They are not jailbreaks themselves, but they reveal weak points an adversary will explore.
- Silent data manipulation in extraction pipelines. The encoded-character confusion result shows that LLM-based extractors will silently "apply" text representations of control characters during structured extraction. Any system that uses an LLM to parse user-supplied text into CSV, JSON, log fields, or credentials is exposed to this class.
- Increased attack surface. Tokens whose behavior is hard to predict are tokens an adversary can experiment with. The Glitcher toolkit's genetic-algorithm combination search shows that combinations of glitch tokens can amplify disruption or steer outputs toward chosen targets. Combinations don't transfer cleanly between models, but the search procedure exists.
This page describes the class at full detail and cites public research and open-source tooling. It does not publish specific glitch token strings, IDs, or vendor-coupled findings — those belong in coordinated disclosure, not in an explainer.
How do you defend against it?
Glitch tokens are a property of the model and its tokenizer, so defenses sit at the input boundary, the output boundary, or in onboarding. In rough order of leverage:
- Treat tokenizer-and-model as a versioned pair during onboarding. When you adopt a new model or update an existing one, run a glitch-token scan as part of the onboarding suite, the way you run jailbreak and regression suites. The open-source Glitcher toolkit is one option. Cost: compute (hours, not days, for typical vocabulary sizes) and engineering time to wire into evaluation. Does not cover: glitch tokens that only appear under prompt structures your scan didn't try.
- Input normalization at the application boundary. Canonicalize Unicode, strip or normalize control characters, reject obviously malformed byte sequences, and constrain inputs to the language and character set your application actually needs; a minimal sketch follows this list. Cost: mild engineering effort; some legitimate inputs may be rejected. Does not cover: glitch tokens composed of normal-looking subwords.
- Token-aware filtering for known glitches. Where a scan has identified specific high-impact glitch tokens for the model you're deploying, block or transform them at the text level before they reach the model. Keep the blocklist private and version it with the model. Cost: ongoing maintenance; one more thing to keep in sync. Does not cover: unknown glitches or tokens introduced by tokenizer or model updates.
- Output-side anomaly monitoring. Detect malformed, truncated, repeated, or refusal-like output patterns and trigger retries, fallbacks, or human review; a simple heuristic sketch follows this list. Catches glitch behavior even when the input wasn't obviously suspicious. Cost: engineering effort and a modest false-positive rate. Does not cover: silent corruption where the output looks plausible but is wrong.
- Don't trust LLMs as literal extractors of attacker-controlled text. If an LLM is parsing user-supplied data into a structured output, assume text representations of control characters and other near-edge inputs may be silently transformed. Validate extractions against a deterministic post-check (see the cross-check sketch after this list), or use a non-LLM parser for the literal-extraction step. Cost: engineering effort and possibly less flexibility. Does not cover: extraction tasks that genuinely require LLM understanding.
- Use multi-attempt stochastic validation for measurement. When measuring whether a token is a glitch, sample at temperature 0.7 or so across multiple attempts and report the full ASR distribution, not a binary pass/fail. Soft glitches are real. Cost: more compute per measurement. Does not cover: tokens that only glitch under prompts your measurement didn't try.
- Coordinate disclosure of high-impact findings to model vendors. If a glitch token has material safety, reliability, or privacy implications, report it privately and follow standard coordinated-disclosure practice. Do not publish raw token strings or reproduction guidance. Cost: time and the discipline to keep findings private until they are fixed. Does not cover: anything; this is a process control.
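For the input-normalization control above, a minimal sketch, assuming your application only needs printable text plus common whitespace and can afford to reject everything else; the allowed characters, length cap, and canonical form are application decisions, not fixed rules.

```python
# Boundary normalization: canonicalize Unicode, drop control and format
# characters, and cap length. Conservative by design; tune the allowed
# character classes to what the application actually needs.
import unicodedata

def normalize_input(text: str, max_len: int = 4096) -> str:
    text = unicodedata.normalize("NFKC", text)
    # Drop category-C characters (control, format, surrogate, unassigned)
    # except ordinary whitespace.
    cleaned = "".join(
        ch for ch in text
        if ch in "\n\t " or not unicodedata.category(ch).startswith("C")
    )
    if len(cleaned) > max_len:
        raise ValueError("input too long")
    return cleaned
```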
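For output-side monitoring, a minimal sketch of cheap heuristics; the thresholds are illustrative, not calibrated, and a check like this can also serve as the looks_anomalous predicate in the ASR measurement sketch earlier on this page.

```python
# Cheap output-side anomaly heuristics: flag completions that are empty,
# heavily repetitive, or stuck in a loop, and route them to a retry or
# review path. Thresholds below are illustrative placeholders.
def output_looks_anomalous(text: str) -> bool:
    if not text.strip():
        return True                          # empty or whitespace-only output
    words = text.split()
    if len(words) >= 10 and len(set(words)) / len(words) < 0.3:
        return True                          # degenerate repetition
    if len(words) >= 3 and len(set(words[-3:])) == 1:
        return True                          # looping on the same word at the end
    return False
```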
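And for the literal-extraction control, a minimal deterministic cross-check, assuming a hypothetical field format (a username following "user=") that a plain regex can capture; if the model silently "applies" encoded control characters, the two values disagree and the extraction is rejected.

```python
# Deterministic cross-check on an LLM extraction: re-extract the field with a
# plain regex over the raw input and require agreement. A silently transformed
# value (e.g. encoded backspaces "applied" by the model) will not match.
import re

# Hypothetical field format: a username following "user=" up to whitespace.
USER_RE = re.compile(r"user=(\S+)")

def verify_username_extraction(raw_input: str, llm_value: str) -> bool:
    match = USER_RE.search(raw_input)
    return match is not None and match.group(1) == llm_value
```

Where the field format cannot be captured deterministically, at least require the extracted value to appear verbatim in the raw input and log disagreements for review.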
Tokenizer pruning, vocabulary remapping, and intermediate-layer "repair" are also possible but require model access and careful regression testing. They are research-grade defenses for organizations that train or fine-tune their own models.
Related concepts and tools
- The Glitcher v2.0 toolkit paper is the source for the ASR methodology, the behavioral categorization, the genetic-algorithm combination search, and the encoded-character confusion finding.
- The Glitcher ASR validation audit is the source for the determinism bug, the mining-versus-scan coverage gap, and the corrected continuous ASR distribution.
- Other security explainers on richards.ai cover related classes of unreliable model behavior on inputs near training-distribution edges, including multi-agent prompt injection.
- The full set of research papers covers adjacent work in LLM security, agent safety, and evaluation.
FAQ
Are glitch tokens a security vulnerability?
They are a reliability and security concern, not a single CVE-class vulnerability. Glitch tokens cause unstable behavior, complicate filtering and logging, and widen the attack surface around prompts and extraction pipelines. Whether any particular glitch token is exploitable depends on how the model is deployed, what surrounds it, and what other controls are in place.
How are glitch tokens discovered?
Three method families exist: embedding-space anomaly detection (low L2 norm, neighbor distance), behavior-driven search (entropy-guided gradient mining with multi-attempt validation), and brute-force vocabulary scans. No single method has good coverage. Full-vocabulary scans on Llama 3.2 1B Instruct found roughly ten times as many candidates as entropy mining alone, and norm-based filters miss most glitch tokens entirely.
What should I do if I find one in production?
Don't panic and don't publish the token. Reproduce with multi-attempt stochastic generation (temperature around 0.7) to confirm it isn't a single-run artifact. Add input-side normalization or a token-level filter, monitor for the failure pattern, and report it to the model vendor under coordinated disclosure if the impact is material. Avoid posting raw token strings publicly.
External References
- Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models (Land and Bartolo, EMNLP 2024)
- Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection (Li et al., 2024)
- GlitchMiner: Mining Glitch Tokens via Gradient-based Discrete Optimization (Wu et al., 2024)
- Glitcher toolkit (open source)