Glitcher: Mining and Classifying Glitch Tokens in Large Language Models
A Research Toolkit for Detecting, Characterizing, and Steering Glitch Tokens in Large Language Models
Version: 2.0
Date: 2026-02-25
Author: Jeremy (Independent Security Researcher)
Repository: https://github.com/binaryninja/glitcher
License: MIT
Responsible-use note: This whitepaper and the Glitcher toolkit are intended for defensive security research, model reliability evaluation, and safety engineering. It does not publish token lists or step-by-step guidance intended to evade safeguards or enable abuse. If you discover a vulnerability affecting a deployed system, follow coordinated disclosure practices.
Abstract
Large language models (LLMs) rely on tokenization to map text to discrete symbols before processing. In practice, many tokenizers contain tokens that are rare, under-trained, or otherwise anomalous relative to the model's training distribution. When such tokens appear in prompts---sometimes via innocuous-looking text---they can trigger unexpected, unstable, or low-quality behavior, including nonsensical completions, irrelevant outputs, refusals, truncations, or other pathological generation patterns.
This whitepaper introduces Glitcher, a modular research toolkit designed to detect, characterize, and mitigate "glitch tokens" across model families and tokenizer types, and to evolve adversarial token combinations that can steer model outputs toward specific targets. Glitcher integrates and systematizes multiple complementary lines of research: embedding-space anomaly detection, entropy-guided gradient search, multi-attempt behavioral validation, genetic-algorithm-based token combination search, and encoded character confusion testing. The result is a reproducible workflow for: (1) identifying candidate tokens, (2) measuring their behavioral impact via attack success rate (ASR) validation, (3) evolving token combinations that maximize probability reduction or output steering, (4) testing whether models decode text representations of control characters during structured extraction, (5) validating findings across multiple providers, and (6) applying pragmatic mitigations in model pipelines.
Executive summary
Glitcher is built around four core ideas:
- Glitch tokens exist and matter. Prior work shows that tokenization/training mismatches can yield tokens that are effectively under-trained, and that interaction with these tokens can compromise response quality and reliability.
- No single detector is sufficient. Some glitch tokens appear as embedding outliers; others reveal themselves via attention/activation anomalies or via behavior-driven search. Glitcher uses multiple detectors and a shared scoring/reporting layer.
- Mitigation can be practical. Many real-world systems can reduce risk via runtime filtering, monitoring, targeted fine-tuning, or intermediate-layer "repair" strategies, even without retraining from scratch.
- Token combinations can be evolved to steer outputs. Individual glitch tokens have measurable impact, but combinations of tokens---discovered through evolutionary search---can dramatically amplify probability reduction or redirect model outputs toward specific target tokens.
What you get from Glitcher:
- A repeatable pipeline to scan token vocabularies, generate candidates, run behavioral probes, and output ranked findings with ASR-validated severity scores.
- An evolutionary search engine (GeneticProbabilityReducer) that breeds token combinations to maximize probability disruption or steer outputs toward a wanted token.
- A multi-provider validation framework to reproduce findings across OpenAI, Anthropic, Mistral, Lambda, OpenRouter, and local HuggingFace models with quantization support.
- A modular classification system with pluggable test categories (injection, DoS, hallucination, disruption, bypass, email/domain extraction, control character and encoded character confusion).
- An encoded character confusion test suite that evaluates whether models decode text representations of control characters (e.g., \x08, \u0008, ^H) during structured extraction---with empirical results showing an 84% confusion rate across 1,800 test combinations.
- A mitigation playbook oriented toward production guardrails (filtering/monitoring) and research mitigations (patching, retraining experiments).
Table of contents
- 1. Introduction
- 2. Background and terminology
- 3. Why glitch tokens matter
- 4. Related work
- 5. The Glitcher toolkit
- 6. Detection and characterization methodology
- 7. Adversarial token combination search
- 8. Cross-provider validation
- 9. Mitigation strategies
- 10. Evaluation methodology
- 11. Responsible release and disclosure
- 12. Limitations and future work
- 13. Conclusion
- References
- Appendix A: Report schema
- Appendix B: Reproducibility checklist
1. Introduction
Tokenization is often treated as plumbing: text goes in, tokens come out. But tokenization is part of the model, and subtle properties of a tokenizer's vocabulary can meaningfully shape model behavior. Over the last few years, researchers and practitioners have repeatedly observed "glitchy" behavior caused by certain tokens---tokens that can be present in a tokenizer vocabulary yet be rare or absent in training, or tokens whose internal representations produce unstable downstream dynamics.
Glitcher operationalizes a simple principle:
If a token (or short token sequence) reliably produces anomalous behavior, you should be able to (a) find it, (b) measure it, (c) monitor it, and (d) mitigate it.
This whitepaper focuses on the engineering and research workflow needed to do exactly that.
2. Background and terminology
2.1 Tokenization and vocabulary artifacts
Most LLMs tokenize text into subword units (e.g., byte-pair encoding or unigram/SentencePiece variants). The tokenizer vocabulary is built from a corpus and an optimization procedure---not necessarily from the exact same data distribution used for model pretraining. This mismatch can introduce tokens that are effectively under-trained (insufficiently represented during model training), or otherwise behave oddly when embedded and processed.
2.2 What is a "glitch token"?
This paper uses glitch token as a broad, behavior-first term:
- A glitch token is any token that, when included in input, has a disproportionate likelihood of causing anomalous model behavior: degraded coherence, irrelevant or incorrect output, pathological repetition, refusal/truncation artifacts, or other instability.
A related term used in the literature is under-trained token:
- An under-trained token is a token present in the tokenizer vocabulary that appears rarely or not at all during pretraining, producing poorly trained internal representations and potentially unreliable behavior.
Not all glitch tokens are necessarily under-trained, but under-training is a common root cause and a practical detection signal.
3. Why glitch tokens matter
3.1 Reliability risk (quality and stability)
Glitch tokens can cause models to:
- generate irrelevant or incoherent text,
- fail to follow instructions,
- produce truncated or blank outputs,
- become unstable across repeated runs under "deterministic" settings (depending on implementation details),
- or otherwise degrade response quality in ways that are hard to debug.
In production, these behaviors can look like random outages, flakiness, or inexplicable regression---especially when the triggering token is hidden behind formatting, encoding, or an upstream pipeline.
3.2 Safety and security risk (defensive perspective)
From a security and safety standpoint, glitch tokens matter because they can:
- complicate policy enforcement and filtering, if the model behaves unexpectedly or produces malformed outputs,
- confound logging and auditing, if outputs are truncated or nonsensical,
- stress-test guardrails, revealing weak points in prompting, tool orchestration, or safety monitoring,
- increase attack surface by introducing hard-to-predict behavior tied to specific token sequences.
This whitepaper does not publish exploit token strings or "how to bypass" procedures. The objective is to improve defensibility by identifying and mitigating these failure modes.
3.3 Operational risk (incident response and regression testing)
Organizations integrating new models or tokenizer versions should treat glitch-token analysis as a standard part of model onboarding, similar to:
- prompt regression suites,
- jailbreak and abuse testing,
- PII and data leakage checks,
- and performance benchmarking.
Glitcher is designed to plug into those workflows.
4. Related work
Glitcher is informed by and aligned with several research threads:
- Taxonomy + embedding-space detection: Work that defines glitch tokens and proposes iterative clustering in embedding space for detection ("GlitchHunter").
- Under-trained token detection: Work highlighting tokenizer/training disconnects and methods for identifying under-trained vocabulary entries ("Fishing for Magikarp").
- Intermediate-layer analysis + repair: Work demonstrating that glitch tokens correlate with deviations in attention and intermediate-layer dynamics and proposing detection + mitigation ("GlitchProber").
- Behavior-driven mining: Work proposing gradient-guided, entropy-maximizing search in discrete token space to discover glitch tokens ("GlitchMiner").
See the References section for full citations.
5. The Glitcher toolkit
5.1 Design goals
Glitcher is designed to be:
- Model-agnostic: support different architectures and deployment stacks (open weights, APIs where permitted, etc.).
- Tokenizer-aware: explicitly represent tokenizer type, vocab size, and byte/unicode behavior.
- Multi-signal: combine embedding/weight signals, runtime behavior signals, and optional internal telemetry signals.
- Multi-provider: validate findings across providers (OpenAI, Anthropic, Mistral, Lambda, OpenRouter, local HuggingFace transformers) via a pluggable provider registry.
- Reproducible: deterministic experiment configs, fixed seeds where possible, and structured reporting.
- Defensive by default: avoid publishing token lists that meaningfully increase offensive capability; enable redaction/hashing in reports.
5.2 High-level workflow
- Enumerate tokens (token inventory + metadata).
- Generate candidates using lightweight heuristics and anomaly scores (fast pass).
- Characterize behavior using controlled prompt templates and repeated trials (behavioral pass).
- Validate candidates via multi-attempt ASR measurement to reduce false positives.
- Optional deep probes: intermediate-layer/attention deviation monitoring (if internal access is available).
- Evolve combinations: genetic algorithm search for adversarial token combinations that amplify probability reduction or steer outputs.
- Cross-provider validation: reproduce findings across multiple providers and quantization levels.
- Rank + report: aggregate signals into a risk score and emit artifacts.
- Mitigate: filtering/monitoring, patching experiments, and documentation for downstream users.
5.3 Architecture
5.4 Core modules
Glitcher organizes its implementation into modules that can be enabled independently:
- Candidate generation (model.py): L2-norm analysis of embedding vectors to identify tokens with anomalously low norms (indicating under-training), combined with embedding-distance clustering to surface tokens far from well-trained neighborhoods.
- Entropy-guided mining (model.py): Gradient-based discrete optimization that searches token space using predictive entropy as the objective function, with iterative batch evaluation and ASR validation to confirm candidates.
- Behavioral classification (classification/): A modular classifier with pluggable test categories---injection/jailbreak, infinite/DoS, hallucination, disruption, bypass, email extraction, domain extraction, control character confusion, and encoded character confusion---each with configurable indicator functions and priority ordering.
- Enhanced validation (enhanced_validation.py): Multi-attempt ASR measurement that generates 30--100+ tokens per trial across multiple attempts, reducing false positive rates from ~30--50% (single-probe) to ~5--15%.
- Genetic combination search (genetic/reducer.py): The GeneticProbabilityReducer evolves populations of token combinations using selection, crossover, and mutation to maximize probability reduction or output steering (see Section 7).
- Provider registry (poc/providers/): Abstract BaseProvider with implementations for OpenAI, Anthropic, Mistral, Lambda, OpenRouter, and local HuggingFace transformers with int4/int8/fp16/bf16 quantization support.
5.5 CLI interface
Glitcher provides three CLI entry points: the main glitcher command (whose subcommands are listed below), plus the standalone glitch-scan and glitch-classify utilities.
| Command | Description |
|---|---|
| glitcher mine MODEL | Mine for glitch tokens using entropy-guided gradient search |
| glitcher test MODEL --token-ids ID1,ID2 | Test specific tokens with enhanced multi-attempt validation |
| glitcher compare MODEL --token-ids ID1,ID2 | Compare token behavior across prompt templates |
| glitcher genetic MODEL --base-text TEXT | Evolve token combinations to reduce target probability |
| glitcher genetic MODEL --wanted-token TOKEN | Evolve combinations to steer output toward a wanted token |
| glitcher classify MODEL --token-file FILE | Classify tokens into behavioral categories |
| glitcher chat MODEL TOKEN_ID | Interactive chat session with a glitch token injected |
| glitcher domain MODEL --token-ids ID1,ID2 | Domain extraction analysis |
| glitcher classify MODEL --encoded-char-standalone | Run the encoded character confusion matrix (no tokens needed) |
| glitcher classify MODEL --encoded-char-only --token-file FILE | Run encoded character confusion with glitch tokens |
| glitcher classify MODEL --control-char-standalone | Run the raw control character confusion matrix |
| glitcher gui | Launch the tkinter GUI with real-time visualization |
| glitch-scan | Scan and validate glitch tokens |
| glitch-classify | Classify glitch tokens from a token file |
6. Detection and characterization methodology
Glitcher treats "is this token risky?" as an empirical question answered via multiple signals.
6.1 Candidate generation signals (fast pass)
Glitcher implements the following candidate generation strategies:
- L2 norm thresholding: Tokens whose embedding vectors have anomalously low L2 norms are flagged as likely under-trained. Low-norm embeddings indicate the token received insufficient gradient updates during pretraining.
- Embedding distance clustering: Tokens that are far from their nearest well-trained neighbors in embedding space, measured via cosine or Euclidean distance, are surfaced as outlier candidates.
- Range-based mining: Targeted scanning of specific Unicode ranges (e.g., CJK, Cyrillic, special symbol blocks) and special token ID ranges to find tokens introduced by tokenizer-construction artifacts.
- Tokenizer artifact heuristics: Unusual byte sequences, control-like renderings, whitespace-heavy tokens, or fragments that appear "code-ish" are flagged for downstream validation.
This phase prioritizes recall (finding many candidates), while downstream characterization and ASR validation improve precision.
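The L2-norm signal can be made concrete with a short sketch. The function below is illustrative, not Glitcher's actual API, and uses a synthetic embedding matrix (a list of vectors) as a stand-in for a real model's embedding table; the `fraction` cutoff is an assumed parameter.

```python
import math
import random

def l2_norm(vec):
    """Plain L2 norm of one embedding row."""
    return math.sqrt(sum(x * x for x in vec))

def low_norm_candidates(embeddings, fraction=0.01):
    """Flag token ids whose embedding L2 norm is in the bottom `fraction` of the vocab.

    Low-norm rows are likely under-trained: they received few gradient
    updates during pretraining and stayed close to their initialization.
    """
    norms = sorted((l2_norm(row), tid) for tid, row in enumerate(embeddings))
    cutoff = max(1, int(len(norms) * fraction))
    return {tid for _, tid in norms[:cutoff]}

# Toy demonstration with a stand-in "embedding matrix":
rng = random.Random(0)
emb = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(1000)]
emb[42] = [x * 0.01 for x in emb[42]]   # simulate an under-trained row
print(42 in low_norm_candidates(emb))   # → True
```

On a real model the same idea applies to the rows of the input embedding matrix; the clustering signal then re-ranks survivors by distance to well-trained neighbors.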
6.2 Behavioral characterization (controlled probing)
Behavioral probes are designed to be:
- low risk (do not request harmful content),
- repeatable, and
- diagnostic (help classify the anomaly type).
Glitcher classifies confirmed glitch tokens into the following categories:
| Category | Description |
|---|---|
| Injection | Token triggers prompt injection or jailbreak-like behavior |
| IDOS | Token causes infinite generation or denial-of-service loops |
| Hallucination | Token produces nonsensical or fabricated output |
| Disruption | Token disrupts internal reasoning or chain-of-thought |
| Bypass | Token enables filter or guardrail bypass |
| EmailExtraction | Token disrupts email address extraction from model output |
| DomainExtraction | Token disrupts domain name extraction from model output |
| ControlCharConfusion | Model interprets raw control character bytes semantically during extraction |
| EncodedCharConfusion | Model decodes text representations of control characters (e.g. \x08) and applies their semantics |
Each category has dedicated test templates with indicator functions that check model responses for category-specific anomaly patterns.
6.2.1 ASR validation methodology
Glitcher uses Attack Success Rate (ASR) as its primary confirmation metric. Rather than relying on a single probe response, the enhanced validation module:
- Runs multiple independent generation attempts per token (configurable, typically 3--10).
- Generates 30--100+ tokens per attempt to observe downstream behavioral effects.
- Computes ASR as the fraction of attempts where anomalous behavior is detected.
- Applies a configurable threshold (default: 0.5) to confirm a token as glitchy.
This multi-attempt approach reduces false positive rates from ~30--50% (single-probe methods) to ~5--15%, providing higher confidence in reported findings.
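The multi-attempt confirmation logic reduces to a small amount of code. The sketch below is illustrative (function names and the toy anomaly indicator are assumptions, not Glitcher's internals); the 0.5 threshold matches the default stated above.

```python
def attack_success_rate(attempt_outputs, is_anomalous):
    """ASR = fraction of generation attempts flagged anomalous by the indicator."""
    flags = [is_anomalous(out) for out in attempt_outputs]
    return sum(flags) / len(flags)

def confirm_glitch(attempt_outputs, is_anomalous, threshold=0.5):
    """Confirm a candidate token only if its ASR meets the configured threshold."""
    return attack_success_rate(attempt_outputs, is_anomalous) >= threshold

# Toy indicator: treat empty or highly repetitive completions as anomalous.
def looks_anomalous(text: str) -> bool:
    words = text.split()
    return len(words) == 0 or len(set(words)) < len(words) / 2

attempts = ["", "the the the the", "a normal sentence about foxes", "", "x x x x x x"]
print(confirm_glitch(attempts, looks_anomalous))   # → True (ASR = 0.8 ≥ 0.5)
```

In practice each category supplies its own indicator function (Section 6.2), and each attempt generates 30--100+ tokens before the indicator runs.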
6.3 Optional deep probes (internal telemetry)
Where you can instrument the model (open weights or research environment), Glitcher can compute signals such as:
- attention distribution deviation: e.g., divergence from baseline patterns on matched-length "normal" inputs,
- intermediate representation drift: layerwise activation statistics compared to baselines,
- PCA-based feature extraction: compact summaries for classification/ranking.
6.4 Mining for additional glitch tokens (behavior-driven search)
Heuristic detectors can miss tokens that are not obvious outliers. Glitcher's mining implementation uses entropy-guided gradient search:
- Objective: Maximize predictive entropy---tokens that cause the model to be maximally uncertain about the next token are strong glitch candidates.
- Batch evaluation: Tokens are evaluated in configurable batch sizes (default: 8) across multiple iterations (default: 50), with the top-k highest-entropy tokens retained per batch.
- Multi-prompt validation: Candidates are tested across multiple prompt templates to filter tokens that only trigger anomalies in narrow contexts.
- Enhanced validation gate: Surviving candidates undergo ASR validation (see 6.2) to confirm reliability before inclusion in the final token list.
- Resumable progress: Mining state is checkpointed to glitch_progress.json, allowing long runs to be interrupted and resumed.
This phase is computationally heavier and is typically used for targeted investigations or model onboarding.
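The entropy objective itself is simple to state: the Shannon entropy of the softmax over next-token logits. The helper below is a sketch operating on a plain list of logits, not Glitcher's internal implementation.

```python
import math

def predictive_entropy(logits):
    """Shannon entropy (in nats) of the softmax over next-token logits.

    High entropy after inserting a candidate token means the model is
    maximally uncertain about what comes next---a behavioral signature
    that the miner uses as its search objective.
    """
    m = max(logits)                                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

# A peaked distribution has low entropy; a flat one approaches log(V).
print(predictive_entropy([10.0, 0.0, 0.0, 0.0]) < predictive_entropy([0.0, 0.0, 0.0, 0.0]))  # → True
```

The miner's gradient step approximates how each candidate token moves this quantity, then re-evaluates the top batch exactly.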
6.5 Encoded character confusion testing
A separate class of model vulnerability involves control characters---not as raw bytes in the token stream, but as text representations that the model "mentally decodes" during structured extraction tasks.
Motivation
In practice, most LLM input surfaces (web forms, APIs, chat interfaces) sanitize or reject raw control character bytes. An attacker cannot typically send 0x08 (backspace) directly. However, an attacker can send the literal text \x08, 0x08, %08, \b, ^H, or BACKSPACE---and the model may interpret these text strings as if they were actual control characters, applying their semantics during extraction.
For example, given the input L: Adminnn\x08\x08 P: ASCII0x08, the model should extract the username as the literal string Adminnn\x08\x08. A confused model instead "applies" the backspaces and returns Admin---silently transforming the extracted data.
Test matrix design
The EncodedCharTester evaluates a matrix of four dimensions:
| Dimension | Values | Count |
|---|---|---|
| Target characters | backspace, tab, newline, carriage return, null, escape, delete | 7 |
| Encoding formats | \x08, 0x08, %08, \u0008 (lower- and upper-case variants), &#x08;, &#8;, ^H, \b, BACKSPACE | 13 |
| Reinforcer strings | hex reference (ASCII0x08), name reference (BACKSPACE), escape reference (\b), instruction text, none | 5 |
| Extraction scenarios | credential (L:/P:), JSON field, log line, CSV columns | 4 |
This produces approximately 1,800 test combinations per model. Each test presents the model with a structured extraction prompt containing the encoded character as literal text, then checks whether the model extracted the text faithfully or "decoded" it.
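Enumerating the matrix is a Cartesian product over the four dimensions. The values below are abbreviated stand-ins (three characters, three encodings, two reinforcers) rather than the full suite, so the counts are smaller than 1,800; the dictionary keys mirror the format names used in the results tables.

```python
from itertools import product

# Illustrative slices of the four matrix dimensions (abbreviated; the full
# suite covers 7 characters x ~13 encodings x 5 reinforcers x 4 scenarios).
target_chars = ["backspace", "tab", "newline"]
encodings = {"hex_lower": r"\x08", "caret": "^H", "ascii_name": "BACKSPACE"}
reinforcers = ["hex_reference", "none"]
scenarios = ["credential", "json_field", "log_line", "csv"]

tests = [
    {"char": c, "encoding": fmt, "reinforcer": r, "scenario": s}
    for (c, (fmt, _literal), r, s) in product(
        target_chars, encodings.items(), reinforcers, scenarios
    )
]
print(len(tests))  # → 72 (3 x 3 x 2 x 4)
```

Each generated combination is rendered into a scenario-specific extraction prompt, and the response is scored by whether the encoded text survived literally.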
Empirical results: Llama 3.2 1B Instruct
Running the full 1,800-combination matrix against meta-llama/Llama-3.2-1B-Instruct produced the following results:
Overall confusion rate: 84% (1,509 / 1,800 tests)
By extraction scenario:
| Scenario | Confusion rate |
|---|---|
| Log line parsing | 100% |
| CSV column extraction | 91% |
| Credential extraction | 82% |
| JSON field extraction | 63% |
Log line parsing proved most vulnerable---the model decoded the encoded character in every single test combination. JSON field extraction was most resilient, likely because the structured JSON context provides stronger anchoring for literal extraction.
By encoding format:
| Encoding format | Example | Confusion rate |
|---|---|---|
| hex_lower | \x08 | 100% |
| hex_upper | \x08 | 100% |
| c_escape | \b | 100% |
| unicode_lower | \u0008 | 100% |
| unicode_upper | \u0008 | 100% |
| unicode_html | &#x08; | 100% |
| unicode_decimal | &#8; | 97% |
| caret | ^H | 94% |
| url_encode | %08 | 74% |
| url_encode_upper | %08 | 72% |
| ascii_name | BACKSPACE | 60% |
| hex_bare | 0x08 | 52% |
| hex_bare_upper | 0x08 | 43% |
Escape-style encodings (\x, \u, \b, &#x) achieve 100% confusion. These are the formats the model most likely encountered during training in source code and markup, meaning the model has learned to "understand" them---and applies that understanding even when instructed to extract literally. Bare hex (0x08) and plain names (BACKSPACE) are less effective but still confuse the model in the majority of cases.
Reinforcer influence: Companion text (e.g., ASCII0x08 in the password field alongside \x08 in the username) adds a +4.1% confusion rate delta compared to no reinforcer, demonstrating a measurable priming effect.
Most common failure modes:
| Issue | Count |
|---|---|
| literal_extraction_failure | 1,057 |
| encoding_decoded | 575 |
| field_confusion | 548 |
| truncation | 458 |
| message_extraction_failure | 408 |
| json_parse_failure | 340 |
The encoding_decoded label specifically indicates that the model shortened a string as if the encoded control character were applied (e.g., Adminnn\x08\x08 extracted as Admin).
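Detecting the encoding_decoded failure mode amounts to checking whether the extracted value equals the input with the control-character semantics applied. The sketch below handles only the literal \x08 (backspace) form for clarity; the function names are illustrative, not Glitcher's.

```python
def apply_backspaces(s: str) -> str:
    """Simulate what a confused model does: treat the literal text '\\x08'
    as an actual backspace, deleting the preceding character."""
    out = []
    i = 0
    while i < len(s):
        if s.startswith(r"\x08", i):
            if out:
                out.pop()          # "apply" the backspace
            i += 4                 # skip the 4-character escape text
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

def encoding_decoded(expected_literal: str, extracted: str) -> bool:
    """Flag the encoding_decoded failure mode: extraction matches the string
    with the encoded control character applied, not the literal text."""
    return extracted != expected_literal and extracted == apply_backspaces(expected_literal)

print(encoding_decoded(r"Adminnn\x08\x08", "Admin"))  # → True
```

A faithful extraction (returning the literal string unchanged) passes; any other mismatch falls into a different failure bucket such as truncation or field confusion.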
Interaction with glitch tokens
The encoded character confusion test also operates in token mode, where a glitch token is inserted into the extraction prompt alongside encoded characters. This tests the double-reinforcement hypothesis: a glitch token whose decoded text resembles a control character name (e.g., a token containing ASCII0x08) may amplify the model's tendency to decode the text representation. The --encoded-char-only CLI flag runs this mode against a supplied token file.
Security implications
This finding has practical implications for any system that uses an LLM to extract, parse, or validate structured data from user-supplied input:
- Data exfiltration via extraction manipulation: An attacker who controls part of the input can use encoded control characters to cause the model to silently transform extracted values---changing usernames, hostnames, or field contents in ways that bypass literal-matching validation.
- Log injection and forensic manipulation: In log parsing pipelines, encoded characters can cause the model to merge, split, or truncate fields, potentially obscuring evidence of malicious activity.
- CSV/structured data corruption: Encoded characters in CSV or tabular data can cause field confusion, where values bleed across column boundaries after the model "applies" the encoded character.
These attacks require no special tokens or model-specific knowledge---only awareness that the target system uses an LLM for extraction and that the model is likely to decode common escape sequences.
7. Adversarial token combination search
Beyond detecting individual glitch tokens, Glitcher implements an evolutionary search for combinations of tokens that amplify disruptive effects or steer model outputs toward specific targets.
7.1 The GeneticProbabilityReducer
The GeneticProbabilityReducer maintains a population of token combinations (1--N tokens each, configurable via --max-tokens) and evolves them over generations using tournament selection, single-point crossover, and mutation. Each individual's fitness is evaluated by measuring its effect on the model's next-token probability distribution.
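The three evolutionary operators named above can be sketched compactly. These helpers are a minimal illustration of tournament selection, single-point crossover, and per-position mutation over token-id lists; they are not the GeneticProbabilityReducer's actual code.

```python
import random

def tournament_select(population, fitness, k=3):
    """Pick the fittest of k randomly sampled individuals."""
    return max(random.sample(population, k), key=fitness)

def crossover(a, b):
    """Single-point crossover of two token-id combinations."""
    if min(len(a), len(b)) < 2:
        return a[:]
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

def mutate(combo, vocab_size, rate=0.1):
    """Replace each position with a random token id with probability `rate`."""
    return [random.randrange(vocab_size) if random.random() < rate else t
            for t in combo]
```

A generation then consists of repeatedly selecting two parents, crossing them over, mutating the child, and evaluating its fitness against the model's next-token distribution.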
7.2 Operating modes
Single-objective: probability reduction
The default mode maximizes the reduction of the model's probability for the most likely next token. Given a base text (e.g., "The quick brown"), the algorithm measures how much prepending a token combination reduces the probability of the naturally predicted next token (e.g., "fox").
glitcher genetic MODEL --base-text "The quick brown" --generations 50
Multi-objective: reduce + steer
When a --wanted-token is specified, the fitness function combines two objectives: reducing the probability of the original prediction and increasing the probability of the wanted token. This enables targeted output steering---finding token sequences that cause the model to produce a specific token it would not otherwise generate.
glitcher genetic MODEL --wanted-token "fox" --comprehensive-search
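The two-objective structure can be written as a single scalar fitness. The weighting below is an assumption for illustration; the exact combination used by the GeneticProbabilityReducer may differ.

```python
def combined_fitness(p_target_before, p_target_after,
                     p_wanted_after=None, wanted_weight=1.0):
    """Illustrative fitness for a token combination.

    Rewards reducing the original top token's probability and, in
    multi-objective mode, raising the wanted token's probability.
    """
    reduction = p_target_before - p_target_after
    if p_wanted_after is None:
        return reduction                          # single-objective mode
    return reduction + wanted_weight * p_wanted_after
```

For example, a combination that drops the original prediction from 0.9 to 0.3 while lifting the wanted token to 0.5 scores higher than one that only achieves the reduction.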
Comprehensive vocabulary search
The --comprehensive-search flag scans the full model vocabulary (not just known glitch tokens) to find optimal seeds for the genetic algorithm. Results are cached in cache/comprehensive_search/ with content-addressed keys for reuse across runs.
7.3 Population seeding strategies
The genetic algorithm uses multiple seeding strategies to balance exploration and exploitation:
- Baseline-guided seeding: Individual token impact is measured against the target probability, and the highest-impact tokens seed the initial population. Results are cached in token_impact_baseline.json.
- Glitch token seeding: Known glitch tokens from prior mining runs are included as initial population members.
- Random seeding: A portion of the population is initialized randomly to maintain diversity.
- Hybrid combinations: High-impact individuals are combined with known glitch tokens to create hybrid seeds.
7.4 Adaptive mutation and diversity injection
To prevent premature convergence, the algorithm employs:
- Adaptive mutation rates: Mutation probability increases when population fitness stagnates.
- Diversity injection: New random individuals are periodically injected into the population when diversity drops below a threshold.
- ASCII-only filtering: An optional --ascii-only flag restricts the search to ASCII tokens, useful when targeting models that handle Unicode poorly.
7.5 Real-time visualization
The genetic algorithm supports real-time visualization through two interfaces:
- Matplotlib animation (gui_animator.py): Live fitness plots, population diversity charts, and best-individual tracking.
- Tkinter GUI (gui_controller.py): Full control panel with configuration, progress monitoring, and results tabs. Activated via --gui.
8. Cross-provider validation
Glitcher includes a multi-provider framework for validating glitch token findings across different LLM deployments.
8.1 Provider registry
The poc/providers/ module implements a registry pattern with a common BaseProvider abstract class. Each provider exposes a uniform interface for sending prompts and collecting responses, enabling automated cross-provider comparison.
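A registry pattern of this shape can be sketched as follows. The method name `complete` and the decorator-based registration are illustrative assumptions; see poc/providers/ for Glitcher's actual signatures. The EchoProvider exists only to make the sketch runnable.

```python
from abc import ABC, abstractmethod

class BaseProvider(ABC):
    """Common provider interface (sketch; method names are illustrative)."""

    name: str = "base"

    @abstractmethod
    def complete(self, prompt: str, max_tokens: int = 100) -> str:
        """Send a prompt and return the raw completion text."""

PROVIDERS: dict = {}

def register(cls):
    """Class decorator that adds a provider implementation to the registry."""
    PROVIDERS[cls.name] = cls
    return cls

@register
class EchoProvider(BaseProvider):
    """Stand-in provider used here for demonstration only."""
    name = "echo"

    def complete(self, prompt: str, max_tokens: int = 100) -> str:
        return prompt[:max_tokens]

print("echo" in PROVIDERS)  # → True
```

The comparison harness then iterates over registered providers, sends the same probe prompts to each, and diffs the anomaly indicators across responses.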
8.2 Supported providers
| Provider | Module | Notes |
|---|---|---|
| OpenAI | openai_provider.py | GPT-4, GPT-3.5-turbo, etc. |
| Anthropic | anthropic_provider.py | Claude model family |
| Mistral | mistral.py | Mistral models |
| Lambda Labs | lambda_ai.py | Lambda-hosted models |
| OpenRouter | openrouter_provider.py | Multi-provider routing |
| Local HuggingFace | transformers_provider.py | Local models with quantization |
8.3 Quantization support
The local transformers provider supports multiple quantization levels for testing how precision affects glitch token behavior:
- int4: 4-bit quantization via bitsandbytes (lowest memory, fastest inference)
- int8: 8-bit quantization via bitsandbytes
- fp16: Half-precision floating point
- bf16: Brain floating point (bfloat16)
This enables researchers to determine whether a glitch token finding is robust across precision levels or is an artifact of quantization.
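A sketch of mapping these levels to HuggingFace from_pretrained options, under the assumption that the int4/int8 paths go through bitsandbytes and the fp16/bf16 paths set the torch dtype; Glitcher's actual plumbing may differ, and real code would construct a transformers BitsAndBytesConfig rather than pass bare flags.

```python
def quantization_kwargs(level: str) -> dict:
    """Map a quantization level name to illustrative from_pretrained kwargs.

    Sketch only: the int4/int8 entries stand in for a bitsandbytes
    quantization config, and the dtype strings stand in for torch dtypes.
    """
    table = {
        "int4": {"load_in_4bit": True},        # bitsandbytes 4-bit
        "int8": {"load_in_8bit": True},        # bitsandbytes 8-bit
        "fp16": {"torch_dtype": "float16"},    # half precision
        "bf16": {"torch_dtype": "bfloat16"},   # brain floating point
    }
    return table[level]
```

Running the same token probes at each level then shows whether an anomaly persists at full precision or only appears under aggressive quantization.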
9. Mitigation strategies
Mitigation depends on your deployment constraints and how much control you have over the model.
9.1 Runtime filtering and normalization (production-friendly)
For many systems, the most practical mitigations are:
- Input normalization: canonicalize unicode, strip/normalize control characters, enforce a safe encoding path.
- Token-aware filtering: block or transform known high-risk tokens (or short sequences) at the text level when feasible.
- Runtime monitoring: detect abnormal output patterns (e.g., truncation loops, repeated phrases) and trigger retries/fallbacks.
These mitigations can be applied without changing model weights.
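An input-normalization layer along these lines can both strip raw control bytes and flag the text-encoded forms from Section 6.5. The pattern set below is illustrative and deliberately non-exhaustive; a production filter would cover more formats and decide policy (reject, escape, or log) per application.

```python
import re

# Patterns for text-encoded control characters (illustrative, not exhaustive).
ENCODED_CONTROL = re.compile(
    r"""\\x[0-9a-fA-F]{2}        # \x08 style hex escapes
      | \\u00[0-1][0-9a-fA-F]    # \u0008 style unicode escapes
      | &\#x?[0-9a-fA-F]{1,4};   # &#8; / &#x8; HTML entities
      | %0[0-9a-fA-F]            # %08 URL encoding of control range
      | \^[A-Z]                  # ^H caret notation
    """,
    re.VERBOSE,
)

def sanitize(text: str) -> tuple:
    """Strip raw control bytes and report whether encoded forms were present."""
    flagged = bool(ENCODED_CONTROL.search(text))
    cleaned = "".join(ch for ch in text if ch >= " " or ch in "\n\t")
    return cleaned, flagged

print(sanitize(r"L: Adminnn\x08\x08")[1])  # → True (encoded backspaces detected)
```

Flagged inputs can be routed to stricter handling (literal-only extraction, human review) rather than silently passed to the model.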
9.2 Token vocabulary interventions (research / controlled deployments)
Where permissible:
- Vocabulary pruning / remapping: remove or remap problematic tokens (requires careful compatibility handling).
- Targeted fine-tuning: train the model on safe tasks that include the flagged tokens, improving representation quality.
- Tokenizer retraining: rebuild tokenization on a corpus aligned with training distribution (rarely feasible for deployed models).
9.3 Intermediate-layer "repair" (advanced research)
Some approaches attempt to mitigate the destructive effects of glitch tokens by altering intermediate computations. This is best treated as research until validated for your environment, with careful measurement of side effects and regressions.
10. Evaluation methodology
Glitcher is intended to support reproducible evaluation across:
- models: different architectures and families,
- tokenizers: BPE variants, SentencePiece/unigram, etc.,
- prompt templates: neutral probes, controlled formatting tasks,
- settings: temperature/top-p, max tokens, stop sequences, etc.
10.1 Recommended experiment design
A minimal evaluation plan:
- Select a model + tokenizer.
- Run candidate generation on the full vocabulary (glitcher mine).
- Select top-k candidates per signal (plus a random control set).
- Run behavioral probes with repeated trials and ASR validation.
- Classify confirmed tokens (glitcher classify).
- Evolve adversarial combinations (glitcher genetic).
- Validate across providers where applicable.
- Compute precision/recall against a labeled subset (human-reviewed).
- Validate mitigations via regression tests.
10.2 Suggested metrics
- ASR (Attack Success Rate): Fraction of multi-attempt trials where a token triggers anomalous behavior, ranging from 0.0 (never anomalous) to 1.0 (always anomalous). Default confirmation threshold: 0.5.
- L2 norm: Embedding vector magnitude---low values correlate with under-training.
- Predictive entropy: Model uncertainty when processing the token---high values indicate instability.
- Token impact baseline: Per-token probability reduction measured against the default next-token prediction, used to rank individual token effectiveness.
- Fitness score: Combined metric used by the genetic algorithm, incorporating target probability reduction and (optionally) wanted-token probability increase.
- Detection metrics: Precision, recall, F1 on labeled candidates.
- Query efficiency: Number of model calls per discovered glitch token.
- Behavioral severity: Weighted score across refusal/truncation/instability measures.
- Mitigation efficacy: Reduction in severity and incidence without harming normal performance.
10.3 Example results table (template)
| Model | Tokenizer | Candidates scanned | Glitch tokens confirmed | ASR threshold | Precision@k | Recall@k | Notes |
|---|---|---|---|---|---|---|---|
| [MODEL_A] | [TOK_A] | [N] | [M] | [ASR] | [P@k] | [R@k] | [short note] |
| [MODEL_B] | [TOK_B] | [N] | [M] | [ASR] | [P@k] | [R@k] | [short note] |
11. Responsible release and disclosure
Glitch token research can materially improve safety, but it can also be misused if published irresponsibly. Recommended practices:
- Avoid releasing raw token lists for high-impact proprietary models unless there is a clear defensive benefit and mitigations exist.
- Redact or hash sensitive tokens in public reports and provide details privately to affected maintainers.
- Coordinate disclosure with model providers when you identify an exploitable or high-severity behavior.
- Document safe test prompts and avoid publishing prompts that elicit harmful content.
- Include mitigation guidance with any findings.
12. Limitations and future work
12.1 Limitations
- Glitch token definitions are inherently behavioral, and behavior depends on decoding settings, system prompts, and serving stack details.
- Some mitigations (e.g., retraining tokenizers, intermediate-layer repair) are non-trivial and require careful validation.
- Closed model APIs may restrict observability and can change over time.
- The genetic combination search is computationally expensive and results are model-specific---combinations that steer one model may not transfer to another.
- Cross-provider validation depends on API availability and rate limits, which can constrain large-scale reproducibility studies.
12.2 Future work
- Better ground truth datasets for glitch tokens across model families.
- Standard benchmarks for tokenizer-induced reliability failures.
- Safer public reporting formats that maximize defensive value without increasing abuse.
- Transfer learning for genetic algorithm results across related model families.
- Cross-model encoded character confusion comparison---measuring whether larger models or instruction-tuned variants are more or less susceptible to decoding text-encoded control characters.
- Integration with automated red-teaming and continuous monitoring pipelines.
13. Conclusion
Glitch tokens represent a practical, underappreciated interface between tokenization and model behavior. As LLMs become embedded in critical workflows, token-level anomalies become reliability and security concerns---not curiosities.
Glitcher provides a reproducible toolkit and methodology to detect, characterize, and mitigate glitch tokens, combining multiple complementary research approaches into a single workflow. Beyond individual token detection, the evolutionary combination search demonstrates that carefully constructed token sequences can amplify disruption or steer model outputs---capabilities that defenders must understand to build effective guardrails. The encoded character confusion findings further demonstrate that models actively decode text representations of control characters during structured extraction, creating a class of data manipulation vulnerabilities that requires no special tokens or model-specific knowledge to exploit. The result is a defensible, measurable way to reduce tokenizer-induced instability and improve trust in LLM deployments.
References
- Yuxi Li, Yi Liu, Gelei Deng, Ying Zhang, Wenjia Song, Ling Shi, Kailong Wang, Yuekang Li, Yang Liu, Haoyu Wang. "Glitch Tokens in Large Language Models: Categorization Taxonomy and Effective Detection." arXiv:2404.09894 (2024). DOI: 10.48550/arXiv.2404.09894.
- Sander Land, Max Bartolo. "Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models." arXiv:2405.05417 (2024). DOI: 10.48550/arXiv.2405.05417. (Accepted at EMNLP 2024, main track.)
- Zhibo Zhang, Wuxia Bai, Yuxi Li, Mark Huasong Meng, Kailong Wang, Ling Shi, Li Li, Jun Wang, Haoyu Wang. "GlitchProber: Advancing Effective Detection and Mitigation of Glitch Tokens in Large Language Models." arXiv:2408.04905 (2024). DOI: 10.48550/arXiv.2408.04905. (Related DOI: 10.1145/3691620.3695060.)
- Zihui Wu, Haichang Gao, Ping Wang, Shudong Zhang, Zhaoxiang Liu, Shiguo Lian. "GlitchMiner: Mining Glitch Tokens in Large Language Models via Gradient-based Discrete Optimization." arXiv:2410.15052 (2024--2025). DOI: 10.48550/arXiv.2410.15052.
- [Optional / practitioner reference] garak documentation, glitch token probes module (for integrating glitch-token testing into broader LLM vulnerability scanning). [Add link here if you want to cite it.]
Appendix A: Report schema
Below is the structure used by Glitcher's mining output (glitch_tokens.json):
```json
{
  "model_path": "meta-llama/Llama-3.2-1B-Instruct",
  "glitch_tokens": [
    "useRalative",
    "PostalCodesNL",
    "ilmektedir"
  ],
  "glitch_token_ids": [
    89472,
    127438,
    115107
  ],
  "total_iterations": 50,
  "runtime_seconds": 61.57
}
```

The genetic algorithm produces genetic_results.json with evolved combination details:

```json
{
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "base_text": "The quick brown",
  "target_token": "fox",
  "wanted_token": null,
  "best_combination": {
    "tokens": [
      89472,
      127438
    ],
    "fitness": 0.95,
    "baseline_prob": 0.82,
    "modified_prob": 0.04,
    "target_reduction": 0.78
  },
  "generations": 50,
  "population_size": 100
}
```

Individual token effectiveness is ranked in token_impact_baseline.json.
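Downstream tooling can sanity-check a mining report against this schema before consuming it. The helper below is a sketch of such a check (field names come from the schema above; the validation logic itself is an assumption, not part of Glitcher):

```python
import json

# Keys required by the glitch_tokens.json structure shown above.
REQUIRED_KEYS = {"model_path", "glitch_tokens", "glitch_token_ids",
                 "total_iterations", "runtime_seconds"}

def validate_mining_report(report: dict) -> list:
    """Return a list of schema problems; an empty list means the report
    matches the expected mining-output structure."""
    problems = []
    missing = REQUIRED_KEYS - report.keys()
    if missing:
        problems.append("missing keys: %s" % sorted(missing))
    tokens = report.get("glitch_tokens", [])
    ids = report.get("glitch_token_ids", [])
    if len(tokens) != len(ids):
        problems.append("glitch_tokens and glitch_token_ids lengths differ")
    if not all(isinstance(i, int) for i in ids):
        problems.append("glitch_token_ids must be integers")
    return problems

def load_mining_report(path: str) -> dict:
    """Load and validate a glitch_tokens.json file."""
    with open(path) as f:
        report = json.load(f)
    problems = validate_mining_report(report)
    if problems:
        raise ValueError("; ".join(problems))
    return report
```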
Appendix B: Reproducibility checklist
- Record model + tokenizer identifiers and exact versions.
- Log decoding settings (temperature/top-p/max tokens/stop sequences).
- Fix random seeds where supported; record non-deterministic components.
- Keep a stable set of neutral probe templates.
- Include a random control sample of non-candidate tokens.
- Document hardware/runtime differences if using open weights (GPU vs CPU, kernels, quantization).
- Record quantization level (int4/int8/fp16/bf16) when using local models.
- Document ASR threshold and number of validation attempts.
- Redact sensitive tokens in public reports; keep raw artifacts private.
- Include genetic algorithm parameters (population size, generations, mutation rate) if using combination search.
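One way to act on this checklist is to emit a machine-readable manifest alongside each run. The sketch below shows one possible shape; the field names are illustrative and not a Glitcher output format.

```python
import platform
import random
import time
from typing import Optional

def run_manifest(model_id: str, tokenizer_id: str, decoding: dict,
                 asr_threshold: float, attempts: int, seed: int,
                 genetic_params: Optional[dict] = None) -> dict:
    """Assemble a reproducibility manifest covering the checklist items:
    identifiers, decoding settings, seeds, ASR parameters, environment."""
    random.seed(seed)  # fix seeds where supported
    manifest = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "model": model_id,
        "tokenizer": tokenizer_id,
        "decoding": decoding,  # temperature/top-p/max tokens/stop sequences
        "asr_threshold": asr_threshold,
        "validation_attempts": attempts,
        "seed": seed,
        "platform": platform.platform(),
        "python": platform.python_version(),
    }
    if genetic_params:
        # population size, generations, mutation rate, etc.
        manifest["genetic"] = genetic_params
    return manifest
```

Writing this dictionary to JSON next to the raw artifacts makes it straightforward to diff two runs when a result fails to replicate.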