Full-Vocabulary Glitch Token Census and ASR Validation Methodology Correction
Version: 1.0
Date: 2026-02-27
Author: Jeremy (Independent Security Researcher)
Repository: https://github.com/binaryninja/glitcher
License: MIT
Companion to: Glitcher: A Research Toolkit for Detecting, Characterizing, and Steering Glitch Tokens in Large Language Models (v2.0, 2026-02-25)
Note: This paper is a follow-up to the Glitcher v2.0 whitepaper. It reports findings from a full-vocabulary scan, documents a critical methodology bug in ASR validation, and presents corrected results with a reproducibility framework.
Abstract
The original Glitcher whitepaper described entropy-guided gradient search as the primary method for discovering glitch tokens. This follow-up reports three findings from extended experimentation with Llama 3.2 1B Instruct:
- Entropy mining discovers only a fraction of glitch tokens. A brute-force scan of the full 128K-token vocabulary found 4,954 candidates (nearly 10x the 516 found by entropy-guided search), revealing that most glitch tokens do not cluster in low-L2-norm embedding neighborhoods.
- A critical bug in ASR validation produced misleading results. The enhanced validation module used greedy decoding (do_sample=False) for all generation attempts, so every attempt within a multi-attempt trial produced identical output. This made ASR strictly binary (0% or 100%), masking the true stability distribution. After correction (sampling with temperature=0.7 for multi-attempt runs), the ASR distribution is continuous: 692 tokens at 100%, 231 between 70--99%, 208 between 30--69%, and 222 between 10--29% (below the confirmation threshold).
- A reproducibility framework (APPENDIX-B) now ships with the toolkit, including environment snapshots, seed fixing, probe template versioning, token redaction, and automatic environment metadata injection into all result files.
Table of contents
- 1. Motivation
- 2. Experimental setup
- 3. Full-vocabulary range scan
- 4. The ASR validation bug
- 5. Corrected ASR results
- 6. Token stability taxonomy
- 7. Implications for the original paper
- 8. Reproducibility framework
- 9. Updated recommendations
- 10. Limitations
1. Motivation
The Glitcher v2.0 whitepaper reported 516 glitch tokens discovered via entropy-guided gradient search in Llama 3.2 1B Instruct, with ASR validation confirming 514 of them at 100% ASR. Two questions prompted this follow-up:
- Coverage: Entropy mining explores embedding neighborhoods around low-L2-norm seeds. How many glitch tokens exist outside these neighborhoods?
- Distribution: A strictly bimodal ASR distribution (every token at exactly 100% or 0%, nothing in between) across 516 tokens is statistically suspicious for a behavioral metric. Is the measurement methodology sound?
Both questions led to significant findings.
2. Experimental setup
2.1 Hardware and software
| Component | Version |
|---|---|
| Model | meta-llama/Llama-3.2-1B-Instruct |
| GPU | NVIDIA GeForce RTX 5090 (32 GB) |
| CUDA | 12.8 |
| PyTorch | 2.12.0.dev20260225+cu128 |
| Transformers | 5.2.0 |
| Quantization | bfloat16 |
| Random seed | 42 (torch, numpy, python) |
2.2 Methodology
Three mining approaches were compared:
| Approach | Method | Tokens examined | Parameters |
|---|---|---|---|
| Entropy mining (original) | Gradient-guided embedding search | 400 | 50 iterations, k=32, batch=8, ASR >= 0.5 |
| Entropy mining (extended) | Same, wider search | 800 | 100 iterations, k=64, batch=8, ASR >= 0.1 |
| Full-vocabulary range scan | Brute-force every token ID | 128,000 | 3 attempts, ASR >= 0.1, 100 validation tokens |
All candidates were then re-tested with a rigorous ASR protocol:
| Parameter | Value |
|---|---|
| Probe templates | 3 (repeat-back, meaning-query, instruction-repeat) |
| Attempts per token | 10 |
| Max tokens per attempt | 200 |
| ASR confirmation threshold | 0.3 |
| Decoding (post-fix) | temperature=0.7, top_p=0.9 |
3. Full-vocabulary range scan
3.1 Results
| Source | Candidates | Unique token texts | Overlap with entropy mining |
|---|---|---|---|
| Entropy mining (original) | 516 | 257 | -- |
| Entropy mining (extended) | 574 | 286 | 252 with original |
| Full-vocabulary scan | 4,954 | 3,753 | 253 with original |
The full-vocabulary scan found 3,467 token texts not discovered by either entropy mining run. At a 3.9% hit rate across the vocabulary, approximately 1 in 25 tokens in Llama 3.2's vocabulary triggered the initial glitch detection filter.
3.2 Why entropy mining misses tokens
Entropy-guided search works by following gradient signals in embedding space from low-L2-norm seed tokens. This strategy is effective for tokens that cluster near embedding-space anomalies but misses glitch tokens that:
- have normal L2 norms but anomalous behavioral properties,
- are isolated in embedding space (no nearby glitch neighbors to chain through),
- consist of whitespace, formatting, or code-structural patterns that don't register as low-norm outliers.
The full-vocabulary scan makes no assumptions about embedding structure and tests every token independently, eliminating this blind spot at the cost of higher compute.
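The scan's structure is simple enough to sketch. The snippet below is a minimal stand-in, not the toolkit's actual implementation: probe_token is a hypothetical per-token glitch probe (the real scan drives model.generate against the probe templates), and the toy probe at the end merely mimics the roughly 1-in-25 hit rate reported above.

```python
# Minimal sketch of a brute-force range scan. probe_token is a hypothetical
# stand-in for a real per-token glitch probe.
def range_scan(vocab_size, probe_token, attempts=3, asr_threshold=0.1):
    candidates = []
    for token_id in range(vocab_size):
        # Count how many of the probe attempts flagged the token as glitchy.
        hits = sum(probe_token(token_id) for _ in range(attempts))
        if hits / attempts >= asr_threshold:
            candidates.append(token_id)
    return candidates

# Toy probe: pretend every 25th token glitches, echoing the ~1-in-25 hit rate.
found = range_scan(1000, lambda tid: tid % 25 == 0)
```

The real scan differs only in scale (128,000 token IDs) and in what the probe does; the filtering logic is the same.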
4. The ASR validation bug
4.1 Discovery
When the 4,954 range-scan candidates were tested with 10-attempt ASR validation, the distribution was perfectly bimodal: 977 tokens at exactly 100% ASR and 3,977 at exactly 0%. No token fell between 0% and 100%.
A bimodal distribution with zero intermediate values across nearly 5,000 tokens is not consistent with a behavioral metric that should reflect stochastic variation in model generation. This prompted investigation of the validation code.
4.2 Root cause
The enhanced_glitch_verify function in enhanced_validation.py used do_sample=False (greedy decoding) for all model.generate() calls:
generated_ids = model.generate(
input_ids=input_ids,
max_new_tokens=max_tokens,
do_sample=False, # Use greedy decoding for consistency
...
)
Greedy decoding is fully deterministic: given the same input, the model always produces the same output. Running 10 "attempts" with greedy decoding simply repeats the identical computation 10 times. The result is always 0/10 or 10/10---never anything in between.
This bug was present in all four model.generate() call sites within the function (two for the Harmony/gpt-oss path, two for the legacy path).
4.3 Fix
When num_attempts > 1, generation now uses sampling with moderate temperature:
use_sampling = num_attempts > 1
sampling_kwargs = {
"do_sample": True,
"temperature": 0.7,
"top_p": 0.9,
} if use_sampling else {
"do_sample": False,
}
Single-attempt validation retains greedy decoding for deterministic pass/fail results. Multi-attempt ASR measurement uses sampling so each attempt can produce different output, enabling meaningful ASR computation.
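The corrected attempt loop can be sketched as follows. This is a simplified stand-in for enhanced_glitch_verify, not its actual code: generate_once and is_glitchy are hypothetical, and the toy "model" below replaces real generation so the stochastic-vs-greedy contrast is visible without a GPU.

```python
import random

def measure_asr(generate_once, is_glitchy, num_attempts=10):
    """Sample when num_attempts > 1; stay greedy for single-attempt runs."""
    kwargs = ({"do_sample": True, "temperature": 0.7, "top_p": 0.9}
              if num_attempts > 1 else {"do_sample": False})
    hits = sum(is_glitchy(generate_once(**kwargs)) for _ in range(num_attempts))
    return hits / num_attempts

# Toy "model": glitches ~70% of the time under sampling, never under greedy.
random.seed(42)
def toy_generate(**kw):
    return "GLITCH" if kw["do_sample"] and random.random() < 0.7 else "ok"

asr_multi = measure_asr(toy_generate, lambda out: out == "GLITCH")       # fractional
asr_single = measure_asr(toy_generate, lambda out: out == "GLITCH", 1)   # greedy
```

With the pre-fix behavior (greedy decoding for all attempts), asr_multi could only ever be 0.0 or 1.0; with sampling it lands anywhere in between, which is the whole point of the fix.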
4.4 Verification
A 10-token sample was tested before and after the fix:
| Token | ID | ASR (greedy, before) | ASR (sampling, after) |
|---|---|---|---|
| ' ' | 256 | 100% | 60% |
| '\r\n' | 319 | 100% | 100% |
| '="' | 429 | 100% | 40% |
| '("' | 446 | 100% | 90% |
| '\t\t\t\t' | 465 | 100% | 70% |
| replacement char | 94 | 0% | 0% |
| replacement char | 95 | 0% | 0% |
| replacement char | 96 | 0% | 0% |
| replacement char | 97 | 0% | 0% |
| replacement char | 98 | 0% | 0% |
Tokens previously reported as 100% ASR now show their true stability range (40--100%). Tokens at 0% remain at 0%, confirming the fix does not introduce false positives.
5. Corrected ASR results
5.1 Distribution
After re-running all 4,954 range-scan candidates with the corrected validation:
| ASR Range | Count | % of tested | Notes |
|---|---|---|---|
| 100% | 692 | 14.0% | Rock-solid glitch tokens |
| 90--99% | 110 | 2.2% | Near-certain |
| 70--89% | 121 | 2.4% | Reliable |
| 50--69% | 107 | 2.2% | Coin-flip |
| 30--49% | 101 | 2.0% | Borderline |
| 10--29% | 222 | 4.5% | Below threshold (not confirmed) |
| 0--9% | 3,601 | 72.7% | False positives from initial filter |
| Total confirmed (ASR >= 0.3) | 1,131 | 22.8% | |
Mean ASR across confirmed tokens: 0.862
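The binning behind the table above is straightforward to reproduce. The sketch below uses the table's own bin edges (in percent); the sample ASR values fed to it are illustrative, not the experiment's data.

```python
from collections import Counter

# Bin edges from the distribution table above, in percent, highest first.
BINS = [(100, 100), (90, 99), (70, 89), (50, 69), (30, 49), (10, 29), (0, 9)]

def bin_label(asr_pct):
    """Map a per-token ASR (as an integer percent) to its table row."""
    for lo, hi in BINS:
        if lo <= asr_pct <= hi:
            return "100" if lo == hi else f"{lo}-{hi}"
    raise ValueError(f"ASR out of range: {asr_pct}")

# Illustrative input, one token per bucket.
counts = Counter(bin_label(a) for a in (100, 95, 60, 20, 0))
```

Run over the full 4,954 candidates, this yields exactly the counts reported in the table.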
5.2 Comparison with original paper
| Metric | Original paper | This follow-up |
|---|---|---|
| Vocabulary coverage | 400 tokens (0.3%) | 128,000 tokens (100%) |
| Candidates found | 516 | 4,954 |
| Confirmed glitch tokens | 514 | 1,131 |
| ASR values observed | 0% and 100% only | Continuous 0--100% |
| Tokens with intermediate ASR | 0 | 439 (30--99%) |
| Mining method | Entropy-guided | Full-vocabulary brute-force |
| Decoding during validation | Greedy (deterministic) | Sampling (stochastic) |
5.3 Representative examples by ASR
| ASR | Example tokens |
|---|---|
| 100% | \r\n (ID 319), (" (ID 446), ',\n (ID 756) |
| 90% | );\r\n (ID 741), \r\n (ID 2591), ':\n (ID 3730) |
| 80% | }\n (ID 534), ), (ID 705), )\n\n\n (ID 3707) |
| 70% | ' ' (ID 256), =" (ID 429), ;\r\n (ID 464) |
| 60% | ");\n (ID 7468), ."\n (ID 10246), \t\t\t (ID 12133) |
| 50% | \n (ID 720), ` (ID 1595), \t\t\t\t\t\t\t (ID 2750) |
| 40% | },\n (ID 1173), \t\t (ID 6585) |
| 30% | \t (ID 197), \n (ID 198), )\n (ID 340) |
6. Token stability taxonomy
The corrected ASR distribution suggests a three-tier taxonomy of glitch tokens:
6.1 Hard glitches (ASR 100%, n=692)
Tokens that cause anomalous behavior on every generation attempt across all three probe templates. These are reliably and deterministically broken. The 692 hard glitches represent the most operationally significant findings---any system that encounters these tokens will experience degraded output.
6.2 Soft glitches (ASR 50--99%, n=338)
Tokens that cause anomalous behavior in the majority of attempts but not all. These are stochastically broken: the same token may produce normal output on one attempt and glitchy output on the next. Soft glitches are harder to detect in production (they look like intermittent failures) but still represent meaningful reliability risk.
6.3 Marginal glitches (ASR 30--49%, n=101)
Tokens that trigger anomalous behavior in a minority of attempts. These are the most context-sensitive: they may only glitch under specific prompt structures, token positions, or sampling conditions. Marginal glitches are operationally relevant for high-reliability systems where even occasional degradation is unacceptable.
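The three tiers can be expressed as a small classifier. The tier names and cut points below are taken directly from Sections 6.1--6.3; the function itself is an illustrative sketch, not part of the toolkit.

```python
def classify(asr: float):
    """Map a measured ASR (0.0-1.0) to the three-tier taxonomy."""
    if asr >= 1.0:
        return "hard"      # glitches on every attempt (Section 6.1)
    if asr >= 0.5:
        return "soft"      # glitches on most attempts (Section 6.2)
    if asr >= 0.3:
        return "marginal"  # glitches on a minority of attempts (Section 6.3)
    return None            # below the 0.3 confirmation threshold

tiers = [classify(a) for a in (1.0, 0.7, 0.3, 0.1)]
```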
6.4 Content patterns
Analysis of the 208 tokens in the borderline range (30--69% ASR) reveals distinct content categories:
| Category | Count | % | Examples |
|---|---|---|---|
| Code fragments | 69 | 33% | =", (", );, },\n |
| Non-ASCII text | 55 | 26% | Non-Latin scripts, mixed-encoding |
| Whitespace patterns | 23 | 11% | Tab sequences, mixed indent |
| Other | 61 | 29% | Mixed punctuation, formatting |
Code-structural tokens (brackets, delimiters, assignment operators combined with whitespace) are disproportionately represented in the borderline range. These tokens are common in training data but appear in highly varied contexts, potentially creating unstable internal representations.
7. Implications for the original paper
7.1 ASR claims require correction
The original paper reported ASR validation reducing false positive rates from "~30--50% (single-probe) to ~5--15%." While the multi-probe design is sound, the reported ASR values were artifacts of deterministic generation. The corrected methodology produces a richer signal but the false positive reduction claim needs to be re-evaluated with the sampling-based approach.
7.2 Entropy mining coverage is limited
The original paper presented entropy-guided mining as the primary detection method. This follow-up demonstrates that entropy mining covers less than 1% of the vocabulary and misses the majority of glitch tokens. For comprehensive coverage, full-vocabulary scanning is necessary despite higher compute cost.
7.3 Section 6.2 update
The ASR validation methodology description in Section 6.2 of the original paper should note that meaningful multi-attempt ASR requires stochastic generation. The corrected approach uses temperature=0.7, top_p=0.9 for multi-attempt runs while retaining greedy decoding for single-attempt deterministic validation.
7.4 Appendix B now has tooling
The reproducibility checklist in Appendix B of the original paper is now backed by concrete tooling in APPENDIX-B/:
| Tool | Purpose |
|---|---|
| collect_environment.py | Auto-captures model, library, GPU, and seed state |
| verify_reproducibility.py | Compares current environment against a saved snapshot |
| probe_templates.py | Frozen, versioned copy of the 3 probe templates |
| control_tokens.json | Standard non-glitch baseline tokens |
| redact.py | SHA-256 hash redaction of sensitive token text |
| reproducibility_config.json | Template with all parameter defaults |
Additionally, the CLI now supports --seed for all subcommands and automatically injects an _environment metadata block into all JSON result files.
8. Reproducibility framework
8.1 Seed fixing
The --seed flag was added to the mine, test, and genetic CLI subcommands. When provided, it sets:
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
numpy.random.seed(seed)
random.seed(seed)
Note: Full determinism additionally requires torch.backends.cudnn.deterministic = True and CUBLAS_WORKSPACE_CONFIG=:4096:8, which are not set by default due to performance impact. The _environment block records whether these are enabled.
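As a toy illustration of why seeding matters: with the same seed, a stochastic run is exactly reproducible. The stdlib random generator below stands in for the torch, numpy, and python RNGs that the --seed flag actually configures.

```python
import random

def sampled_run(seed: int, n: int = 5) -> list[float]:
    """A stand-in for any stochastic pipeline stage: n 'sampled' values."""
    rng = random.Random(seed)  # local generator, unaffected by global state
    return [rng.random() for _ in range(n)]

run_a = sampled_run(42)
run_b = sampled_run(42)  # same seed: byte-for-byte identical to run_a
run_c = sampled_run(43)  # different seed: different trajectory
```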
8.2 Environment snapshots
Every result file now contains an _environment key:
{
"_environment": {
"timestamp": "2026-02-27T...",
"python": "3.12.2",
"torch": "2.12.0.dev20260225+cu128",
"transformers": "5.2.0",
"gpu": "NVIDIA GeForce RTX 5090",
"cuda": "12.8",
"device": "cuda",
"quantization": "bfloat16",
"seed": 42
}
}
8.3 Token redaction
The redact.py utility replaces sensitive token text with SHA-256 hash placeholders while preserving token IDs and all numeric metadata. Non-sensitive fields (base_text, wanted_token_text, target_token_text) are excluded from redaction. This enables sharing results publicly without exposing adversarial token strings.
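The redaction scheme can be sketched as follows. This is an illustrative approximation of redact.py, not its actual code: the rule used here (hash any string field ending in _text unless it is in the non-sensitive allowlist named above) and the sha256: prefix are assumptions.

```python
import hashlib

# Fields the paper names as non-sensitive and therefore excluded.
SKIP_FIELDS = {"base_text", "wanted_token_text", "target_token_text"}

def redact_value(text: str) -> str:
    """Replace token text with a truncated SHA-256 hash placeholder."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

def redact_record(record: dict) -> dict:
    out = {}
    for key, value in record.items():
        if isinstance(value, str) and key.endswith("_text") and key not in SKIP_FIELDS:
            out[key] = redact_value(value)  # hide the adversarial string
        else:
            out[key] = value                # token IDs and numbers survive
    return out

redacted = redact_record({"token_id": 319, "token_text": "\r\n", "asr": 1.0})
```

Because the hash is deterministic, redacted result files from different runs can still be cross-referenced token-by-token without revealing the underlying strings.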
9. Updated recommendations
Based on these findings, we update the operational recommendations from the original paper:
9.1 For detection
- Use full-vocabulary scanning for comprehensive coverage. Entropy-guided mining is fast but misses the majority of glitch tokens. Full-vocabulary scans are feasible for models up to ~128K vocabulary size on consumer GPUs (approximately 2--3 hours for Llama 3.2 1B on an RTX 5090).
- Use sampling-based ASR for severity ranking. Greedy ASR only distinguishes "always glitchy" from "never glitchy." Sampling-based ASR (temperature=0.7, 10 attempts) reveals the full stability spectrum, enabling risk-proportional prioritization.
9.2 For validation
- Multi-attempt validation must use stochastic generation. This is the single most important methodological correction. Deterministic generation with multiple attempts is equivalent to single-attempt validation and wastes compute.
- Report the full ASR distribution, not just the count above threshold. The shape of the distribution (e.g., bimodal vs. continuous) is itself a diagnostic signal about the model's tokenizer quality.
9.3 For mitigation
- Prioritize hard glitches (100% ASR) for blocklisting. These are deterministic failures that will always occur.
- Monitor soft glitches (50--99% ASR) at runtime. These cause intermittent failures that are difficult to reproduce in testing but will appear in production at scale.
- Evaluate marginal glitches (30--49% ASR) in context. For safety-critical applications, even a 30% failure rate may be unacceptable. For general-purpose use, these may be acceptable risks.
10. Limitations
- Single model. All experiments were conducted on Llama 3.2 1B Instruct. The ASR distribution shape, token categories, and entropy mining coverage gap may differ for other architectures and model sizes.
- Single quantization level. All tests used bfloat16. Quantization to int4/int8 may shift ASR values for individual tokens.
- Sampling sensitivity. The choice of temperature=0.7 and top_p=0.9 for multi-attempt ASR affects the distribution. Lower temperatures would compress the distribution toward the extremes; higher temperatures would spread it. The optimal sampling parameters for ASR measurement are not yet established.
- Probe template dependence. ASR is measured against three specific probe templates. Tokens that glitch only under prompts not covered by these templates will be missed.
- No cross-provider validation. The corrected ASR results have not yet been validated against API providers (OpenAI, Anthropic, Mistral) to determine whether the stability distribution transfers across serving stacks.
Changelog
| Date | Change |
|---|---|
| 2026-02-27 | Initial publication |