AI Security · February 27, 2026

Auditing Glitcher's ASR Validation and Mining Coverage: Deterministic Decoding Bugs and Candidate Generation Gaps in Glitch Token Discovery

LLM Security · Glitch Tokens · Research Audit · Methodology


Version: 1.0
Date: 2026-03-06
Author: Jeremy (Independent Security Researcher, RichardsAI)
Repository: https://github.com/binaryninja/glitcher
License: MIT
Companion to: Glitcher: A Research Toolkit for Detecting, Characterizing, and Steering Glitch Tokens in Large Language Models (v2.0, 2026-02-25)

Responsible-use note: This paper reports bugs and coverage gaps in the Glitcher research toolkit's validation methodology. It does not publish token lists or step-by-step guidance intended to evade safeguards or enable abuse. A patch for the identified bugs has been prepared for the Glitcher repository. If you discover a vulnerability affecting a deployed system, follow coordinated disclosure practices.


Abstract

Glitcher v2.0 introduced a multi-attempt Attack Success Rate (ASR) validation system claimed to reduce false positive rates from ~30–50% to ~5–15%, and an entropy-guided mining pipeline for discovering glitch tokens via L2 norm thresholding and gradient-based discrete optimization. This follow-on paper audits both mechanisms against meta-llama/Llama-3.2-1B-Instruct and reports three findings:

  1. The ASR multi-attempt validation is deterministic, not stochastic. All four model.generate() call sites in enhanced_validation.py hardcode do_sample=False, producing byte-identical outputs across all attempts. The multi-attempt loop provides zero additional information over a single probe. 18/18 tested tokens produced identical output across all 10 attempts and all 3 prompt templates (Section 3).
  2. The entropy-guided mining pipeline misses the vast majority of glitch tokens. A stratified random scan of 2,000 tokens from the 128K vocabulary discovered 283 confirmed glitch tokens (14.1% rate), while 50 iterations of entropy mining found 255 tokens—with only 6 tokens overlapping. Over half (51.6%) of scan-discovered tokens that mining missed have L2 embedding norms within 1 standard deviation of the vocabulary mean, confirming that norm-based thresholding is insufficient as the sole candidate generation signal (Section 4).
  3. Patching the ASR bug reveals a continuous distribution. After applying stochastic generation (do_sample=True, temperature=0.7, top_p=0.95, fresh seed per attempt), 33.2% of confirmed glitch tokens exhibit intermediate ASR values between 0.5 and 0.95—a class of tokens invisible to the original deterministic validation. The original bimodal distribution is statistically impossible under genuine stochastic sampling (binomial probability P = 1.27 × 10⁻¹⁹) (Section 5).


1. Introduction

Glitcher v2.0 (hereafter "the source paper") presents a modular toolkit for detecting and characterizing glitch tokens in large language models. Two core claims underpin the toolkit's reliability:

  1. Multi-attempt ASR validation (Section 6.2 of the source paper) runs 3–10 independent generation attempts per token, reducing false positive rates from ~30–50% to ~5–15%.
  2. Entropy-guided mining (Sections 6.1 and 6.4) combines L2 norm thresholding, embedding distance clustering, and gradient-based entropy maximization to efficiently surface glitch token candidates from large vocabularies.

Both claims propagate into downstream components: the genetic algorithm (Section 7) uses ASR scores as fitness signals, cross-provider validation (Section 8) relies on the candidate set produced by mining, and the encoded character confusion results (Section 6.5) share the validation infrastructure.

This paper tests two questions:

  • Is the ASR measurement genuinely stochastic? The source paper reports that broader mining runs yield 514/516 tokens at exactly 100% ASR with zero intermediate values—a distribution that is statistically implausible for a stochastic metric evaluated over multiple independent attempts.
  • Does entropy-guided mining achieve adequate coverage? The pipeline assumes glitch tokens concentrate near embedding-space anomalies, but never validates this assumption against a vocabulary-wide ground truth.

We report findings from a code audit, a determinism experiment, a patched stochastic validation, and a 2,000-token stratified vocabulary scan—all on meta-llama/Llama-3.2-1B-Instruct running on an NVIDIA RTX 5090 in bfloat16 precision.


2. Methodology

2.1 Hardware and model configuration

Parameter         Value
Model             meta-llama/Llama-3.2-1B-Instruct
GPU               NVIDIA GeForce RTX 5090 (32 GB)
Precision         bfloat16
Vocabulary size   128,256 tokens
Glitcher commit   Unversioned working tree (no commits)

2.2 Phase 1: ASR code audit and determinism experiment

We instrumented enhanced_validation.py with logging to record the exact generation parameters (do_sample, temperature, top_p, top_k, random seed) passed to every model.generate() call. We then ran ASR validation on 18 known glitch tokens with 10 attempts each across 3 prompt templates, capturing raw output text for every attempt to check for identical outputs.

We audited all 12 anomaly indicator functions across enhanced_validation.py, glitch_classifier.py, and base_classifier.py to determine whether each indicator is deterministic given the same output text.
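A minimal version of this instrumentation is a thin wrapper around model.generate that records the decoding parameters of each call. The helper below is an illustrative sketch (the function name is ours, not Glitcher's actual logging code):

```python
import functools

def instrument_generate(model, log):
    """Record the decoding parameters of every model.generate call.

    Sketch of the audit's instrumentation; a logged value of None
    means the parameter was not set and the library default applies.
    """
    original = model.generate

    @functools.wraps(original)
    def wrapped(*args, **kwargs):
        log.append({k: kwargs.get(k)
                    for k in ("do_sample", "temperature", "top_p", "top_k")})
        return original(*args, **kwargs)

    model.generate = wrapped
    return model
```

Running the validation harness against a wrapped model then yields one log entry per call site, making hardcoded greedy decoding immediately visible.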

2.3 Phase 2: Stochastic ASR patch

We implemented a patch adding a stochastic parameter to enhanced_glitch_verify() that activates when num_attempts > 1. When enabled:

  • do_sample=True
  • temperature=0.7
  • top_p=0.95
  • A unique random seed per attempt, derived from timestamp + attempt index

The patch can also be activated via the GLITCHER_STOCHASTIC_ASR=1 environment variable. Both the Harmony and legacy code paths are patched at all four generate() call sites.

2.4 Phase 3: Stratified vocabulary scan

We sampled 2,000 tokens from the full 128,256-token vocabulary using stratified random sampling across L2 norm quintiles, Unicode categories, and token lengths. Each token was evaluated through the patched behavioral probe harness with 5 attempts per token, 50 generated tokens per attempt, across 3 probe templates (neutral completion, instruction following, structured extraction). ASR threshold for confirmation: 0.5.

We simultaneously ran the standard Glitcher entropy mining pipeline (glitcher mine) for 50 iterations on the same model and hardware to produce the baseline comparison set.

2.5 Phase 4: Coverage gap analysis

We computed set differences between scan-confirmed and mining-confirmed tokens, compared L2 norm distributions using a two-sample Kolmogorov–Smirnov (KS) test, and categorized missed tokens by Unicode General Category.
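The KS statistic itself is just the largest vertical gap between the two empirical CDFs; the dependency-free sketch below shows the computation (p-values would come from scipy.stats.ks_2samp):

```python
import bisect

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        fa = bisect.bisect_right(a, x) / len(a)
        fb = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d
```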


3. ASR determinism bug (H3–H4)

3.1 Code audit findings

All four model.generate() call sites in enhanced_validation.py hardcode greedy decoding:

Location   Code path            do_sample   Temperature   Top-p     Seed
Line 165   Harmony, quiet       False       Not set       Not set   Not set
Line 176   Harmony, non-quiet   False       Not set       Not set   Not set
Line 275   Legacy, quiet        False       Not set       Not set   Not set
Line 289   Legacy, non-quiet    False       Not set       Not set   Not set

Source code comments explicitly state "greedy for consistency." The multi-attempt loop (lines 112–374) repeats this identical deterministic computation num_attempts times.

The classifier infrastructure shares the same issue: TestConfig in types.py (line 209) defaults to temperature=0.0, and base_classifier.py (line 194) uses do_sample=(self.config.temperature > 0), which evaluates to False.

3.2 Determinism experiment

Running 10 attempts per token on 18 known glitch tokens across 3 prompt templates:

Metric                                    Value
Tokens tested                             18
Attempts per token                        10
Prompt templates per token                3
Total generation runs                     540
Runs with output identical to attempt 1   540 (100%)

Every single generation run produced byte-identical output to the first attempt for the same token and template. The multi-attempt loop provides zero additional information.
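The identity check reduces to comparing each attempt's raw output text against attempt 1; a minimal version of the tally:

```python
def determinism_rate(outputs_by_key):
    """Fraction of runs whose output is byte-identical to attempt 1.

    outputs_by_key maps (token, template) -> list of output strings,
    one per attempt. A rate of 1.0 means the multi-attempt loop added
    no information.
    """
    total = identical = 0
    for attempts in outputs_by_key.values():
        first = attempts[0]
        for out in attempts:
            total += 1
            identical += (out == first)
    return identical / total
```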

3.3 Indicator function analysis

We audited all 12 indicator functions. Since greedy decoding produces identical output for each attempt, every indicator returns the same boolean for all attempts of a given token:

Indicator                               Deterministic given same output?   Could vary across stochastic outputs?
enhanced_validation anomaly indicator   Yes                                Yes
glitch_injection_pattern                Yes                                Unlikely (hardcoded substrings)
edreader_pattern                        Yes                                Unlikely (hardcoded "edreader")
flooding_pattern                        Yes                                Yes (length-based)
referential_pattern                     Yes                                Unlikely (hardcoded patterns)
incoherent_response                     Yes                                Unlikely (hardcoded patterns)
math_glitch_pattern                     Yes                                Yes
glitch_bypass_pattern                   Yes                                Yes (refusal phrase presence)
detailed_email_analysis                 Yes                                Yes (JSON parse variability)
detailed_domain_analysis                Yes                                Yes (JSON parse variability)
creates_valid_email_address             Yes (always False; placeholder)    No
creates_valid_domain_name               Yes (always False; placeholder)    No

Seven of 12 indicators check for model-specific hardcoded strings ("edreader", "referentialaction", "database", etc.) that would rarely vary even under stochastic generation, further limiting the utility of multi-attempt aggregation even after a fix.

3.4 Consequence for the source paper's claims

The source paper (Section 6.2) states:

"This multi-attempt approach reduces false positive rates from ~30–50% (single-probe methods) to ~5–15%, providing higher confidence in reported findings."

This claim is unsupported by the implementation. Multi-attempt greedy decoding is computationally equivalent to single-attempt greedy decoding. Any false positive reduction observed in practice comes from enhanced_validation.py's different prompt format and generation-based checking (compared to the fast-pass candidate generator), not from multi-attempt aggregation.

Verdict on H3 (ASR determinism bug): CONFIRMED. Verdict on H4 (indicator function collapse): CONFIRMED.


4. Mining coverage gap (H1–H2)

4.1 Brute-force scan results

Metric                                              Value
Tokens scanned (stratified sample)                  2,000
Fraction of full vocabulary sampled                 1.56% (2,000 / 128,256)
Glitch tokens confirmed (scan, ASR ≥ 0.5)           283 (14.1%)
Glitch tokens from entropy mining (50 iterations)   255
Overlap between scan and mining                     6
Tokens found by scan only                           277
Tokens found by mining only (not in sample)         249

The two methods discover almost entirely disjoint sets of glitch tokens. Of the 283 tokens found by the vocabulary scan, only 6 (2.1%) were also found by entropy mining. The mining pipeline missed 277 tokens in a sample representing just 1.56% of the vocabulary.

4.2 Coverage gap extrapolation

If the 14.1% glitch rate in the stratified 2,000-token sample is representative, the full 128,256-token vocabulary may contain approximately 18,000 glitch tokens. The entropy mining pipeline's 255 tokens from 50 iterations would represent approximately 1.4% coverage of the estimated total.
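The extrapolation arithmetic, spelled out:

```python
# Extrapolating the sample's glitch rate to the full vocabulary
vocab_size = 128_256
sample_size, confirmed = 2_000, 283

rate = confirmed / sample_size                # 0.1415 (14.1%)
estimated_total = round(rate * vocab_size)    # about 18,100 glitch tokens
mining_coverage = 255 / estimated_total       # about 0.014 (roughly 1.4%)
```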

This extrapolation should be treated with caution—the stratified sampling design intentionally oversamples underrepresented Unicode categories and norm quintiles relative to uniform random sampling. Nevertheless, the magnitude of the gap (two orders of magnitude) is robust to moderate sampling bias.

Verdict on H1 (≥20% coverage gap): CONFIRMED. The gap far exceeds 20%; mining found only 2.1% of scan-discovered tokens in the sampled region.

4.3 Embedding norm analysis

Statistic                        Vocabulary       Mining-found (in sample)   Scan-only (missed)
Mean L2 norm                     0.930            0.832                      (not reported)
Std L2 norm                      0.091            (not reported)             (not reported)
Range L2 norm                    [0.524, 1.062]   (not reported)             (not reported)
Tokens within 1σ of vocab mean   n/a              (not reported)             143 / 277 (51.6%)

A two-sample KS test on L2 norms between mining-found and scan-only tokens yields:

Test                                   Statistic   p-value
KS test (mining-found vs. scan-only)   0.895       4.8 × 10⁻⁶

The distributions differ significantly, but the critical finding is the direction of that difference: over half of missed tokens have embedding norms indistinguishable from the vocabulary mean. These tokens would never be surfaced by L2 norm thresholding because they do not appear as outliers.

Verdict on H2 (normal norms in missed tokens): CONFIRMED. 51.6% of missed tokens have L2 norms within 1 standard deviation of the vocabulary mean.


5. Corrected ASR distribution (H5)

5.1 Before-patch distribution

Under the unpatched (greedy) code, all token ASR values collapse to exactly 0% or 100%. The determinism audit on 18 known tokens yielded 18 at 0% ASR and 0 at 100%. Broader mining runs reported by the source paper show 514/516 at 100%.

5.2 After-patch distribution

Running the vocabulary scan with the stochastic patch on 2,000 tokens:

ASR range                   Token count      Fraction of confirmed (283)
95–100%                     189              66.8%
80–95%                      (not reported)   (not reported)
50–95% (all intermediate)   94               33.2%
Below 50% (not confirmed)   (not reported)   (not reported)

The corrected distribution shows a substantial continuous spread. 33.2% of confirmed glitch tokens have intermediate ASR values—tokens that are genuinely stochastic in their glitch behavior, sometimes triggering anomalies and sometimes not.

5.3 Statistical impossibility of the original distribution

Under a binomial model assuming true ASR = 0.95 per attempt with 3 attempts, the probability of observing 283 tokens with no intermediate values:

P(all 283 at exactly 0% or 100%) = (0.95³ + 0.05³)^283 ≈ 0.8574^283 ≈ 1.27 × 10⁻¹⁹

The original bimodal distribution is not consistent with genuine stochastic sampling at any reasonable per-attempt success probability.

Verdict on H5 (continuous distribution after fix): CONFIRMED. 33.2% of tokens show intermediate ASR, far exceeding the 10% threshold. The total confirmed count of 283 (from a 2,000-token sample) vs. the mining pipeline's 255 (from full vocabulary mining) represents a qualitative shift in understanding of glitch token prevalence.


6. Category analysis of missed tokens (H6)

6.1 Unicode category distribution

Full behavioral classification was not run on the vocabulary scan results due to compute cost, and the Glitcher classification system uses hardcoded behavioral patterns (checking for strings like "edreader") that are artifacts of specific tokens rather than generic behavioral detectors. We report Unicode General Category analysis as a proxy.

Missed tokens (277 total):

Unicode category             Count   Description
Ll (lowercase letter)        118     Latin/non-Latin lowercase
whitespace                   63      Whitespace-heavy tokens
Po (other punctuation)       38      Punctuation tokens
Lo (other letter)            31      CJK, Arabic, Devanagari, etc.
Pe (close punctuation)       11      Closing brackets/parentheses
Lu (uppercase letter)        4       Uppercase tokens
Sm (math symbol)             3       Mathematical symbols
Mn (nonspacing mark)         3       Combining characters
Ps (open punctuation)        2       Opening brackets
Mc (spacing mark)            2       Spacing combining marks
Pc (connector punctuation)   1       Underscores
Pi (initial punctuation)     1       Opening quotes

Mining-found tokens (in sample, 6 total):

Unicode category        Count
Ll (lowercase letter)   5
whitespace              1

The mining pipeline overwhelmingly discovers lowercase letter tokens (83% of its in-sample findings), while the vocabulary scan discovers a far broader range: whitespace-heavy tokens (22.7%), punctuation (13.7%), non-Latin scripts including CJK and Arabic (11.2%), and combining characters (1.8%).

6.2 Interpretation

The entropy-guided mining objective (maximize predictive entropy) and L2 norm thresholding both appear biased toward Latin-script subword fragments. Tokens in other Unicode blocks—whitespace sequences, CJK ideographs, Arabic subwords, punctuation clusters, and combining marks—can trigger behavioral anomalies without producing high predictive entropy or low embedding norms.

A chi-squared test was not computed due to the sparse mining-found distribution (only 6 tokens in the sample overlap). The qualitative pattern—mining misses the non-Latin and non-alphabetic tokens—is clear from the category distributions.

Verdict on H6 (category differences for intermediate-ASR tokens): PARTIALLY CONFIRMED. We could not run full behavioral classification to test the specific prediction that intermediate-ASR tokens cluster in Disruption and Hallucination categories. However, the Unicode category analysis confirms that missed tokens are structurally different from mining-found tokens, with disproportionate representation of whitespace, punctuation, and non-Latin scripts.


7. Downstream impact

7.1 Genetic algorithm fitness

The GeneticProbabilityReducer (Section 7 of the source paper) uses ASR-based fitness signals to evaluate candidate token combinations. With deterministic ASR, the algorithm receives binary 0/1 fitness values instead of continuous gradients. This limits its ability to distinguish partially effective token combinations from fully effective ones and may cause premature convergence toward combinations that happen to cross the binary threshold rather than combinations with genuinely maximal disruptive potential.

7.2 Cross-provider validation

Cross-provider validation (Section 8 of the source paper) relies on the candidate set produced by entropy mining. Given that mining misses an estimated >95% of glitch tokens, cross-provider validation results reflect only a narrow slice of the glitch token landscape.

7.3 Mining pipeline results

Existing glitch_tokens.json results remain valid in a narrow sense: tokens that pass greedy validation are genuine glitches (greedy decoding represents the worst-case-for-the-model scenario, as stochastic decoding can sometimes "escape" the glitch behavior). However:

  • The false positive rate claim ("reduced from ~30–50% to ~5–15%") is unsupported, as multi-attempt greedy decoding provides no additional information over single-attempt greedy decoding.
  • The ASR scores are meaningless as continuous metrics—all confirmed tokens are reported at 100% regardless of their true stochastic ASR.
  • The candidate set represents a small fraction of the total glitch token population due to the mining coverage gap.

7.4 Encoded character confusion results

The encoded character confusion tests (Section 6.5 of the source paper) use a separate evaluation path that checks model responses for specific extraction outcomes rather than using the ASR multi-attempt mechanism. The 84% confusion rate finding across 1,800 test combinations is likely unaffected by the ASR determinism bug, as those tests evaluate whether the model decoded an encoded character (a deterministic check on a deterministic output) rather than measuring stochastic behavioral variation. However, this has not been independently re-validated.


8. Recommendations

8.1 Critical: Fix ASR validation

Apply the stochastic generation patch to enhanced_glitch_verify(). The patch adds:

if stochastic and num_attempts > 1:
    # Fresh seed per attempt: timestamp + attempt index (see Section 2.3)
    torch.manual_seed(int(time.time()) + attempt)
    generate_kwargs = {
        "do_sample": True,
        "temperature": 0.7,
        "top_p": 0.95,
    }

All four generate() call sites (lines 165, 176, 275, 289) must be patched. The stochastic parameter should default to True when num_attempts > 1.

8.2 Broaden candidate generation

The entropy-guided mining pipeline's reliance on L2 norm thresholding misses tokens with normal embedding norms (51.6% of scan-discovered glitch tokens). Supplement with:

  • Stratified random vocabulary sampling across Unicode categories and norm quintiles.
  • Unicode-category-targeted scanning of CJK, Arabic, Cyrillic, whitespace, and punctuation blocks.
  • Cross-lingual token targeting of Vietnamese, Turkish, and other language-specific subword fragments that are heavily represented in missed tokens.

8.3 Replace hardcoded indicators

Seven of 12 behavioral indicator functions check for model-specific hardcoded strings ("edreader", "referentialaction", etc.). Replace with generic behavioral detectors:

  • Output length anomaly detection (unusually short or long outputs).
  • Repetition detection (n-gram or character-level).
  • Topic drift measurement (semantic similarity to prompt).
  • Refusal pattern analysis (regex-based refusal phrase matching).
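As one example, a generic repetition detector needs only a few lines; the helper below is a sketch of the n-gram approach, not Glitcher code:

```python
def repetition_score(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that are repeats: 0.0 for fully novel
    text, approaching 1.0 for degenerate loops. A model-agnostic
    replacement for hardcoded substring indicators."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

A threshold on this score (e.g. flagging outputs above 0.5) would fire on flooding-style glitch behavior regardless of which model or token produced it.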

8.4 Re-measure existing results

After applying the ASR fix, re-run ASR validation on the 255 mining-discovered tokens to obtain corrected continuous ASR scores. Re-run the genetic algorithm experiments to verify that continuous fitness values improve convergence behavior.

8.5 Scale vocabulary scanning

The 14.1% glitch rate in a 2,000-token stratified sample suggests approximately 18,000 glitch tokens in Llama 3.2 1B's 128,256-token vocabulary. A full vocabulary scan (estimated ~130 GPU-hours on the tested hardware) would establish a ground-truth dataset for benchmarking future candidate generation strategies.


9. Limitations

  • Sample size. The vocabulary scan covered 2,000 of 128,256 tokens (1.56%). While stratified sampling improves representativeness, the 14.1% glitch rate and extrapolated ~18,000 total should be treated as estimates, not ground truth.
  • Single model. All experiments were conducted on meta-llama/Llama-3.2-1B-Instruct. The ASR determinism bug is a code-level issue affecting all models tested with Glitcher, but the coverage gap magnitude and Unicode category distributions may differ for other architectures and vocabulary sizes.
  • No full behavioral classification. Due to compute constraints and the limitations of Glitcher's hardcoded indicator functions, we did not run full behavioral classification on scan-discovered tokens. The H6 analysis uses Unicode category as a proxy, not direct behavioral categorization.
  • Patch validation scope. The stochastic patch was validated on the 2,000-token scan and the 18-token determinism experiment. It has not been stress-tested across all Glitcher workflows (genetic algorithm, cross-provider validation, encoded character confusion).
  • Extrapolation uncertainty. The estimate of ~18,000 glitch tokens in the full vocabulary assumes the stratified sample's glitch rate is representative. Uniform random sampling may yield a different rate; the stratified design intentionally oversamples underrepresented categories.
  • Encoded character confusion results not re-validated. We assessed the ASR bug's likely non-impact on the encoded character confusion findings (Section 7.4) based on code path analysis, but did not re-run the 1,800-combination matrix.

Changelog

Version   Date         Description
1.0       2026-03-06   Initial release: ASR determinism bug report, mining coverage gap analysis, corrected ASR distribution, category analysis, and recommendations.