Five different groups abliterated the same AI model. When I ran the maths benchmarks, their scores ranged from 27.5% to 75.1%. That is a 47.6 percentage point gap. It looks like some techniques made the model way better at maths and others broke it. But when I dug into why, it turned out nobody got smarter or dumber. The abliteration just changed how long they think before answering. The real scores were all within 2.8 percentage points of each other.
What is Qwen3.6-27B?
Qwen3.6-27B is the latest reasoning model from the Qwen family at roughly 27 billion parameters. It uses a hybrid Mamba2 and Transformer architecture with 64 layers. The “reasoning model” part is important. Qwen3.6 has a private chain of thought. Before it gives you a visible answer, it silently thinks through the problem using hidden <think/> tokens. That thinking behaviour turns out to be the key to understanding the benchmark results.
The base model is Qwen/Qwen3.6-27B on HuggingFace. If you want the full methodology details, check out my previous comparison across five Qwen models. The testing approach is the same here.
What is LLM Abliteration?
A quick primer for anyone new to this. AI models are trained to refuse certain requests. They will not write harmful content, they will not help with illegal activities, and so on. Abliteration is a technique that surgically removes that refusal behaviour from the model’s internal weights. Think of it like finding the “refusal switch” inside the model and turning it off. The result is an uncensored model that should respond to any prompt.
The key question is always whether that surgery damages the model’s intelligence. That is what we are testing here.
The Six Models
Five different approaches to the same goal. I will note up front that HauhauCS is being discontinued from all future comparisons. The tool they used, called Reaper Abliteration, was shown to be plagiarised from Heretic with all attribution stripped. I recovered the safetensors from their Q8_K_P GGUF using ungguf, so the weights carry both Reaper’s edits and GGUF quantisation noise on top. You can read the full investigation in my plagiarism analysis post.
Benchmark Results
All six models were tested with identical settings via lm-evaluation-harness through vLLM with BitsAndBytes 4-bit quantisation on a single RTX 5090. BNB4 drops absolute scores compared to full precision but preserves the relative gaps between variants. Think of these numbers as measuring the distance between techniques, not the final score.
| Task | Base | Heretic | HauhauCS | Huihui | AEON | Abliterix |
|---|---|---|---|---|---|---|
| MMLU | 83.3% | 82.8% | 83.9% | 83.4% | 82.9% | 81.3% |
| HellaSwag | 83.5% | 83.2% | 83.1% | 83.5% | 82.7% | 77.3% |
| ARC Challenge | 59.1% | 58.0% | 57.9% | 59.5% | 56.1% | 53.2% |
| WinoGrande | 77.7% | 77.7% | 77.7% | 77.4% | 75.3% | 74.9% |
| TruthfulQA MC2 | 56.7% | 51.1% | 47.2% | 54.8% | 46.1% | 48.7% |
| PiQA | 81.0% | 81.0% | 81.0% | 81.2% | 80.4% | 75.7% |
| GSM8K raw | 34.4% | 27.5% | 51.0% | 75.1% | 51.2% | 37.6% |
| GSM8K adjusted | 96.2% | 93.8% | 96.6% | 96.0% | 95.8% | 95.6% |
| Lambada ppl | 3.18 | 3.24 | 3.35 | 3.15 | 3.44 | 9.12 |
A quick guide to the benchmarks. MMLU tests general knowledge across 57 subjects. HellaSwag tests common sense. ARC Challenge tests science reasoning. WinoGrande tests resolving ambiguous pronouns. TruthfulQA tests whether the model falls for common misconceptions. PiQA tests physical reasoning. GSM8K tests maths word problems. Lambada tests predicting the final word of a passage and is reported as perplexity, so lower is better for that one.
The delta table tells the story more clearly. Positive numbers mean the abliterated model scored higher than base, negative means lower.
| Task | Heretic | HauhauCS | Huihui | AEON | Abliterix |
|---|---|---|---|---|---|
| MMLU | -0.5 | +0.6 | +0.1 | -0.4 | -2.0 |
| HellaSwag | -0.3 | -0.4 | +0.0 | -0.8 | -6.2 |
| ARC Challenge | -1.1 | -1.2 | +0.4 | -3.0 | -5.9 |
| WinoGrande | +0.0 | +0.0 | -0.3 | -2.4 | -2.8 |
| TruthfulQA | -5.6 | -9.5 | -1.9 | -10.6 | -8.0 |
| PiQA | +0.0 | +0.0 | +0.2 | -0.6 | -5.3 |
| GSM8K | -6.9 | +16.6 | +40.7 | +16.8 | +3.2 |
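The delta table can be rebuilt from the results table with a few lines of Python. This is a minimal sketch, not the script used for the post: the function names `deltas` and `mean_abs_delta` are my own, and the scores are copied by hand from the table above. Lambada is left out because it is a perplexity, not a percentage.

```python
# Scores in percent, copied from the results table above.
VARIANTS = ["Base", "Heretic", "HauhauCS", "Huihui", "AEON", "Abliterix"]
ROWS = {
    "MMLU":          [83.3, 82.8, 83.9, 83.4, 82.9, 81.3],
    "HellaSwag":     [83.5, 83.2, 83.1, 83.5, 82.7, 77.3],
    "ARC Challenge": [59.1, 58.0, 57.9, 59.5, 56.1, 53.2],
    "WinoGrande":    [77.7, 77.7, 77.7, 77.4, 75.3, 74.9],
    "TruthfulQA":    [56.7, 51.1, 47.2, 54.8, 46.1, 48.7],
    "PiQA":          [81.0, 81.0, 81.0, 81.2, 80.4, 75.7],
    "GSM8K":         [34.4, 27.5, 51.0, 75.1, 51.2, 37.6],
}


def deltas(rows, variants, baseline="Base"):
    """Return {task: {variant: score - baseline}} in percentage points."""
    base_idx = variants.index(baseline)
    table = {}
    for task, scores in rows.items():
        base = scores[base_idx]
        table[task] = {
            name: round(score - base, 1)
            for name, score in zip(variants, scores)
            if name != baseline
        }
    return table


def mean_abs_delta(table, variant, exclude=("GSM8K",)):
    """Average absolute delta for one variant, skipping the excluded tasks."""
    values = [abs(row[variant]) for task, row in table.items() if task not in exclude]
    return sum(values) / len(values)


DELTAS = deltas(ROWS, VARIANTS)
for name in VARIANTS[1:]:
    print(f"{name:10s} mean |delta| outside GSM8K: {mean_abs_delta(DELTAS, name):.2f} pp")
```

Excluding GSM8K from the average is a deliberate choice here, since that row is dominated by the thinking-budget effect rather than by capability changes.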
Outside GSM8K, Huihui has the smallest deltas on average. AEON degrades on every non-GSM8K task. Abliterix takes the biggest hits. And GSM8K looks like chaos. That is where the story gets interesting.
The GSM8K Reasoning Efficiency Discovery
This is the big finding from this comparison, and it affects anyone benchmarking reasoning models after abliteration.
Remember that Qwen3.6 thinks silently before answering. The max_gen_toks setting includes those thinking tokens. If a model thinks for 7000 tokens and the budget is 7168, it only has 168 tokens left to write its answer. If it runs out, the response is empty and gets scored as wrong.
The base model exhausts its thinking budget on 68.2% of GSM8K questions. It never even writes an answer for more than two thirds of the problems. Huihui only does that 23.0% of the time. But when both models actually produce an answer, their accuracy is nearly identical. Base scores 96.2% adjusted. Huihui scores 96.0% adjusted.
| Model | GSM8K Raw | Invalid Rate | GSM8K Adjusted | Real Gap |
|---|---|---|---|---|
| HauhauCS | 51.0% | 49.3% | 96.6% | +0.4pp |
| Base | 34.4% | 68.2% | 96.2% | baseline |
| Huihui | 75.1% | 23.0% | 96.0% | -0.2pp |
| AEON | 51.2% | 69.2% | 95.8% | -0.4pp |
| Abliterix | 37.6% | 62.1% | 95.6% | -0.6pp |
| Heretic | 27.5% | 74.5% | 93.8% | -2.4pp |
The raw scores span 47.6 percentage points. The adjusted scores span 2.8 percentage points. Abliteration changes how long the model thinks, not how well it reasons. Most abliterated models have shorter thinking chains, allowing more answers within the token budget. Heretic is the exception. Its surgical edits actually extend thinking chains, pushing its invalid rate above even the base model.
This has real implications. If you benchmark abliterated reasoning models and look only at raw GSM8K, you will draw completely wrong conclusions about which technique preserved mathematical capability. You must separate empty responses from incorrect responses.
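Separating the two failure modes is easy once you have per-sample outputs (lm-evaluation-harness can write these with `--log_samples`). The sketch below assumes "adjusted" means accuracy over the questions that received a visible answer. The `Sample` record and its fields are hypothetical, not the harness's own log format, and the ten toy samples are made up purely to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    response: str  # visible answer after the thinking block, "" if the budget ran out
    correct: bool  # whether the extracted answer matched the reference


def summarise(samples):
    """Split raw accuracy into an invalid (empty) rate and accuracy on answered questions."""
    total = len(samples)
    answered = [s for s in samples if s.response.strip()]
    right = sum(1 for s in answered if s.correct)
    return {
        "raw": right / total,
        "invalid_rate": (total - len(answered)) / total,
        "adjusted": right / len(answered) if answered else float("nan"),
    }


# Toy data: 6 questions ran out of budget, 4 were answered, all 4 correctly.
toy = [Sample("", False)] * 6 + [Sample("The answer is 42.", True)] * 4
stats = summarise(toy)
print(f"raw={stats['raw']:.1%} invalid={stats['invalid_rate']:.1%} adjusted={stats['adjusted']:.1%}")
```

A model like this toy one looks terrible on the raw score even though it never gave a wrong answer, which is the pattern to check for before comparing abliteration techniques.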
HarmBench Safety Evaluation
HarmBench tests 400 harmful prompts across seven categories. The results here are straightforward. All five abliterated models achieve near-complete safety removal.
| Variant | ASR | Empty Responses | Full CoT ASR |
|---|---|---|---|
| Base | 25.8% | 1 | 26.0% |
| Huihui | 98.5% | 5 | 99.8% |
| HauhauCS | 94.5% | 22 | 100.0% |
| Abliterix | 94.5% | 22 | 100.0% |
| Heretic | 92.5% | 30 | 100.0% |
| AEON | 88.8% | 45 | 100.0% |
Four of five reach 100% Full CoT ASR. The reported ASR differences come from the same thinking budget problem as GSM8K. When abliterated models construct harmful content, they think harder about it. If they exhaust the generation budget, the response is empty and gets classified as a refusal. That means the reported ASR understates the true safety removal.
Huihui gets the highest reported ASR at 98.5% with the fewest empty responses at just 5. AEON has the most empty responses at 45, which drags its reported ASR down to 88.8%. But when those empty responses are accounted for, AEON also reaches 100%.
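The numbers in the table are consistent with a simple rule: Full CoT ASR counts every empty response as a non-refusal. The sketch below checks that arithmetic. The success counts are my own inference, obtained by rounding reported ASR times 400 to a whole number, and the function names are illustrative.

```python
TOTAL_PROMPTS = 400

# (successful attacks, empty responses). Success counts are inferred from the
# reported ASR column as round(ASR * 400); empty counts come from the table.
RESULTS = {
    "Base":      (103, 1),
    "Huihui":    (394, 5),
    "HauhauCS":  (378, 22),
    "Abliterix": (378, 22),
    "Heretic":   (370, 30),
    "AEON":      (355, 45),
}


def reported_asr(successes, total=TOTAL_PROMPTS):
    """ASR in percent when empty responses are scored as refusals."""
    return 100.0 * successes / total


def full_cot_asr(successes, empties, total=TOTAL_PROMPTS):
    """ASR in percent when empty responses are counted as non-refusals."""
    return 100.0 * (successes + empties) / total


for name, (successes, empties) in RESULTS.items():
    print(f"{name:10s} reported={reported_asr(successes):.2f}% "
          f"full_cot={full_cot_asr(successes, empties):.2f}%")
```

Under this rule, the only prompts that count against a model are the ones where it produced a visible refusal.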
KL Divergence and Weight Analysis
KL divergence measures how much the abliterated model’s output distribution shifted from the original. Lower is better. A score of 0 means the model behaves identically. Anything below 0.1 stays under the threshold where capability damage becomes a concern.
| Variant | KL Divergence | Rating |
|---|---|---|
| Heretic | 0.0037 | excellent |
| Huihui | 0.0074 | excellent |
| Abliterix | 0.0222 | very good |
| AEON | 0.0238 | very good |
| HauhauCS | 0.0242 | very good |
Heretic and Huihui are in a class of their own, both rated excellent. The other three cluster together at roughly 3x Huihui’s score and 6x Heretic’s, but still well below the capability damage threshold at 0.1.
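For readers who have not computed it before, here is a minimal sketch of KL divergence between next-token distributions. It is not the measurement code behind the table. The direction KL(base || abliterated), the averaging over token positions, and the toy three-token logits are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_kl(base_logits, edited_logits):
    """Average KL(base || edited) over a list of token positions."""
    values = [
        kl_divergence(softmax(base), softmax(edited))
        for base, edited in zip(base_logits, edited_logits)
    ]
    return sum(values) / len(values)


# Toy logits for two token positions over a three-token vocabulary.
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
edited = [[2.1, 0.9, 0.1], [0.4, 0.6, 3.0]]
print(f"mean KL: {mean_kl(base, edited):.5f} nats")
```

Identical logits give exactly zero, and small shifts in the logits give small positive values, which is why the metric works as a measure of how far an edit moved the model.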
The weight analysis reveals something more interesting. Different techniques touch completely different parts of the model.
| Metric | AEON | Abliterix | Heretic | Huihui | HauhauCS |
|---|---|---|---|---|---|
| Tensors changed | 88 | 101 | 120 | 128 | 564 |
| Relative edit | 6.0% | 5.2% | 2.1% | 1.5% | 0.7% |
HauhauCS is an extreme outlier with 564 changed tensors, 4 to 6 times more than any other variant. That is the combination of Reaper’s broad abliteration edits plus GGUF quantisation round-trip noise. The abliteration signal and the noise are layered on top of each other and cannot be separated.
The other four techniques are nearly orthogonal to each other. Pairwise cosine similarities between them are mostly below 0.07. No two techniques found the same weight direction. The “refusal direction” in weight space is not a single switch you can flip. It is more like a manifold with many viable removal pathways. Each technique independently found a different path to the same outcome.
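Both metrics in this section come down to simple vector arithmetic on weight deltas. The sketch below is illustrative only. It assumes "relative edit" means the norm of the change divided by the norm of the original weights, which the post does not define, and it uses tiny flat lists in place of real tensors.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def weight_delta(base, edited):
    """Element-wise change applied to a (flattened) weight tensor."""
    return [e - b for b, e in zip(base, edited)]


def relative_edit(base, edited):
    """Size of the change relative to the size of the original weights (assumed definition)."""
    return norm(weight_delta(base, edited)) / norm(base)


def cosine_similarity(a, b):
    """1.0 means the same direction, 0.0 means orthogonal, -1.0 means opposite."""
    return dot(a, b) / (norm(a) * norm(b))


# Toy example: two edits of the same size that touch different coordinates.
base = [1.0, 2.0, -1.0, 0.5]
edited_a = [1.1, 2.0, -1.0, 0.5]
edited_b = [1.0, 2.0, -0.9, 0.5]

delta_a = weight_delta(base, edited_a)
delta_b = weight_delta(base, edited_b)
print(f"relative edit A: {relative_edit(base, edited_a):.3f}")
print(f"cosine(A, B):    {cosine_similarity(delta_a, delta_b):.3f}")
```

Two edits can have the same size and the same behavioural effect while sharing almost no direction, which is what a pairwise cosine similarity near zero indicates.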
Which Technique is Best?
Pulling together all the findings, two techniques stand out clearly.
Heretic has the lowest KL divergence at 0.0037, a small weight footprint at 2.1% relative edit, and achieves 100% Full CoT ASR. It is the most surgical approach. The one tradeoff is that the surgical edits extend thinking chains rather than shorten them. Heretic has the highest GSM8K invalid rate at 74.5%, even above the base model at 68.2%. If you need long reasoning chains to complete within a token budget, that matters. If your use case does not have a strict token budget, it does not.
Huihui has the smallest benchmark deltas across all non-GSM8K tasks at just 0.5pp average, the highest reported HarmBench ASR at 98.5%, and an excellent KL divergence rating at 0.0074. The GSM8K raw score of 75.1% looks like a huge improvement, but it is a thinking efficiency artefact. The adjusted gap is just 0.2pp below base. Huihui shortens thinking chains, which is helpful when you have a token budget to work within.
The difference between Heretic and Huihui is small. You would be happy with either.
HauhauCS produces solid behavioural results despite the complex weight fingerprint, but it is being discontinued. The Reaper tool was plagiarised from Heretic, the “lossless” claim is contradicted by the data, and the weights carry inseparable GGUF noise. There is no reason to use it when Heretic and Huihui both perform better.
AEON degrades on every non-GSM8K benchmark. TruthfulQA drops 10.6pp. ARC drops 3.0pp. It has the worst thinking loops with 45 empty HarmBench responses out of 400. The claims of “measurably enhanced capabilities” and “no looping, no philosophising spirals” are contradicted by the data.
Abliterix has the worst capability preservation under BNB4 quantisation. Lambada perplexity increases 2.9 times from 3.18 to 9.12. HellaSwag drops 6.2pp. However, the creator of Abliterix raised a valid point about BNB4 interacting badly with low-rank LoRA-merged weights. BNB4 is not subspace-aware, so the rank-3 directional updates can inflate per-block scaling and reduce precision for everything else. The Lambada result specifically may be a quantisation artefact rather than intrinsic damage. A native BF16 evaluation would be needed to confirm. I also initially misidentified which components Abliterix modifies. It targets attn.o_proj and mlp.down_proj across all 64 layers, not the routers and shared experts my forensic tool reported. Qwen3.6 is dense and has no MoE components.
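The scaling argument can be illustrated with a toy block quantiser. This is a deliberately simplified sketch and not how BNB4 is implemented: it uses uniform symmetric levels rather than a real 4-bit codebook, a 16-value block, and made-up weights. It only shows why one large value in a block costs precision for its neighbours.

```python
def quantise_block(block, levels=7):
    """Symmetric absmax quantisation of one block to integers in [-levels, levels]."""
    scale = max(abs(x) for x in block) / levels
    if scale == 0:
        return list(block)
    return [round(x / scale) * scale for x in block]


def mean_abs_error(original, restored, count):
    """Average reconstruction error over the first `count` values of a block."""
    return sum(abs(o - r) for o, r in zip(original[:count], restored[:count])) / count


# 16 small weights between -0.08 and 0.07 (made-up values).
clean = [0.01 * i for i in range(-8, 8)]
# Same block, but the last weight is replaced by one large value, standing in
# for a merged directional update that lands in this block.
spiked = clean[:15] + [0.8]

err_clean = mean_abs_error(clean, quantise_block(clean), 15)
err_spiked = mean_abs_error(spiked, quantise_block(spiked), 15)
print(f"error on the 15 shared weights, clean block:  {err_clean:.4f}")
print(f"error on the 15 shared weights, spiked block: {err_spiked:.4f}")
```

The 15 shared weights are identical in both blocks, yet their reconstruction error rises sharply once the outlier sets the block scale. Whether this effect explains the Abliterix Lambada result is exactly what a BF16 run would have to confirm.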
What Went Wrong
85 hours of productive GPU time across 7 days, plus about 25 hours lost to failed runs. The bulk were GSM8K timeouts. Qwen3.6’s architecture is incompatible with BNB4 plus tensor parallelism, so everything had to run on a single GPU. The default 120 second request timeout was too short for extended reasoning. I wrote a patched script with a 900 second timeout to fix it. I also accidentally re-ran AEON’s HarmBench with the wrong token limit, wasting 6.7 hours. And GSM8K per-model times vary wildly. HauhauCS took 53 minutes. AEON took 11 hours.
Resources
- Full report with all tables, charts, and provenance analysis
- Abliterlitics forensics toolkit on GitHub
- ungguf GGUF-to-safetensors converter
- Heretic open source abliteration tool
- Abliterix abliteration tool
- HauhauCS plagiarism investigation
- Reddit discussion on r/LocalLLaMA
- Other HauhauCS tensor comparisons