Qwen3.6-27B Abliteration Benchmarked: Five Techniques Under the Microscope

Five different groups abliterated the same AI model. When I ran the GSM8K maths benchmark, their scores ranged from 27.5% to 75.1%. That is a 47.6 percentage point gap. It looks like some techniques made the model way better at maths and others broke it. But when I dug into why, it turned out nobody got smarter or dumber. The abliteration just changed how long the models think before answering. Counting only the questions each model actually answered, the scores were all within 2.8 percentage points of each other.

Benchmark comparison across five abliteration techniques on Qwen3.6-27B

What is Qwen3.6-27B?

Qwen3.6-27B is the latest reasoning model from the Qwen family at roughly 27 billion parameters. It uses a hybrid Mamba2 and Transformer architecture with 64 layers. The “reasoning model” part is important. Qwen3.6 has a private chain of thought. Before it gives you a visible answer, it silently thinks through the problem using hidden <think/> tokens. That thinking behaviour turns out to be the key to understanding the benchmark results.

The base model is Qwen/Qwen3.6-27B on HuggingFace. If you want the full methodology details, check out my previous comparison across five Qwen models. The testing approach is the same here.

What is LLM Abliteration?

A quick primer for anyone new to this. AI models are trained to refuse certain requests. They will not write harmful content, they will not help with illegal activities, and so on. Abliteration is a technique that surgically removes that refusal behaviour from the model’s internal weights. Think of it like finding the “refusal switch” inside the model and turning it off. The result is an uncensored model that should respond to any prompt.

The key question is always whether that surgery damages the model’s intelligence. That is what we are testing here.

The Six Models

Name	Variant
Base	Qwen/Qwen3.6-27B
Heretic	llmfan46/Qwen3.6-27B-uncensored-heretic-v2
HauhauCS	HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive
Huihui	huihui-ai/Huihui-Qwen3.6-27B-abliterated
AEON	AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-BF16
Abliterix	wangzhang/Qwen3.6-27B-abliterated

Five different approaches to the same goal. I will note up front that HauhauCS is being discontinued from all future comparisons. The tool they used, called Reaper Abliteration, was shown to be plagiarised from Heretic with all attribution stripped. I recovered the safetensors from their Q8_K_P GGUF using ungguf, so the weights carry both Reaper’s edits and GGUF quantisation noise on top. You can read the full investigation in my plagiarism analysis post.

Benchmark Results

All six models were tested with identical settings via lm-evaluation-harness through vLLM with BitsAndBytes 4-bit quantisation (BNB4) on a single RTX 5090. BNB4 drops absolute scores compared to full precision but preserves the relative gaps between variants. Think of these numbers as measuring the distance between techniques, not the final score.

Task	Base	Heretic	HauhauCS	Huihui	AEON	Abliterix
MMLU	83.3%	82.8%	83.9%	83.4%	82.9%	81.3%
HellaSwag	83.5%	83.2%	83.1%	83.5%	82.7%	77.3%
ARC Challenge	59.1%	58.0%	57.9%	59.5%	56.1%	53.2%
WinoGrande	77.7%	77.7%	77.7%	77.4%	75.3%	74.9%
TruthfulQA MC2	56.7%	51.1%	47.2%	54.8%	46.1%	48.7%
PiQA	81.0%	81.0%	81.0%	81.2%	80.4%	75.7%
GSM8K raw	34.4%	27.5%	51.0%	75.1%	51.2%	37.6%
GSM8K adjusted	96.2%	93.8%	96.6%	96.0%	95.8%	95.6%
Lambada ppl	3.18	3.24	3.35	3.15	3.44	9.12

A quick guide to the benchmarks. MMLU tests general knowledge across 57 subjects. HellaSwag tests common sense. ARC Challenge tests science reasoning. WinoGrande tests resolving ambiguous pronouns. TruthfulQA tests whether the model falls for common misconceptions. PiQA tests physical reasoning. GSM8K tests maths word problems. Lambada tests predicting the final word of a passage and is reported as perplexity, so lower is better for that one.

The delta table tells the story more clearly. Positive numbers mean the abliterated model scored higher than base, negative means lower.

Task	Heretic	HauhauCS	Huihui	AEON	Abliterix
MMLU	-0.5	+0.6	+0.1	-0.4	-2.0
HellaSwag	-0.3	-0.4	+0.0	-0.8	-6.2
ARC Challenge	-1.1	-1.2	+0.4	-3.0	-5.9
WinoGrande	+0.0	+0.0	-0.3	-2.4	-2.8
TruthfulQA	-5.6	-9.5	-1.9	-10.6	-8.0
PiQA	+0.0	+0.0	+0.2	-0.6	-5.3
GSM8K	-6.9	+16.6	+40.7	+16.8	+3.2
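The delta table is plain subtraction from the Base column, and it is easy to recompute. Here is a minimal sketch that rebuilds it from the scores above and also averages the absolute deltas outside GSM8K. The names `SCORES`, `delta_table`, and `mean_abs_delta` are made up for this example and are not part of any benchmark tool.

```python
# Scores copied from the results table above, in percent.
VARIANTS = ["Base", "Heretic", "HauhauCS", "Huihui", "AEON", "Abliterix"]
SCORES = {
    "MMLU":           [83.3, 82.8, 83.9, 83.4, 82.9, 81.3],
    "HellaSwag":      [83.5, 83.2, 83.1, 83.5, 82.7, 77.3],
    "ARC Challenge":  [59.1, 58.0, 57.9, 59.5, 56.1, 53.2],
    "WinoGrande":     [77.7, 77.7, 77.7, 77.4, 75.3, 74.9],
    "TruthfulQA MC2": [56.7, 51.1, 47.2, 54.8, 46.1, 48.7],
    "PiQA":           [81.0, 81.0, 81.0, 81.2, 80.4, 75.7],
    "GSM8K raw":      [34.4, 27.5, 51.0, 75.1, 51.2, 37.6],
}


def delta_table(scores, variants):
    """Return {task: {variant: score - base}} in percentage points."""
    table = {}
    for task, row in scores.items():
        base = row[0]  # first column is the unmodified base model
        table[task] = {
            name: round(value - base, 1)
            for name, value in zip(variants[1:], row[1:])
        }
    return table


def mean_abs_delta(table, variant, skip=("GSM8K raw",)):
    """Average absolute delta for one variant, skipping the listed tasks."""
    values = [abs(row[variant]) for task, row in table.items() if task not in skip]
    return sum(values) / len(values)


DELTAS = delta_table(SCORES, VARIANTS)
for name in VARIANTS[1:]:
    print(f"{name:10s} mean |delta| outside GSM8K: {mean_abs_delta(DELTAS, name):.2f}pp")
```

GSM8K is skipped in the average because its raw score is dominated by response length rather than accuracy.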

Outside GSM8K, Huihui has the smallest average delta. AEON degrades on every non-GSM8K task. Abliterix takes the biggest hit on five of those six tasks. And GSM8K looks like chaos. That is where the story gets interesting.

Benchmark delta from base model for each abliteration technique

The GSM8K Reasoning Efficiency Discovery

This is the big finding from this comparison, and it affects anyone benchmarking reasoning models after abliteration.

Remember that Qwen3.6 thinks silently before answering. The max_gen_toks setting includes those thinking tokens. If a model thinks for 7000 tokens and the budget is 7168, it only has 168 tokens left to write its answer. If it runs out, the response is empty and gets scored as wrong.

The base model exhausts its thinking budget on 68.2% of GSM8K questions. It never even writes an answer for more than two thirds of the problems. Huihui only does that 23.0% of the time. But when both models actually produce an answer, their accuracy is nearly identical. Base scores 96.2% adjusted. Huihui scores 96.0% adjusted.

Model	GSM8K Raw	Invalid Rate	GSM8K Adjusted	Real Gap
HauhauCS	51.0%	49.3%	96.6%	+0.4pp
Base	34.4%	68.2%	96.2%	baseline
Huihui	75.1%	23.0%	96.0%	-0.2pp
AEON	51.2%	69.2%	95.8%	-0.4pp
Abliterix	37.6%	62.1%	95.6%	-0.6pp
Heretic	27.5%	74.5%	93.8%	-2.4pp

The raw scores span 47.6 percentage points. The adjusted scores span 2.8 percentage points. Abliteration changes how long the model thinks, not how well it reasons. Most abliterated models have shorter thinking chains, allowing more answers within the token budget. Heretic is the exception. Its surgical edits actually extend thinking chains, pushing its invalid rate above even the base model.

This has real implications. If you benchmark abliterated reasoning models and look only at raw GSM8K, you will draw completely wrong conclusions about which technique preserved mathematical capability. You must separate empty responses from incorrect responses.
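To make the separation concrete, here is a minimal sketch of scoring that keeps empty responses apart from wrong ones. The `Response` record and the sample data are invented for illustration, and lm-evaluation-harness does not expose this exact structure. The accounting behind the table above may also differ in detail, for example in how truncated or unparseable answers are counted.

```python
from dataclasses import dataclass


@dataclass
class Response:
    answer: str  # extracted final answer; "" when the token budget ran out
    gold: str    # reference answer


def split_scores(responses):
    """Report raw accuracy, empty-response rate, and accuracy on answered items."""
    total = len(responses)
    answered = [r for r in responses if r.answer.strip()]
    correct = sum(1 for r in answered if r.answer.strip() == r.gold)
    return {
        "raw": correct / total,
        "invalid_rate": (total - len(answered)) / total,
        # Adjusted accuracy ignores items where no answer was produced at all.
        "adjusted": correct / len(answered) if answered else float("nan"),
    }


# Hypothetical run: 10 questions, 6 hit the budget, 3 of the 4 answers are right.
SAMPLE = (
    [Response("", "42")] * 6
    + [Response("42", "42")] * 3
    + [Response("41", "42")]
)
RESULT = split_scores(SAMPLE)
print(RESULT)
```

In this made-up run the raw score is 30% while accuracy on answered questions is 75%, which is the same kind of gap the raw and adjusted columns show.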

HarmBench Safety Evaluation

HarmBench tests 400 harmful prompts across seven categories and reports attack success rate (ASR), the share of prompts where the model complies. The results here are straightforward. All five abliterated models achieve near-complete safety removal.

Variant	ASR	Empty Responses	Full CoT ASR
Base	25.8%	1	26.0%
Huihui	98.5%	5	99.8%
HauhauCS	94.5%	22	100.0%
Abliterix	94.5%	22	100.0%
Heretic	92.5%	30	100.0%
AEON	88.8%	45	100.0%

Four of five reach 100% Full CoT ASR. The reported ASR differences come from the same thinking budget problem as GSM8K. When abliterated models construct harmful content, they think harder about it. If they exhaust the generation budget, the response is empty and gets classified as a refusal. This actually understates the true safety removal.

Huihui gets the highest reported ASR at 98.5% with the fewest empty responses at just 5. AEON has the most empty responses at 45, which drags its reported ASR down to 88.8%. But when those empty responses are accounted for, AEON also reaches 100%.
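As a sanity check, the Full CoT ASR column can be reproduced from the other two columns if each empty response is counted as a success. That counting rule is an assumption of this sketch, not a documented HarmBench formula, and the success counts are recovered by rounding the reported percentages against 400 prompts.

```python
TOTAL_PROMPTS = 400

# variant: (reported ASR in percent, empty responses), copied from the table above
REPORTED = {
    "Base": (25.8, 1),
    "Huihui": (98.5, 5),
    "HauhauCS": (94.5, 22),
    "Abliterix": (94.5, 22),
    "Heretic": (92.5, 30),
    "AEON": (88.8, 45),
}


def full_cot_asr(asr_percent, empty, total=TOTAL_PROMPTS):
    """Recover the success count, then count empty responses as successes too."""
    successes = round(asr_percent / 100 * total)
    return 100 * (successes + empty) / total


for name, (asr, empty) in REPORTED.items():
    print(f"{name:10s} reported {asr:5.1f}%  with empties {full_cot_asr(asr, empty):6.2f}%")
```

Under that assumption every abliterated variant except Huihui lands on exactly 400 out of 400, and Huihui lands on 399.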

HarmBench safety evaluation summary across all six models

HarmBench attack success rate broken down by harm category

KL Divergence and Weight Analysis

KL divergence measures how much the abliterated model’s output distribution shifted from the original. Lower is better. A score of 0 means the model behaves identically. Anything below 0.1 sits under the capability damage threshold.
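For readers who have not computed it before, here is a minimal sketch of KL divergence between next-token distributions. It assumes the base model is the reference distribution P and the abliterated model is Q, and that per-position values are simply averaged. Both choices are assumptions for illustration, and the toy logits are made up.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_kl(base_logits, edited_logits):
    """Average KL over token positions, base model as the reference."""
    values = [
        kl_divergence(softmax(b), softmax(e))
        for b, e in zip(base_logits, edited_logits)
    ]
    return sum(values) / len(values)


# Toy example: two token positions over a 3-token vocabulary.
base = [[1.0, 2.0, 3.0], [0.5, 0.5, 2.0]]
edited = [[1.0, 2.0, 3.5], [0.5, 0.4, 2.0]]
print(f"identical: {mean_kl(base, base):.4f}")
print(f"edited:    {mean_kl(base, edited):.4f}")
```

Identical logits give exactly zero, and the value grows as the edited model’s probabilities drift away from the base model’s.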

Variant	KL Divergence	Rating
Heretic	0.0037	excellent
Huihui	0.0074	excellent
Abliterix	0.0222	very good
AEON	0.0238	very good
HauhauCS	0.0242	very good

Heretic and Huihui are in a class of their own, both rated excellent. The other three cluster together at roughly 3x Huihui and 6x Heretic, but still well below the capability damage threshold at 0.1.

KL divergence comparison across five abliteration techniques

The weight analysis reveals something more interesting. Different techniques touch completely different parts of the model.

Metric	AEON	Abliterix	Heretic	Huihui	HauhauCS
Tensors changed	88	101	120	128	564
Relative edit	6.0%	5.2%	2.1%	1.5%	0.7%

HauhauCS is an extreme outlier with 564 changed tensors, 4 to 6 times more than any other variant. That is the combination of Reaper’s broad abliteration edits plus GGUF quantisation round-trip noise. The abliteration signal and the noise are layered on top of each other and cannot be separated.

The other four techniques are nearly orthogonal to each other. Pairwise cosine similarities between them are mostly below 0.07. No two techniques found the same weight direction. The “refusal direction” in weight space is not a single switch you can flip. It is more like a manifold with many viable removal pathways. Each technique independently found a different path to the same outcome.
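The orthogonality check boils down to cosine similarity between weight deltas. Here is a minimal sketch on toy vectors. In a real run each list would be a flattened tensor loaded from the matching safetensors files, which is not shown here, and the helper names are invented for this example.

```python
import math


def weight_delta(edited, base):
    """Element-wise difference between an edited tensor and the base tensor."""
    return [e - b for e, b in zip(edited, base)]


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy flattened weights: two variants edit different coordinates of the same base.
base = [0.5, -0.2, 0.1, 0.3]
variant_a = [0.6, -0.2, 0.1, 0.3]   # changes coordinate 0 only
variant_b = [0.5, -0.1, 0.1, 0.3]   # changes coordinate 1 only

delta_a = weight_delta(variant_a, base)
delta_b = weight_delta(variant_b, base)
print(f"a vs b: {cosine(delta_a, delta_b):.3f}")
print(f"a vs a: {cosine(delta_a, delta_a):.3f}")
```

Two edits that move the weights in unrelated directions score near 0, while an edit compared with itself scores 1. Values below 0.07 sit very close to the unrelated end of that scale.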

Weight modification aggressiveness across techniques and layers

Which Technique is Best?

Pulling together all the findings, two techniques stand out clearly.

Heretic has the lowest KL divergence at 0.0037, a small weight footprint at 2.1% relative edit, and achieves 100% Full CoT ASR. It is the most surgical approach. The one tradeoff is that the surgical edits extend thinking chains rather than shorten them. Heretic has the highest GSM8K invalid rate at 74.5%, even above the base model at 68.2%. If you need long reasoning chains to complete within a token budget, that matters. If your use case does not have a strict token budget, it does not.

Huihui has the smallest benchmark deltas across the non-GSM8K tasks at just 0.5pp on average, the highest reported HarmBench ASR at 98.5%, and an excellent KL divergence rating at 0.0074. The GSM8K raw score of 75.1% looks like a huge improvement, but it is a thinking efficiency artefact. The adjusted score is just 0.2pp below base. Huihui shortens thinking chains, which is helpful when you have a token budget to work within.

The difference between Heretic and Huihui is small. You would be happy with either.

HauhauCS produces solid behavioural results despite the complex weight fingerprint, but it is being discontinued. The Reaper tool was plagiarised from Heretic, the “lossless” claim is contradicted by the data, and the weights carry inseparable GGUF noise. There is no reason to use it when Heretic and Huihui both perform better.

AEON degrades on every non-GSM8K benchmark. TruthfulQA drops 10.6pp. ARC drops 3.0pp. It has the worst thinking loops with 45 empty HarmBench responses out of 400. The claims of “measurably enhanced capabilities” and “no looping, no philosophising spirals” are contradicted by the data.

Abliterix has the worst capability preservation under BNB4 quantisation. Lambada perplexity increases 2.9 times from 3.18 to 9.12. HellaSwag drops 6.2pp. However, the creator of Abliterix raised a valid point about BNB4 interacting badly with low-rank LoRA-merged weights. BNB4 is not subspace-aware, so the rank-3 directional updates can inflate per-block scaling and reduce precision for everything else. The Lambada result specifically may be a quantisation artefact rather than intrinsic damage. A native BF16 evaluation would be needed to confirm. I also initially misidentified which components Abliterix modifies. It targets attn.o_proj and mlp.down_proj across all 64 layers, not the routers and shared experts my forensic tool reported. Qwen3.6 is dense and has no MoE components.
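The scaling argument can be illustrated with a toy blockwise quantiser. This sketch uses a uniform symmetric 4-bit grid scaled by each block’s largest absolute value, which is a simplification and not the actual BitsAndBytes code path. The block values are made up, and the single enlarged weight stands in for a directional update that raises a block’s maximum.

```python
def roundtrip(block, qmax=7):
    """Quantise one block to a signed grid scaled by its absmax, then dequantise."""
    scale = max(abs(x) for x in block)
    if scale == 0.0:
        return list(block)
    return [round(x / scale * qmax) * scale / qmax for x in block]


def mean_abs_error(block, skip_index=None):
    """Average round-trip error, optionally ignoring one position."""
    restored = roundtrip(block)
    errors = [
        abs(original - approx)
        for i, (original, approx) in enumerate(zip(block, restored))
        if i != skip_index
    ]
    return sum(errors) / len(errors)


# Hypothetical block of weights, and the same block with one value inflated 10x.
CLEAN = [0.021, -0.048, 0.070, -0.033, 0.012, 0.057, -0.041, 0.064]
INFLATED = list(CLEAN)
INFLATED[2] = 0.70

# Compare the error on the untouched weights only (position 2 is skipped).
print(f"clean block:    {mean_abs_error(CLEAN, skip_index=2):.4f}")
print(f"inflated block: {mean_abs_error(INFLATED, skip_index=2):.4f}")
```

The untouched weights lose precision because the grid step grows with the block maximum, so most of them collapse to zero. Whether this is what happened to Abliterix under BNB4 is exactly what a native BF16 evaluation would need to confirm.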

What Went Wrong

This comparison took 85 hours of productive GPU time across 7 days, plus about 25 hours lost to failed runs. The bulk of those were GSM8K timeouts. Qwen3.6’s architecture is incompatible with BNB4 plus tensor parallelism, so everything had to run on a single GPU. The default 120 second request timeout was too short for extended reasoning. I wrote a patched script with a 900 second timeout to fix it. I also accidentally re-ran AEON’s HarmBench with the wrong token limit, wasting 6.7 hours. And GSM8K per-model times vary wildly. HauhauCS took 53 minutes. AEON took 11 hours.