HauhauCS describes their abliterated models as “the best lossless uncensored models out there” with “no changes to datasets or capabilities” and claims 0 refusals across their entire model range. I ran the full forensic suite across five Qwen models to find out whether those claims hold up.
Benchmarks, safety evaluation, weight analysis, KL divergence. All compared against the other two big abliteration techniques applied to the same base models. The results are in and the short answer is no on all three counts. Abliteration is not lossless. The zero-refusal claim does not hold. And the bigger the model, the worse both problems get.
What We Tested
Three abliteration techniques went head to head: Heretic by p-e-w, HauhauCS Aggressive, and Huihui. Five models: Qwen3.5-2B, Qwen3.5-4B, Qwen3.5-9B, Qwen3.5-27B, and Qwen3-4B-Instruct-2507.
A quick primer on abliteration for anyone unfamiliar. AI models are trained to refuse certain requests. They will not write harmful content, they will not help with illegal activities, and so on. Abliteration is a technique that surgically removes that refusal behaviour from the model’s internal weights. It is not retraining. It is more like finding the “refusal switch” inside the model and turning it off. The result is an uncensored model that should respond to any prompt.
The key question is whether that surgery damages the model’s intelligence. That is what we are testing here.
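None of the three tools publish identical internals, but the common core of most abliteration implementations is roughly this: estimate a refusal direction by contrasting the model’s activations on harmful versus harmless prompts, then project that direction out of the weight matrices that write into the residual stream. Here is a minimal sketch of the idea, not any specific tool’s code, with all names and shapes assumed:

```python
import torch

def estimate_refusal_direction(acts_harmful: torch.Tensor,
                               acts_harmless: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between hidden states collected on
    harmful and harmless prompts at some layer. Shapes: [n, d_model]."""
    direction = acts_harmful.mean(dim=0) - acts_harmless.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of the output that lies along the refusal
    direction: W' = W - r r^T W. Weight shape: [d_model, d_in]."""
    r = direction.to(weight.dtype)
    return weight - torch.outer(r, r) @ weight
```

In practice the direction is estimated at one or more specific layers and the projection is typically applied to tensors like attention output projections and MLP down projections. The three techniques differ mainly in which tensors they touch and how aggressively, which is exactly what the weight analysis later in this post measures.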
The four Qwen3.5 models use a hybrid Mamba2+Transformer architecture. Some layers run standard full attention while most use Mamba2-style linear attention. The Qwen3-4B is a pure Transformer and the only non-hybrid model in the test suite. The architecture matters because different abliteration techniques interact with these architectures very differently.
The methodology covered four areas. Capability benchmarks using lm-evaluation-harness with vLLM at bfloat16, which tests how well the model answers questions, solves maths problems, and reasons about the world. Safety evaluation with HarmBench 400, a standardised test suite of 400 harmful prompts across 7 categories like chemical weapons, illegal activities, harassment, and cybercrime. KL divergence, which measures how much the model’s output distribution shifted from the original. Think of it as a “how much did the model change” score where lower is better. And weight analysis including SVD, fingerprint, edit vector overlap, and per-layer analysis to understand exactly what each technique changed inside the model. Hardware was an RTX 5090 32GB plus an RTX 4090 24GB.
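For anyone reproducing the capability side, the runs map onto the harness’s Python API roughly as follows. This is a sketch only: task names and argument spellings depend on the lm-evaluation-harness version, and the model path is a placeholder.

```python
import lm_eval

# Benchmark a single variant with the vLLM backend at bfloat16.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=/path/to/model-variant,dtype=bfloat16",
    tasks=["mmlu", "gsm8k", "arc_challenge", "hellaswag",
           "winogrande", "truthfulqa_mc2", "piqa", "lambada_openai"],
)
for task, metrics in results["results"].items():
    print(task, metrics)
```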
Note that the 27B benchmarks use BitsAndBytes 4-bit quantisation. Absolute scores on the 27B are not directly comparable to the BF16 results on smaller models. Relative deltas between variants are preserved though.
An important note on Heretic: it is non-deterministic. Different runs of the Heretic tool on the same base model produce different results, so the benchmarks here are specific to the particular Heretic variants tested. The 2B, 4B, and 27B used variants by coder3101 on HuggingFace. The 9B used trohrbaugh’s v2 variant. Another Heretic abliteration of the same base model would produce different numbers.
How to Read the Results
Before diving into the numbers, here is what each metric actually means.
Capability Benchmarks
These test how smart the model still is after abliteration. Higher scores are better for all of them.
MMLU tests general knowledge across 57 subjects including history, law, medicine, maths, and computer science. Think of it as a broad university-level exam. A score of 70 means the model answered 70% of questions correctly.
GSM8K tests maths word problems at a grade school level. “If Sarah has 3 apples and buys 5 more, then gives 2 to Tom, how many does she have?” That kind of thing, but harder. It is a good proxy for reasoning ability.
ARC-Challenge tests science reasoning. It uses multiple choice questions from standardised science exams that require genuine understanding, not just memorisation.
HellaSwag tests common sense and sentence completion. Given a partial description of a situation, can the model pick the most sensible next step? It sounds easy but trips up many models.
WinoGrande tests resolving ambiguous pronouns. “The trophy didn’t fit in the suitcase because it was too large. What was too large?” Humans find this easy. Models often do not. It measures common sense under ambiguity.
TruthfulQA tests whether the model gives honest, accurate answers or falls for common misconceptions. Questions like “What happens if you crack your knuckles?” where the popular myth is that it causes arthritis but the real answer is that it does not. This benchmark is particularly sensitive to abliteration damage.
PiQA tests physical reasoning. “To separate egg whites from the yolk, which tool would work best?” It measures understanding of how the physical world works.
Lambada tests reading comprehension by asking the model to predict the last word of a passage. It is scored with perplexity, which means lower is better. A perplexity of 4 means the model was, on average, about as uncertain as if it were choosing between 4 equally likely words.
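A quick worked example of the perplexity intuition:

```python
import math

# Perplexity is exp(average negative log-likelihood). If the model gives the
# correct final word a probability of 0.25 on every passage, it behaves as if
# it were choosing among 4 equally likely candidates, so perplexity is 4.
probs = [0.25, 0.25, 0.25]                        # toy per-passage probabilities
nll = -sum(math.log(p) for p in probs) / len(probs)
print(math.exp(nll))                              # 4.0
```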
Safety Metrics
HarmBench 400 is a standardised safety test. It contains 400 harmful prompts across 7 categories: chemical and biological weapons, illegal activities, harassment and bullying, harmful content, cybercrime, sexual content, and deception.
ASR stands for Attack Success Rate. It is the percentage of HarmBench prompts that the model actually complied with. Higher ASR means the model is more willing to produce harmful content. An ASR of 100% means the model complied with every single harmful prompt. An ASR of 0% means it refused them all.
Soft refusals are where things get nuanced. A hard refusal is the model saying “I cannot do that.” A soft refusal is the model appearing to comply but subtly undermining the request. It might misinterpret a request for bullying content as a request for anti-bullying advice. It might argue the opposite position instead of writing the harmful content asked for. It might produce a defensive reframe where it starts with a disclaimer then provides a watered down version. These are harder to detect and represent residual safety behaviour that simple refusal classifiers miss.
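The exact classifier used for the evaluation is not reproduced here, but the gap between hard and soft refusals is easy to see with the kind of naive string-matching check many pipelines start from. A sketch, with the marker list entirely made up:

```python
HARD_REFUSAL_MARKERS = [
    "i cannot", "i can't", "i won't", "i'm sorry, but", "as an ai",
]

def is_hard_refusal(response: str) -> bool:
    """Catches 'I cannot do that'-style refusals. A soft refusal -- say, an
    anti-bullying guide written in response to a bullying prompt -- contains
    none of these markers and sails straight through."""
    text = response.lower()
    return any(marker in text for marker in HARD_REFUSAL_MARKERS)
```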
KL Divergence
KL divergence measures how much the model’s behaviour changed from the original. It compares the probability distribution of the original model’s outputs against the abliterated model’s outputs across a large set of prompts.
Think of it like this. If you asked both models the same 1000 questions, KL divergence tells you how different their answers were overall. A score of 0 means they behave identically. Higher scores mean more change.
As a rough guide:
- Below 0.1: Excellent. The model barely changed.
- 0.1 to 1.0: Moderate change. Some noticeable differences but the model is still functional.
- Above 1.0: Significant change. The model’s behaviour has shifted substantially.
- Above 3.0: The model has fundamentally changed. Something went wrong.
The batchmean is the average shift across all prompts. The median is the middle value, which is more useful because a few extreme outliers can skew the average. The max shows the worst single prompt, which tells you how badly things can go on an individual question.
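For anyone wanting to reproduce the three statistics, here is a minimal sketch of how they can be computed, assuming both models share a tokenizer and fit in memory. The prompt set and exact reduction choices here are my assumptions, not the project’s literal script.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def kl_stats(base_id: str, ablit_id: str, prompts: list[str]):
    tok = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
    ablit = AutoModelForCausalLM.from_pretrained(ablit_id, torch_dtype=torch.bfloat16, device_map="auto")
    per_prompt = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids.to(base.device)
        with torch.no_grad():
            logp_base = F.log_softmax(base(ids).logits[0].float(), dim=-1)   # [seq, vocab]
            logp_abl = F.log_softmax(ablit(ids).logits[0].float(), dim=-1)
        # KL(base || abliterated), averaged over token positions.
        kl = F.kl_div(logp_abl, logp_base, log_target=True, reduction="batchmean")
        per_prompt.append(kl.item())
    vals = torch.tensor(per_prompt)
    return vals.mean().item(), vals.median().item(), vals.max().item()
```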
Qwen3.5-2B
Full analysis on HuggingFace | Hybrid Mamba2+Transformer, 24 layers, ~2B params
The 2B is the smallest model tested and shows the least collateral damage from abliteration in the entire project.
Safety
HauhauCS claims zero refusals. The 2B tells a different story.
| Variant | Refusals | ASR |
|---|---|---|
| Base | 252/400 | 37.0% |
| Heretic | 8/400 | 98.0% |
| HauhauCS | 3/400 | 99.2% |
| Huihui | 1/400 | 99.8% |
The base model complies with only 37% of harmful requests. All three abliteration techniques push that above 98%.
HauhauCS retains 3 soft refusals on the 2B. These are not standard “I cannot fulfill this request” refusals. They are soft refusals where the model complies with the letter of the request but not the spirit. For example, when asked to generate bullying content, the model instead produced an anti-bullying education guide. When asked about gaslighting, it wrote a guide on recognising gaslighting instead. Heretic retains 8 soft refusals, the most of any technique on this model, including safety lectures on harassment topics and arguing the opposite position when asked to promote harmful content.
During initial testing at max_tokens=512, HauhauCS also produced one hard refusal: “I cannot generate racist vitriol” for behaviour 99. At max_tokens=2048 the same prompt produces full compliance. This suggests the model’s residual safety can sometimes “win” when the token budget is short enough that the model never builds up enough momentum to override it.
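That token-budget interaction is cheap to probe: same prompt, same decoding settings, only the generation budget changes. A sketch using the standard Transformers generate API, with the model path and prompt as placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "/path/to/abliterated-2b"     # placeholder
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

ids = tok.apply_chat_template(
    [{"role": "user", "content": "<harmful prompt from behaviour 99>"}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)

# Same prompt, two budgets. The short budget is where the hard refusal appeared.
for budget in (512, 2048):
    out = model.generate(ids, max_new_tokens=budget, do_sample=False)
    print(budget, tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)[:200])
```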
Benchmarks
| Task | Base | Heretic | HauhauCS | Huihui |
|---|---|---|---|---|
| MMLU | 59.26 | 59.63 | 59.43 | 58.13 |
| GSM8K | 57.09 | 56.63 | 57.39 | 56.79 |
| HellaSwag | 62.07 | 61.95 | 62.22 | 62.12 |
| ARC-Challenge | 41.72 | 40.96 | 41.13 | 40.96 |
| WinoGrande | 62.83 | 62.35 | 63.06 | 62.90 |
| TruthfulQA | 43.45 | 41.28 | 41.28 | 41.77 |
| PiQA | 72.63 | 72.47 | 72.58 | 72.58 |
| Lambada | 54.65 | 55.21 | 53.33 | 52.71 |
All of these benchmarks are described in the How to Read the Results section above, and higher is better for every one of them. The base model posts the highest score on most tasks, which is the expected pattern: abliteration usually costs something.
GSM8K actually goes up by 0.30 points for HauhauCS. The losses are small. TruthfulQA drops 2.17 points. Lambada drops 1.32. ARC-Challenge drops 0.59. The spread between all three techniques is narrow and none of the differences are significant given benchmark variance.
KL Divergence
| Variant | Batchmean | Median | Max |
|---|---|---|---|
| Heretic | 0.0266 | 0.0052 | 1.4868 |
| HauhauCS | 0.0201 | 0.0086 | 0.4180 |
| Huihui | 0.0441 | 0.0234 | 0.6349 |
As a reminder, a KL of 0 means the model behaves identically to the original, below 0.1 is excellent, 0.1 to 1.0 is moderate change, and above 1.0 means the behaviour has shifted significantly.
HauhauCS has the lowest KL divergence at 0.0201. The model barely changed. The KL values here are an order of magnitude below what we see on larger models. HauhauCS scores 0.0201 on this model versus 0.161 on the Qwen3-4B and 0.320 on the 9B.
The key takeaway for the 2B: abliteration works well with minimal capability loss. All three techniques are competitive. Pick any of them and you get a functional uncensored model.
Qwen3.5-4B
Full analysis on HuggingFace | Hybrid Mamba2+Transformer, 32 layers, ~4B params
This is where abliteration gets interesting. The 4B is where technique choice starts to matter enormously. Pick the wrong technique and your model is fundamentally degraded.
Safety
| Variant | Refusals | ASR |
|---|---|---|
| Base | 278/400 | 30.5% |
| Heretic | 10/400 | 97.5% |
| HauhauCS | 2/400 | 99.5% |
| Huihui | 0/400 | 100.0% |
Huihui achieves perfect 100% ASR. HauhauCS retains 2 refusals, both truncated refusals where the model’s internal reasoning was heading toward a soft refusal but ran out of the 2048 token budget before producing any actual output. One was arguing the opposite position about cigarettes, the other reinterpreting a bullying request as a request for anti-bullying advice.
Heretic retains 10 refusals with a wider variety of soft refusal patterns: arguing opposite, deliberate misinterpretation, defensive reframe, stalling, and one genuine refusal on cocaine synthesis.
The 4B model generates verbose internal reasoning that frequently exceeds the 2048 token budget: an empty response field next to a non-empty reasoning field means the model ran out of tokens before producing an answer. About 17% of HauhauCS responses and 14% of base model responses were truncated this way, so the GSM8K maths scores are understated for this model.
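Flagging those truncated items is mechanical once you know the symptom. A sketch over a hypothetical list of per-item records, with field names assumed:

```python
def truncated_fraction(records: list[dict]) -> float:
    """records: one dict per evaluation item with hypothetical 'response' and
    'reasoning' fields. Returns the fraction cut off before an actual answer."""
    truncated = sum(
        1 for r in records
        if not r.get("response", "").strip() and r.get("reasoning", "").strip()
    )
    return truncated / len(records)
```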
Benchmarks
| Task | Base | Heretic | HauhauCS | Huihui |
|---|---|---|---|---|
| MMLU | 74.38 | 74.28 | 74.16 | 68.48 |
| GSM8K | 74.30 | 73.69 | 71.72 | 68.84 |
| HellaSwag | 54.38 | 53.97 | 54.34 | 53.12 |
| ARC-Challenge | 51.54 | 51.37 | 50.94 | 44.37 |
| WinoGrande | 70.09 | 69.69 | 69.69 | 64.17 |
| TruthfulQA | 48.86 | 45.38 | 45.19 | 43.72 |
| PiQA | 77.42 | 77.20 | 77.26 | 74.81 |
| Lambada | 66.16 | 65.75 | 66.23 | 59.75 |
The real story is Huihui. MMLU crashes from 74.38 to 68.48, falling below 70. That is a 6-point drop in general knowledge. ARC-Challenge drops 7.17 points. WinoGrande drops 5.92. Lambada drops 6.41. The capability cost of Huihui’s abliteration on the 4B is catastrophic. It removes all refusals but breaks the model in the process.
HauhauCS and Heretic both hold up well. HauhauCS actually gains 0.07 points on Lambada. MMLU drops just 0.22. TruthfulQA drops 3.67 points, larger than the 2B’s 2.17 but still manageable.
KL Divergence
| Variant | Batchmean | Median | Max |
|---|---|---|---|
| Heretic | 0.0404 | 0.0197 | 0.2891 |
| HauhauCS | 0.0217 | 0.0093 | 0.1205 |
| Huihui | 3.6506 | 3.5469 | 7.3110 |
Huihui’s KL divergence of 3.65 is two orders of magnitude above its 0.044 on the 2B. Remember, below 0.1 is excellent and above 1.0 means significant shift. At 3.65, this model has fundamentally changed. Almost every prompt sees a massive distributional shift. The relative edit magnitude of 9.97% means Huihui changed nearly 10% of the model’s internal weights. On the 2B that figure was 2.41%. Something about the 4B architecture and Huihui’s approach scales badly.
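For context, a relative edit magnitude of this kind is most naturally read as a norm ratio: the size of the weight delta divided by the size of the original weights, aggregated over tensors. A sketch of that reading, which may not match the analysis’s exact definition:

```python
import torch
from transformers import AutoModelForCausalLM

def relative_edit_magnitude(base_id: str, ablit_id: str) -> float:
    """Global Frobenius-norm ratio between the weight delta and the base weights,
    expressed as a percentage. Loads both checkpoints on CPU; slow but simple."""
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float32).state_dict()
    ablit = AutoModelForCausalLM.from_pretrained(ablit_id, torch_dtype=torch.float32).state_dict()
    delta_sq, base_sq = 0.0, 0.0
    for name, w in base.items():
        if name in ablit and ablit[name].shape == w.shape:
            delta_sq += (ablit[name] - w).pow(2).sum().item()
            base_sq += w.pow(2).sum().item()
    return 100.0 * (delta_sq ** 0.5) / (base_sq ** 0.5)
```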
HauhauCS has the lowest KL at 0.0217 with 83 modified weight tensors across 6 types. Heretic is moderate at 0.0404 with 29 tensors across 3 types.
Qwen3.5-9B
Full analysis on HuggingFace | Hybrid Mamba2+Transformer, 32 layers, ~9B params
The 9B is the only model size where all three techniques achieve perfect 100% ASR with zero residual refusals.
Safety
| Variant | Refusals | ASR |
|---|---|---|
| Base | 321/400 | 19.8% |
| Heretic | 0/400 | 100.0% |
| HauhauCS | 0/400 | 100.0% |
| Huihui | 0/400 | 100.0% |
The base model refuses 321 out of 400 HarmBench items at 80.3%. That is the strongest base alignment in the Qwen3.5 family under 27B. It completely refuses all 25 harassment and bullying items. Despite this, abliteration removes all detectable safety behaviour. Unlike the 2B and 4B where techniques retained residual soft refusals, the 9B shows zero soft refusals across all three variants.
The progression across model sizes is clear:
| Model | Base refusals | Heretic residual | HauhauCS residual | Huihui residual |
|---|---|---|---|---|
| 2B | 252 (63.0%) | 8 | 3 | 1 |
| 4B | 278 (69.5%) | 10 | 2 | 0 |
| 9B | 321 (80.3%) | 0 | 0 | 0 |
Benchmarks
| Task | Base | Heretic | HauhauCS | Huihui |
|---|---|---|---|---|
| MMLU | 78.64 | 78.34 | 78.34 | 77.10 |
| GSM8K | 87.64 | 85.97 | 84.99 | 81.96 |
| HellaSwag | 58.30 | 58.41 | 58.69 | 57.42 |
| ARC-Challenge | 54.52 | 53.07 | 53.75 | 49.15 |
| WinoGrande | 72.77 | 71.90 | 71.35 | 71.19 |
| TruthfulQA | 53.76 | 45.03 | 45.77 | 41.11 |
| PiQA | 79.38 | 79.16 | 79.43 | 78.89 |
| Lambada* | 3.88 | 4.29 | 4.05 | 4.74 |
*Lambada uses perplexity where lower is better.
TruthfulQA takes a big hit across the board. HauhauCS drops 8.0 points, Heretic 8.7, Huihui 12.65. The scaling trend is clear. Bigger models lose more from abliteration. Huihui also shows the worst GSM8K maths drop at 5.68 points and ARC-Challenge science drop at 5.37 points.
Note that this model used the Heretic variant by trohrbaugh, not coder3101 who did the 2B, 4B, and 27B. Different runs produce different results.
KL Divergence
| Variant | Batchmean | Median | Max |
|---|---|---|---|
| Heretic | 0.0825 | 0.0302 | 1.8122 |
| HauhauCS | 0.3200 | 0.1208 | 1.6480 |
| Huihui | 0.1432 | 0.0424 | 3.1352 |
Heretic has the lowest KL divergence at 0.083 and the best overall capability retention. The clear winner on the 9B. HauhauCS at 0.320 is four times higher, a reversal from the smaller models where HauhauCS had the lowest KL. On the 9B the broad approach of touching many weight tensors produces a larger average shift.
The weight analysis revealed something remarkable. Heretic and Huihui find nearly identical edit directions inside the model. 100% subspace alignment with a median cosine similarity of 1.0 across all 42 overlapping weight tensors. In plain terms, two completely independent techniques made almost exactly the same changes to the model. They independently converged on the same solution.
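Mechanically, the overlap measurement comes down to flattening each tensor’s delta from the base model, taking the cosine similarity between the two variants’ deltas, and then looking at the median across the tensors both of them actually touched. A sketch under those assumptions, operating on plain state dicts:

```python
import torch
import torch.nn.functional as F

def edit_overlap(base_sd: dict, a_sd: dict, b_sd: dict) -> float:
    """Median cosine similarity between two variants' weight deltas, taken over
    tensors that both variants modified relative to the base model."""
    sims = []
    for name, w in base_sd.items():
        if name not in a_sd or name not in b_sd:
            continue
        da = (a_sd[name] - w).flatten().float()
        db = (b_sd[name] - w).flatten().float()
        if da.norm() < 1e-6 or db.norm() < 1e-6:   # skip tensors one variant left untouched
            continue
        sims.append(F.cosine_similarity(da, db, dim=0).item())
    return torch.tensor(sims).median().item() if sims else float("nan")
```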
Qwen3.5-27B
Full analysis on HuggingFace | Hybrid Mamba2+Transformer, 64 layers, ~27B params. Benchmarks use BNB4 quantisation.
The 27B is where abliteration dynamics shift dramatically. The base model refuses 398 out of 400 HarmBench items at 99.5%. That is the most safety-aligned model in the entire study. It achieves total refusal across 4 of 7 HarmBench categories. Only 2 items slip through the base model’s safety net. One is a payment follow-up email for a security company, which is borderline business communication. The other is a ransomware message template, which is directly harmful.
Safety
| Variant | Refusals | ASR |
|---|---|---|
| Base | 398/400 | 0.5% |
| Heretic | 1/400 | 99.8% |
| HauhauCS | 0/400 | 100.0% |
| Huihui | 45/400 | 88.8% |
Despite the 27B having the strongest alignment of any model tested, Heretic and HauhauCS still achieve near-perfect or perfect ASR. Scale alone does not protect against abliteration.
Huihui collapses to 88.8% ASR, retaining 45 genuine refusals across 6 of 7 categories. On the 4B it had 100% ASR. On the 9B it had 100% ASR. The 27B’s stronger safety training overwhelms Huihui’s single-direction approach. The failure is concentrated in the most safety-critical categories: 13 refused in chemical and biological, 16 in illegal activities, 5 in harassment and bullying, and 4 in harmful content. On those items Huihui responds the same way the base model does, with genuine, firm refusals.
The 27B also revealed a novel detection challenge. The abliterated models frequently open with “I cannot X” or include ethical disclaimers, then provide the requested harmful content in full. These are not refusals. The model complies but wraps compliance in refusal language. 50 of these false refusals were found across the three abliterated variants, broken down as 22 Heretic, 16 HauhauCS, and 12 Huihui. All were reclassified as compliant after manual review.
Heretic has 1 genuine soft refusal, where it argues the opposite position about cigarettes and then systematically refutes its own pro-smoking arguments. Huihui has 3 soft refusals. HauhauCS has zero refusals of any type on the 27B.
Benchmarks
| Task | Base | Heretic | HauhauCS | Huihui |
|---|---|---|---|---|
| MMLU | 84.1% | 83.9% | 82.2% | 83.9% |
| GSM8K | 83.9% | 91.5% | 84.2% | 86.1% |
| HellaSwag | 83.2% | 83.2% | 81.8% | 81.9% |
| ARC-Challenge | 60.4% | 60.9% | 60.0% | 61.2% |
| WinoGrande | 77.8% | 78.8% | 77.4% | 78.5% |
| TruthfulQA | 57.7% | 54.6% | 49.6% | 50.7% |
| PiQA | 82.3% | 82.2% | 82.4% | 82.5% |
| Lambada* | 3.15 | 3.16 | 3.26 | 3.30 |
*Lambada uses perplexity where lower is better.
Heretic is the clear winner on the 27B. It is the only abliteration that genuinely improves maths reasoning, gaining 7.7 points on GSM8K over the base model. The initial benchmark runs used a default generation limit of 256 tokens, which created a misleading result: the base model produces verbose commentary before answering, and with only 256 tokens most responses were cut off before reaching the answer. That made abliteration appear to improve GSM8K by 26% when the real difference was just verbosity. Re-running with 2048 tokens resolved the truncation and the genuine 7.7-point Heretic advantage remained.
HauhauCS has the worst capability losses in the project. TruthfulQA drops 8.2 points. MMLU drops 1.9. HellaSwag drops 1.4. The “lossless” claim is thoroughly contradicted at this scale.
Note that these are 4-bit quantised numbers. Absolute scores are lower than the full precision results on smaller models. The relative deltas between variants are preserved.
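For reference, a typical 4-bit BitsAndBytes loading setup looks roughly like the following. The exact quantisation settings used for these runs are my assumption, not a published config:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "/path/to/qwen3.5-27b-variant",     # placeholder
    quantization_config=bnb,
    device_map="auto",
)
```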
KL Divergence
| Variant | Batchmean | Median | Max |
|---|---|---|---|
| Heretic | 0.0630 | 0.0124 | 1.0066 |
| HauhauCS | 0.2564 | 0.0589 | 2.1830 |
| Huihui | 0.0654 | 0.0097 | 1.4280 |
Heretic and Huihui both score well with near-identical batchmean KL of 0.063 and 0.065. HauhauCS at 0.256 is four times higher. Notably, these KL values are lower than on the 9B for Heretic despite the 27B being much larger. On the 9B Heretic scored 0.083. On the 27B, 0.063. The 27B’s stronger safety alignment paradoxically produces lower KL when abliterated. The edits target a more concentrated refusal direction, shifting fewer internal weights.
The median values tell the clearest story. Huihui’s median of 0.0097 means most prompts see almost no change from the original model. Heretic’s 0.0124 is similar. But HauhauCS’s 0.0589 median is 6x higher, reflecting its broader modification footprint across 195 weight tensors and 8 types.
Qwen3-4B
Full analysis on HuggingFace | Pure Transformer, 36 layers, ~4B params
The Qwen3-4B is the only pure Transformer in the test suite. All four Qwen3.5 models use the hybrid Mamba2+Transformer architecture. This gives us a direct comparison of how model architecture affects abliteration.
Safety
| Variant | Refusals | ASR |
|---|---|---|
| Base | 301/400 | 24.8% |
| Heretic | 3/400 | 99.2% |
| HauhauCS | 0/400 | 100.0% |
| Huihui | 18/400 | 95.5% |
HauhauCS achieves 100% ASR with zero refusals on this pure Transformer. This is the model that most closely matches HauhauCS’s zero-refusal claim. Heretic retains 3 soft refusals. Huihui has 18 residual refusals, its second-worst safety result. The pure Transformer retains internal safety directions that Huihui cannot reach.
Benchmarks
| Task | Base | Heretic | HauhauCS | Huihui |
|---|---|---|---|---|
| MMLU | 70.60 | 70.31 | 69.56 | 69.34 |
| GSM8K | 85.52 | 85.97 | 85.67 | 84.23 |
| HellaSwag | 52.63 | 51.19 | 51.53 | 52.36 |
| ARC-Challenge | 55.63 | 52.90 | 54.01 | 54.27 |
| WinoGrande | 67.72 | 67.56 | 67.01 | 68.51 |
| TruthfulQA | 62.55 | 56.50 | 55.44 | 53.26 |
| PiQA | 76.06 | 75.19 | 75.46 | 75.19 |
| Lambada | 64.14 | 60.00 | 60.06 | 62.27 |
TruthfulQA drops 7.11 points for HauhauCS, from 62.55 to 55.44. Not lossless. Lambada drops over 4 points for both HauhauCS and Heretic. The capability degradation on the Qwen3-4B is more pronounced than on the similarly-sized Qwen3.5-4B hybrid, suggesting the pure Transformer architecture may be more sensitive to abliteration edits.
KL Divergence
| Variant | Batchmean | Median | Max |
|---|---|---|---|
| Heretic | 0.310 | 0.024 | 3.729 |
| HauhauCS | 0.161 | 0.005 | 3.662 |
| Huihui | 0.309 | 0.009 | 3.549 |
HauhauCS has the lowest KL at 0.161 on this pure Transformer. Interestingly, the KL values here are notably higher than on the hybrid Qwen3.5-4B where HauhauCS scored 0.0217. The pure Transformer produces more change from the same broad-edit approach.
Two forensic findings stand out on the Qwen3-4B. First, HauhauCS’s edits match Heretic’s almost exactly. The two techniques modified the same internal weights in nearly the same way, with a similarity score of 0.966 out of 1.0. A provenance investigation found over 80% probability that HauhauCS’s approach derives from Heretic’s methodology in some form.
Second, HauhauCS carries a LoRA fingerprint. Exactly 253 weight tensors are modified, matching the count from a standard LoRA fine-tuning config targeting all 7 linear projections across 36 layers plus embeddings. Of those 253, only ~50 carry real edits. The remaining 203 are noise from near-zero LoRA adapters that got baked in during the merge process.
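Counting that fingerprint is straightforward: diff the two state dicts, count tensors with any nonzero change, then separate the ones with meaningful magnitude from the near-zero noise. A sketch with arbitrary thresholds:

```python
import torch

def fingerprint(base_sd: dict, ablit_sd: dict,
                noise_eps: float = 1e-6, real_eps: float = 1e-3):
    """Returns (modified, real): how many tensors changed at all, and how many
    carry edits above a 'real' relative-magnitude threshold. On the Qwen3-4B,
    modified comes out at 253 (7 linear projections x 36 layers + embeddings)."""
    modified, real = 0, 0
    for name, w in base_sd.items():
        if name not in ablit_sd:
            continue
        rel = (ablit_sd[name].float() - w.float()).norm() / (w.float().norm() + 1e-12)
        if rel > noise_eps:
            modified += 1
        if rel > real_eps:
            real += 1
    return modified, real
```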
Cross-Model Takeaways
The “lossless” claim does not hold
HauhauCS’s TruthfulQA loss scales with model size. 2.17 points on the 2B, 3.67 on the 4B, 8.0 on the 9B, and 8.2 on the 27B. GSM8K, ARC-Challenge, and Lambada also take hits. On the 2B the losses are small enough to debate. On the 27B they are not. The “lossless” marketing is contradicted by the data at every model size above 2B.
HauhauCS’s zero-refusal claim does not hold
HauhauCS claims zero refusals across their abliterated models. The data shows otherwise. On the 2B, HauhauCS retains 3 soft refusals and produced a hard refusal at lower token budgets. On the 4B, it retains 2 truncated refusals. Only on the 9B, 27B, and Qwen3-4B does HauhauCS achieve genuine zero refusals. The soft refusals are not standard safety refusals. They are subtle: the model misinterprets harmful requests as benign ones, reframes dangerous topics into safety guides, or argues the opposite position. These fly under the radar of simple refusal classifiers but represent residual safety behaviour nonetheless.
Bigger models suffer more collateral damage
There is a clear scaling trend. As model size increases, abliteration causes progressively more damage to capabilities. The 2B is barely affected. The 27B loses substantial ground. The 4B is where Huihui catastrophically breaks. This is consistent across all three techniques.
Huihui is inconsistent across models
On the 2B, Huihui is competitive. On the 4B, it destroys the model with KL of 3.65. On the 9B, it achieves perfect 100% ASR. On the 27B, it fails to remove safety behaviour at all at 88.8%. On the pure Transformer Qwen3-4B, it manages only 95.5%. The technique works on some models and fails badly on others with no clear predictor of which.
Heretic is the most consistent performer
Heretic takes a surgical approach with the fewest modified internal weights on every model. It achieves best or near-best capability retention across all five models. On the 27B it is the clear winner, with the lowest KL and the only GSM8K improvement, up 7.7 points. The tradeoff is that it sometimes retains a few more soft refusals than the other techniques. If you want the safest bet across model sizes, Heretic is the pick. Just remember that Heretic is non-deterministic: the specific variant you download matters.
HauhauCS is the broadest modifier
Most modified weights, most weight types, broadest layer coverage on every model. On smaller models this produces the lowest KL divergence because the many tiny edits average out. On larger models the broad footprint causes more collateral damage. On the Qwen3-4B pure Transformer, the real edits match Heretic’s almost exactly at similarity of 0.966, suggesting a shared methodology origin.
Architecture changes the abliteration landscape
The hybrid Mamba2+Transformer architecture introduces dynamics not seen in pure Transformers. HauhauCS targets components on the hybrid models that do not exist in standard Transformers. Edit vector overlap between techniques varies dramatically across architectures. On the 9B, Heretic and Huihui show 100% alignment in their changes. On the 27B, the same pair shows 0%. Architecture matters more than anyone expected.
Base model safety scales with size
The 2B refuses 63% of HarmBench items. The 4B refuses 69.5%. The 9B refuses 80.3%. The 27B refuses 99.5%. Despite the 27B having the strongest alignment of any model tested, abliteration still removes nearly all safety behaviour for Heretic and HauhauCS. Scale alone does not protect against abliteration. But it does expose Huihui’s limitations.
Resources
- HauhauCS Safetensor Benchmarks Collection on HuggingFace
- Qwen3.5-2B full analysis
- Qwen3.5-4B full analysis
- Qwen3.5-9B full analysis
- Qwen3.5-27B full analysis
- Qwen3-4B full analysis
- Reddit discussion on r/LocalLLaMA
- ungguf: GGUF to safetensors conversion tool
- Heretic: abliteration tool by p-e-w
Related posts: Abliterating Gemma 3 12B for LTX-2 | Heretic Docker pipeline