GLM-4.7-Flash is a 59 billion parameter reasoning model from Zhipu AI that uses 64 specialist expert modules per layer. I ran four different abliteration techniques against it and discovered something unexpected in the maths benchmarks. The raw scores look terrible for some variants, but the models can actually still do the maths. They just overthink and run out of tokens before writing their answer. And the weight forensics on one of those variants led to a bigger story about plagiarism.
GLM-4.7-Flash: A 59B MoE Reasoning Model
Let's start with what makes this model different from the others I have tested.
GLM-4.7-Flash uses a Mixture of Experts architecture. Think of it like a company with 64 specialist departments. For any given task, only 4 departments get called in to work on it. That keeps things fast and efficient. The model has roughly 59 billion parameters total but only activates about 3 billion per token during inference.
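To make the routing concrete, here is a toy NumPy sketch of top-k expert routing. The shapes, random weights, and the `moe_route` helper are all made up for illustration; this is not GLM's implementation, which also involves shared experts and learned load balancing.

```python
import numpy as np

def moe_route(hidden, router_w, experts, top_k=4):
    """Route one token: score all experts, keep the top_k, mix their outputs."""
    logits = hidden @ router_w                  # one score per expert
    top = np.argsort(logits)[-top_k:]           # indices of the top_k experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                # softmax over the chosen experts only
    return sum(wi * experts[i](hidden) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 64
router_w = rng.normal(size=(d, n_experts))
# Each "expert" here is just a random linear map
experts = [lambda x, m=rng.normal(size=(d, d)): x @ m for _ in range(n_experts)]
out = moe_route(rng.normal(size=d), router_w, experts)
print(out.shape)  # (8,)
```

Only 4 of the 64 expert functions are ever called per token, which is where the 59B-total versus roughly 3B-active split comes from.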
It also uses something called Multi-head Latent Attention instead of the standard attention mechanism that most models use. And it is a reasoning model, meaning it has a private chain-of-thought that runs before producing visible output. You ask it a question, it thinks silently, then gives you the answer. That reasoning behaviour turns out to be really important for understanding the benchmark results.
The base model is zai-org/GLM-4.7-Flash on HuggingFace. It has a 128K token context window and was released in January 2026.
What is LLM Abliteration?
For anyone new to this, I covered abliteration in detail in my previous comparison of three abliteration techniques across five Qwen models. The short version: AI models are trained to refuse certain requests. Abliteration finds the “refusal direction” inside the model’s weights and surgically removes it. The result is an uncensored model that should respond to any prompt.
The key question is always whether that surgery damages the model’s intelligence.
This time I tested four techniques on the same base model:
- Heretic by p-e-w, an open source tool that uses Optuna optimisation to find the best abliteration parameters automatically
- HauhauCS Aggressive, which claims to be “the best lossless uncensored” approach
- Huihui, a lighter-touch technique with broad weight coverage
- Abliterix, a variant built on Heretic's method that adds router and shared expert targeting
The full forensic analysis, including all methodology details, is on the GLM-4.7-Flash HuggingFace model card.
Abliteration Benchmark Results
Seven of the benchmark tasks were run via vLLM with BitsAndBytes 4-bit quantisation on dual GPUs, an RTX 5090 plus an RTX 4090. GSM8K was run via llama.cpp with a BF16 GGUF.
| Task | Base | Heretic | HauhauCS | Huihui | Abliterix |
|---|---|---|---|---|---|
| MMLU | 68.93 | 69.00 | 68.83 | 68.71 | 67.68 |
| GSM8K | 93.45 | 93.75 | 92.57 | 92.47 | 93.30 |
| HellaSwag | 79.43 | 79.33 | 79.37 | 79.32 | 78.28 |
| ARC-Challenge | 55.20 | 55.12 | 55.72 | 54.86 | 54.95 |
| WinoGrande | 71.03 | 73.64 | 71.35 | 71.59 | 70.48 |
| TruthfulQA MC2 | 50.86 | 44.06 | 48.14 | 48.48 | 41.76 |
| PiQA | 81.07 | 80.63 | 80.90 | 80.90 | 79.71 |
| Lambada (ppl, lower is better) | 6.00 | 6.08 | 5.54 | 6.47 | 10.91 |
Heretic wins or ties on most tasks. All techniques are within noise of the base model on MMLU, HellaSwag, and PiQA. TruthfulQA takes a hit across the board, which is the expected tradeoff when you remove safety training. The model becomes less truthful because it no longer has the safety guardrails that helped it avoid common misconceptions.
These numbers look reasonable. But they hide something important.
GSM8K Reasoning Efficiency: Why Raw Scores Mislead
This is the big finding. GSM8K tests maths word problems at a middle school level. The numbers above look fine for all variants. The real story is in what those scores do not show you.
Remember that GLM-4.7-Flash is a reasoning model. It thinks before it answers. If it thinks too long and exhausts its token budget, it returns an empty answer. Empty answers get scored as incorrect.
Here are the raw GSM8K scores alongside the adjusted scores with empty responses excluded:
| Model | GSM8K Raw | Empty Rate | GSM8K Adjusted | Real Gap |
|---|---|---|---|---|
| Heretic | 89.16% | 4.9% | 93.75% | +0.30% |
| Base | 88.40% | 5.4% | 93.45% | baseline |
| Huihui | 87.57% | 5.3% | 92.47% | -0.98% |
| HauhauCS | 81.65% | 11.8% | 92.57% | -0.88% |
| Abliterix | 47.38% | 49.2% | 93.30% | -0.15% |
Look at Abliterix. The raw score is 47.38%. That looks like the model lost nearly half its maths ability. But the adjusted score is 93.30%, near-identical to the base model at 93.45%. The reasoning ability is intact. The model just overthinks and runs out of tokens before producing an answer.
And the empty response rate directly correlates with how aggressively each technique modifies the model:
| Technique | Modification Style | Empty Rate |
|---|---|---|
| Heretic | Surgical, expert down_proj only | 4.9% |
| Huihui | Full coverage, all component types | 5.3% |
| HauhauCS | Broad, 3 projections across 31 layers | 11.8% |
| Abliterix | Router and shared expert targeting | 49.2% |
More aggressive editing disrupts the “how long to think” circuit without damaging the “how to reason” circuit. This has major implications for anyone benchmarking abliterated reasoning models. Raw GSM8K scores are misleading. You must separate empty responses from incorrect responses.
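The adjustment itself is simple bookkeeping. Here is a minimal sketch, assuming results arrive as (answer, correct) pairs with empty strings for exhausted-budget runs; the function name and data are hypothetical:

```python
def gsm8k_scores(results):
    """results: list of (answer_text, is_correct); empty answers score wrong in raw."""
    answered = [ok for ans, ok in results if ans.strip()]
    raw = sum(ok for _, ok in results) / len(results)
    empty_rate = 1 - len(answered) / len(results)
    adjusted = sum(answered) / len(answered)
    return raw, empty_rate, adjusted

# Hypothetical mini-run: 8 correct, 1 wrong, 1 empty (budget exhausted)
results = [("42", True)] * 8 + [("7", False), ("", False)]
raw, empty_rate, adjusted = gsm8k_scores(results)
print(raw, round(empty_rate, 2), round(adjusted, 3))  # 0.8 0.1 0.889
```

A harness that only reports the first number conflates "cannot reason" with "ran out of tokens", which is exactly the failure mode the Abliterix row exposes.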
Chain-of-Thought Forensics After Abliteration
GLM-4.7-Flash produces a private chain-of-thought before its visible response. I captured 2,000 reasoning chains across the four variants and the base model during HarmBench evaluation.
The surprising finding is that all four abliterated models still think about safety concerns in 39 to 60% of their responses. They deliberate on harm, legality, and policy before choosing to comply anyway. The safety reasoning patterns persist structurally. Abliteration disconnects the reasoning-to-output pathway rather than removing the reasoning itself.
| Model | Safety Deliberation in CoT | Explicit Refusal Language in CoT | Disclaimers in Output |
|---|---|---|---|
| Huihui | 60.0% | 12.2% | 25.2% |
| Heretic | 59.2% | 7.5% | 30.5% |
| HauhauCS | 52.0% | 18.2% | 16.8% |
| Abliterix | 39.0% | 8.2% | 14.0% |
HauhauCS has the highest rate of explicit refusal language in its private thoughts at 18.2%. In nearly 1 in 5 responses, the model’s reasoning still says “I cannot” before producing compliant output. The model thinks about refusing, then complies anyway.
Abliterix shows the lowest residual safety deliberation at 39.0%. Its router-focused approach more effectively suppresses the activation of safety-related reasoning pathways. It also has the longest reasoning chains on average, suggesting complex internal deliberation when safety pathways partially activate but get overridden.
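To give a flavour of how such chains can be tagged, here is a minimal keyword-based tagger. The marker phrases and example chains below are illustrative stand-ins, not the actual phrase sets used in the analysis:

```python
# Illustrative marker phrases; the study's real phrase sets are not reproduced here.
SAFETY_MARKERS = ["harmful", "illegal", "dangerous", "policy", "ethic"]
REFUSAL_MARKERS = ["i cannot", "i can't", "i won't", "i'm not able"]

def tag_cot(chain):
    """Tag one reasoning chain for safety deliberation and refusal language."""
    text = chain.lower()
    return {
        "safety_deliberation": any(m in text for m in SAFETY_MARKERS),
        "explicit_refusal": any(m in text for m in REFUSAL_MARKERS),
    }

chains = [
    "This request could be illegal, but I'll outline the steps anyway.",
    "I cannot help with harmful content... actually, here is the answer.",
    "Simple arithmetic: 2 + 2 = 4.",
]
tags = [tag_cot(c) for c in chains]
rate = sum(t["safety_deliberation"] for t in tags) / len(tags)
print(round(rate, 2))  # 0.67: two of three chains deliberate on safety
```

Running a tagger like this over all captured chains per variant yields the percentages in the table above.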
HarmBench Safety Evaluation Results
All four techniques achieve perfect 100% Attack Success Rate across every HarmBench category. The base model refuses 57.8% of the 400 harmful prompts. Its refusal profile is concentrated in the most safety-critical categories: harassment and bullying at 96% refused, chemical and biological at 94.6% refused, harmful content at 95.5% refused, and illegal at 90.8% refused.
Despite this moderate safety alignment, abliteration removes all detectable safety behaviour across every variant. The MoE architecture with 64 routed experts per layer does not appear to make safety removal more difficult.
Weight Forensics: Four Abliteration Strategies
The weight analysis reveals four fundamentally different approaches to the same goal:
Heretic is the most surgical. It targets only expert down_proj weights and attention o_proj weights with rank-1 edits. 1,826 tensors modified, concentrated in mid-to-late layers 19 through 46. Every edit lies along a single direction. Clean, precise, minimal footprint.
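A rank-1 directional edit of this kind can be pictured as projecting the refusal direction out of a weight matrix's output space. The following is a generic illustration with random stand-in tensors, not Heretic's actual code:

```python
import numpy as np

def ablate_direction(W, r):
    """Project direction r out of W's output space: W' = (I - r r^T) W.
    The update W - W' = r (r^T W) is rank-1 by construction."""
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r @ W)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))        # stand-in for one down_proj weight
r = rng.normal(size=16)              # stand-in for the extracted refusal direction
W2 = ablate_direction(W, r)

# After the edit, outputs have no component along r, and the edit is rank-1
x = rng.normal(size=32)
r_hat = r / np.linalg.norm(r)
print(abs(r_hat @ (W2 @ x)) < 1e-9)  # True
print(np.linalg.matrix_rank(W - W2))  # 1
```

Every tensor Heretic touches carries an update of this rank-1 form along a single direction, which is what makes its footprint so clean.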
HauhauCS is the broadest. It modifies all three expert projections, down, gate, and up, across 31 layers. Including attention and shared expert modifications, 45 of 48 layers carry significant edits. About 2,029 tensors with real changes.
Huihui has the widest layer coverage at 48 out of 48 layers. It targets expert down_proj, attention, routers, and shared experts across every layer. 3,151 tensors modified but with lower per-tensor edit magnitude.
Abliterix has the smallest footprint at 1,088 tensors but the highest per-tensor magnitude. Router edits at 9.27% relative and attention edits at 10.75% relative are 2 to 5 times larger than its expert edits. It focuses on routing control rather than direct expert modification.
Cross-technique cosine similarities between all four variants are uniformly low at 0.09 to 0.35. Each technique independently found a functionally equivalent but structurally orthogonal solution to safety removal. There is no universal abliteration subspace. The safety circuit can be disrupted from multiple structurally different angles with identical behavioural outcomes.
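The cosine comparison is straightforward: flatten each technique's weight delta against the base and take the angle between them. A sketch with hypothetical random edits:

```python
import numpy as np

def delta_cosine(base, variant_a, variant_b):
    """Cosine similarity between two techniques' edits to the same tensor."""
    da = (variant_a - base).ravel()
    db = (variant_b - base).ravel()
    return float(da @ db / (np.linalg.norm(da) * np.linalg.norm(db)))

rng = np.random.default_rng(1)
base = rng.normal(size=(32, 32))
# Two hypothetical edits along independent random directions
edit_a = base + 0.05 * rng.normal(size=(32, 32))
edit_b = base + 0.05 * rng.normal(size=(32, 32))
print(delta_cosine(base, edit_a, edit_b))  # small: independent high-dimensional edits are nearly orthogonal
```

Similarities in the 0.09 to 0.35 range are close to what unrelated directions produce in a space this large, which is what supports the "no universal abliteration subspace" reading.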
HauhauCS Plagiarism: Reaper-Abliteration is a Fork of Heretic
This is where the story gets bigger than benchmarks.
When I was doing weight forensics on Qwen3-4B, I noticed something suspicious. The specific tensors that HauhauCS modified matched Heretic's modifications with over 97% similarity. Two supposedly independent techniques made nearly identical changes to the same internal weights. That led me to investigate HauhauCS's source code.
A published investigation recovered the deleted source code of HauhauCS's reaper-abliteration tool from PyPI's CDN. Six of the eight known releases were recovered and verified byte-for-byte via SHA-256. The analysis concluded that reaper-abliteration is a fork of Heretic that was surface-refactored, likely using an LLM, to disguise its origins. All original copyright notices were stripped and the code was relicensed to PolyForm Noncommercial.
The evidence is extensive: seven of seven module filenames are preserved identically from Heretic v1.2.0; 30 of 32 refusal markers are character-for-character identical, including the same typos, "i an ai" missing the “m” and "i can'" missing the “t”; over 30 function and class names are shared; and the Optuna parameter bounds are identical.
Heretic’s creator, Philipp Emanuel Weidmann, reviewed the recovered code and stated: “I can say with certainty that this package was plagiarised from Heretic, and then probably refactored using an LLM in an attempt to hide this.”
On GLM-4.7-Flash specifically, the forensic signatures in the weights indicate four stacked methods from the reaper tool: LEACE concept erasure, rank-k multi-direction ablation, hook-based expert ablation, and shared expert targeting. This is the most complex method combination detected across any model in the test suite.
The full details are in the code analysis and the Reddit discussion on r/LocalLLaMA.
Which Abliteration Technique is Best?
For GLM-4.7-Flash specifically, Heretic is the clear winner on capability preservation. It is the only variant that genuinely improves maths reasoning, gaining 0.76 points on the raw GSM8K score over the base model. The surgical approach with the fewest modified weights preserves capability best. The cost is a 6.80-point drop on TruthfulQA, the expected side effect of removing safety training.
HauhauCS has the worst raw GSM8K at -6.75% but the adjusted gap is only -0.88%. The “lossless” claim does not hold. The model’s reasoning efficiency is measurably degraded.
Abliterix is interesting for research purposes. It proves that router targeting can achieve 100% ASR with minimal weight changes, but catastrophically disrupts reasoning efficiency at 49.2% empty responses. The reasoning circuit is intact. Only the “how long to think” circuit is affected.
One important note on Heretic: it is non-deterministic. Different runs on the same base model produce different results, so the benchmarks here are specific to the particular variant tested.
Resources
- Full forensic analysis on HuggingFace
- Heretic variant by trohrbaugh
- HauhauCS Aggressive variant
- Huihui variant
- Abliterix router and shared expert targeting tool
- Heretic open source abliteration tool
- Abliterlitics forensics toolkit
- ungguf GGUF to safetensor conversion
- HauhauCS plagiarism investigation
- Reddit HauhauCS plagiarism discussion
- Reddit GLM-4.7 comparison discussion
- Base model zai-org/GLM-4.7-Flash
Related posts: Uncensored LLM Abliteration Benchmarked: HauhauCS vs Heretic vs Huihui | Abliterating Gemma 3 12B for LTX-2 | Heretic Docker pipeline