A while back I did a deep dive into abliterating Gemma 3 12B for use as an uncensored text encoder in LTX-2 video generation. The process worked but involved a lot of manual steps: running Heretic, merging safetensor shards, converting to ComfyUI format, quantizing to FP8, building GGUF quants. I got tired of doing it all by hand, so I built heretic-docker to automate the whole thing.
## Why Docker?
The original workflow from my previous post had you pip installing Heretic, manually running Python scripts to merge and convert weights, building llama.cpp from source for GGUF conversion, and doing FP8 quantization with yet another script. It worked, but every time I wanted to abliterate a new model I was copy-pasting commands and hoping I didn’t miss a step.
heretic-docker wraps the entire pipeline into two commands:
```shell
# Abliterate - interactive, you pick the trial
./heretic.sh abliterate google/gemma-3-12b-it

# Convert to all formats
./heretic.sh convert /output/hf-model my-model-name
```
The convert step handles everything: merging shards, converting to ComfyUI format with vision weights preserved, FP8 quantization, NVFP4 quantization, and GGUF conversion with multiple quant levels. One command, all the outputs.
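To give a feel for what the shard-merge step does, here's a toy sketch using plain Python dicts (the real step reads safetensors files; `merge_shards` and the key names are my own illustration, not heretic-docker's actual code):

```python
# Hypothetical sketch of the shard-merge step. Sharded checkpoints ship an
# index mapping each tensor name to the shard file that holds it; merging
# just walks that index and pulls every tensor into one state dict.

def merge_shards(index, shards):
    """Combine per-shard tensor dicts into a single state dict."""
    return {name: shards[shard_file][name] for name, shard_file in index.items()}

index = {
    "layer.0.weight": "model-00001.safetensors",
    "layer.1.weight": "model-00002.safetensors",
}
shards = {
    "model-00001.safetensors": {"layer.0.weight": [1.0]},
    "model-00002.safetensors": {"layer.1.weight": [2.0]},
}
print(sorted(merge_shards(index, shards)))
```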
The container is built on NVIDIA's NGC PyTorch base image, which ships CUDA kernels for Blackwell RTX 50-series GPUs. It also includes patches for SDPA attention and a bitsandbytes stub, since there's no CUDA 13.1 bitsandbytes binary yet. The patches are transparent, though; everything works fine on older GPUs like the RTX 4090 too.
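The stub idea is conceptually simple: register a placeholder module so imports succeed, and fail loudly only if something actually tries to use it. A minimal sketch of the concept (my own illustration; heretic-docker's actual patch may differ):

```python
# Minimal sketch of a bitsandbytes stub: imports succeed, real use raises.
import sys
import types

def _unavailable(*args, **kwargs):
    raise RuntimeError("bitsandbytes is stubbed out: no CUDA 13.1 binary")

stub = types.ModuleType("bitsandbytes")
stub.__getattr__ = lambda name: _unavailable  # PEP 562 module __getattr__
sys.modules["bitsandbytes"] = stub

import bitsandbytes  # now succeeds instead of failing to install
print(bitsandbytes.__name__)
```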
## The Models
So far I’ve used heretic-docker to produce three abliterated models for different generation pipelines:
| Model | Base | Trials | Refusals | KL Divergence | Use Case |
|---|---|---|---|---|---|
| Gemma 3 12B Heretic v2 | google/gemma-3-12b-it | 200 | 8/100 | 0.0801 | LTX-2 / LTX 2.3 video |
| Gemma 3 12B Heretic v1 | google/gemma-3-12b-it | 100 | 7/100 | 0.0826 | LTX-2 / LTX 2.3 video |
| Qwen 3 4B Heretic | Qwen/Qwen3-4B | 200 | 3/100 | 0.0000 | Z-Image / FLUX.2 Klein 4B |
For anyone unfamiliar, abliteration is the process of removing the refusal behavior from a model. It works by finding and suppressing the internal directions that cause a model to refuse certain prompts. Heretic automates this with an optimization process that tries hundreds of trials and presents you with the best tradeoffs between fewer refusals and less model damage, measured by KL divergence.
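The core math is a projection: remove the component of the weights that writes along the refusal direction. A toy single-matrix version (Heretic searches many directions and layers with per-layer scaling; this just shows the idea):

```python
# Toy sketch of directional ablation: project the "refusal direction" out of
# a weight matrix so the model can no longer write along it.
import numpy as np

def ablate(W, v):
    """Remove the component of W's output along direction v: (I - v v^T) W."""
    v = v / np.linalg.norm(v)
    return W - np.outer(v, v) @ W

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
v = rng.normal(size=4)
W_abl = ablate(W, v)

# The ablated matrix's output has no component along v:
print(np.allclose((v / np.linalg.norm(v)) @ W_abl, 0.0))  # True
```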
### Gemma 3 12B Heretic v2
This is the one to use for LTX-2 and the newer LTX 2.3. Both use Gemma 3 12B as their text encoder, so the same abliterated model works as a drop-in replacement for either. Compared to v1, which used Heretic v1.1.0 with 100 trials, v2 was abliterated with Heretic v1.2.0 and ran 200 trials. The big improvement is that vision weights are now preserved in the ComfyUI format: the vision_model and multi_modal_projector keys are kept intact. That means it works with image-to-video workflows using TextGenerateLTX2Prompt with an image input, which v1 couldn't do.
It also comes with an NVFP4 variant at ~7.8GB, roughly 3x smaller than the bf16 version. More on that below.
HuggingFace: DreamFast/gemma-3-12b-it-heretic-v2
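The vision-weight preservation boils down to not dropping keys by prefix during conversion. A sketch of the idea (key names below are illustrative stand-ins for the real checkpoint layout):

```python
# Sketch of keeping vision weights during ComfyUI conversion: instead of
# dropping everything under these prefixes (as v1 effectively did), v2
# carries them through so image-to-video workflows still work.

VISION_PREFIXES = ("vision_model.", "multi_modal_projector.")

def split_keys(state_dict_keys):
    """Separate text-encoder keys from vision keys by prefix."""
    vision = [k for k in state_dict_keys if k.startswith(VISION_PREFIXES)]
    text = [k for k in state_dict_keys if not k.startswith(VISION_PREFIXES)]
    return text, vision

keys = [
    "language_model.layers.0.mlp.weight",
    "vision_model.encoder.layers.0.attn.weight",
    "multi_modal_projector.mm_input_projection",
]
text, vision = split_keys(keys)
print(len(vision))  # 2 -- these keys are preserved, not discarded
```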
### Gemma 3 12B Heretic v1
This was the original from my previous post. It's still available, but v2 is better in every way: more trials, vision support, and NVFP4. If you're using v1, it's worth upgrading.
HuggingFace: DreamFast/gemma-3-12b-it-heretic
### Qwen 3 4B Heretic
This one is for image generation rather than video. Z-Image and FLUX.2 Klein 4B both use Qwen 3 4B as their text encoder, so I ran it through the same pipeline.
The interesting result here is that Qwen achieved zero measurable KL divergence. The abliteration was essentially surgical with no detectable damage to the model. Trial 96 gave us only 3/100 refusals with a KL of 0.0000. Compare that to Gemma where even the best trials had KL around 0.08. Different architectures clearly respond to abliteration very differently.
HuggingFace: DreamFast/qwen3-4b-heretic
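For reference, the damage metric is just KL divergence between the original and abliterated models' output distributions. A minimal version of the formula (Heretic's exact evaluation set and averaging differ; this only shows the math, and a KL of 0.0 is what "no detectable damage" means):

```python
# KL(p || q) for two discrete next-token distributions over the same vocab.
import math

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]            # original model's distribution
q_identical = [0.7, 0.2, 0.1]  # a perfectly preserved model
q_shifted = [0.5, 0.3, 0.2]    # a damaged model drifts away from p

print(kl_divergence(p, q_identical))  # 0.0
print(kl_divergence(p, q_shifted) > 0)  # True
```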
## NVFP4 and Blackwell
The v2 Gemma and Qwen models both include NVFP4 quantized variants. NVFP4 is a 4-bit floating-point format supported natively by ComfyUI: no plugins needed, it just loads. On Blackwell GPUs like the RTX 5090 and RTX 5080 it uses the native FP4 tensor cores for the best performance, but ComfyUI also supports software dequantization on older GPUs. I've tested the NVFP4 variants on an RTX 4090 and they work fine.
For Gemma 3 12B, the NVFP4 file is ~7.8GB compared to ~23GB for bf16. For Qwen 3 4B it's ~2.6GB compared to ~7.5GB. Pretty significant savings if you're tight on VRAM.
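The sizes line up with simple arithmetic (the per-block scale overhead below is my assumption about NVFP4's layout, an 8-bit scale per 16-element block):

```python
# Back-of-the-envelope check on the file sizes for a 12B-parameter model.
params = 12e9
bf16_gb = params * 2 / 1e9                 # 2 bytes/param -> ~24 GB
nvfp4_gb = params * (0.5 + 1 / 16) / 1e9   # 4 bits + 8-bit scale per 16 vals

print(round(bf16_gb), round(nvfp4_gb, 2))
# The actual ~7.8GB file is somewhat larger than this estimate, which is
# consistent with some tensors (e.g. embeddings) staying at higher precision.
```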
## Resources
- heretic-docker - The Docker pipeline
- Heretic - The abliteration tool by p-e-w
- DreamFast/gemma-3-12b-it-heretic-v2 - Gemma 3 12B abliterated, v2 recommended
- DreamFast/gemma-3-12b-it-heretic - Gemma 3 12B abliterated v1
- DreamFast/qwen3-4b-heretic - Qwen 3 4B abliterated
- LTX-2 - Lightricks’ video generation model
- Z-Image - Alibaba Tongyi Lab's image generation model
- ComfyUI-LTX2-MultiGPU - Multi-GPU workflows for LTX-2
