gemma-4-12B-it-heretic_decensored
gemma-4-12B-it-heretic_decensored is a reasoning-capable language model built on top of google/gemma-4-12B-it and modified using the Heretic abliteration toolkit. The model applies refusal-direction analysis and targeted weight-space interventions to reduce internal refusal behaviors while preserving instruction-following, reasoning capabilities, and general conversational performance.
This model is intended strictly for research and learning purposes. Due to reduced internal refusal mechanisms, it may generate sensitive or unrestricted content. Users assume full responsibility for how the model is used. The authors and hosting platform disclaim any liability for generated outputs.
This model is experimental and may exhibit unexpected behaviors or produce artifacts in certain scenarios.
Key Highlights
- Heretic-Based Abliteration: Modified using the Heretic toolkit to identify and alter refusal-related representations within the model.
- Reduced Refusal Behavior: Optimized to minimize internal refusal tendencies while maintaining instruction-following capabilities.
- Gemma 4 Backbone: Built directly on top of google/gemma-4-12B-it.
- Reasoning-Oriented Performance: Preserves multi-step reasoning and analytical capabilities after abliteration.
- Research-Focused Release: Designed for alignment research, model behavior analysis, and evaluation of refusal-direction modifications.
- 12B Scale Deployment: Suitable for local inference, research environments, and optimized deployment setups.
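The refusal-direction idea mentioned above can be illustrated with a small, dependency-free sketch. It assumes the common "difference of means" formulation: a refusal direction is estimated from activations on harmful versus harmless prompts, and a weight matrix that writes into the residual stream is then projected away from that direction. The function names, the toy activations, and the `strength` parameter are illustrative assumptions, not Heretic's actual API or the exact procedure used for this model.

```python
import math

def refusal_direction(harmful, harmless):
    """Unit vector pointing from the harmless mean activation to the harmful mean."""
    dim = len(harmful[0])
    mean_harmful = [sum(v[i] for v in harmful) / len(harmful) for i in range(dim)]
    mean_harmless = [sum(v[i] for v in harmless) / len(harmless) for i in range(dim)]
    diff = [a - b for a, b in zip(mean_harmful, mean_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate(weight, direction, strength=1.0):
    """Return W - strength * r (r^T W).

    weight is a list of rows; row i produces output dimension i.
    direction (r) is a unit vector in the output space.
    """
    rows, cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along the direction.
    proj = [sum(direction[i] * weight[i][j] for i in range(rows)) for j in range(cols)]
    return [[weight[i][j] - strength * direction[i] * proj[j] for j in range(cols)]
            for i in range(rows)]

# Toy data (hypothetical): the two prompt groups differ only along the first axis.
harmful = [[2.0, 0.0], [4.0, 0.0]]
harmless = [[0.0, 0.0], [0.0, 0.0]]
r = refusal_direction(harmful, harmless)

W = [[1.0, 2.0], [3.0, 4.0]]
W_ablated = ablate(W, r)
print(r, W_ablated)
```

With `strength=1.0` the matrix can no longer write anything along `r`. Values other than 1.0 scale the intervention up or down.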
Abliteration Parameters
| Parameter | Value |
|---|---|
| direction_index | 29.56 |
| attn.o_proj.max_weight | 1.18 |
| attn.o_proj.max_weight_position | 39.94 |
| attn.o_proj.min_weight | 0.81 |
| attn.o_proj.min_weight_distance | 25.73 |
| mlp.down_proj.max_weight | 1.37 |
| mlp.down_proj.max_weight_position | 46.27 |
| mlp.down_proj.min_weight | 0.97 |
| mlp.down_proj.min_weight_distance | 21.63 |
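One way to read the four per-component parameters is as a piecewise-linear weight schedule over layers: the ablation weight peaks at `max_weight_position`, falls off linearly to `min_weight` at `min_weight_distance` layers away, and layers beyond that distance are left untouched. The sketch below works under that assumption. The `layer_weight` helper and the sampled layer indices are hypothetical, and the card does not state the model's layer count.

```python
def layer_weight(layer_index, max_weight, max_weight_position,
                 min_weight, min_weight_distance):
    """Ablation weight for one layer under an assumed linear falloff."""
    distance = abs(layer_index - max_weight_position)
    if distance > min_weight_distance:
        # Outside the window: assume this layer is not modified at all.
        return 0.0
    # Linear interpolation from max_weight (distance 0) to min_weight (edge).
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)

# Values copied from the table above for attn.o_proj.
ATTN_O_PROJ = dict(
    max_weight=1.18,
    max_weight_position=39.94,
    min_weight=0.81,
    min_weight_distance=25.73,
)

# Sample a few layer indices (chosen arbitrarily for illustration).
for layer in (10, 20, 30, 40):
    print(layer, round(layer_weight(layer, **ATTN_O_PROJ), 3))
```

Under this reading, fractional positions such as 39.94 simply place the peak between two integer layer indices.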
Performance
| Metric | This model | Original model (google/gemma-4-12B-it) |
|---|---|---|
| KL divergence | 0.0366 | 0 (by definition) |
| Refusals | 34/100 | 99/100 |
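For readers unfamiliar with the two metrics, the sketch below shows how such numbers could be computed in principle. It assumes KL divergence is taken between next-token probability distributions of the two models and that refusals are counted by simple substring matching. The marker list, the function names, and the direction of the KL term are assumptions for illustration. The card does not specify the prompts, token positions, or refusal classifier actually used.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical refusal markers; a real evaluation would use a longer, tuned list.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")

def refusal_count(responses, markers=REFUSAL_MARKERS):
    """Number of responses containing at least one refusal marker."""
    return sum(any(m in r.lower() for m in markers) for r in responses)

# Toy next-token distributions over a two-token vocabulary.
original = [0.5, 0.5]
modified = [0.9, 0.1]
print(kl_divergence(original, modified))

responses = ["I'm sorry, but I can't help with that.", "Sure, here is an overview."]
print(refusal_count(responses), "of", len(responses))
```

A model compared against itself gives a KL divergence of exactly 0, which is why the original model's entry in the table is 0 by definition.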
Quick Start with Transformers
pip install transformers accelerate torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "prithivMLmods/gemma-4-12B-it-heretic_decensored"

# Load the weights in their stored dtype and place them on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": "Explain how a transformer model processes text."
    }
]

# return_dict=True returns input_ids and attention_mask together.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Gradients are not needed for generation.
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
GGUF Model Files
| Resource | Link |
|---|---|
| prithivMLmods/gemma-4-12B-it-heretic_decensored-GGUF | https://huggingface.co/prithivMLmods/gemma-4-12B-it-heretic_decensored-GGUF |
| Quick Start with llama.cpp (Docker) | https://huggingface.co/prithivMLmods/gemma-4-12B-it-heretic_decensored-GGUF#quick-start-with-llamacpp-docker |
Intended Use
- Alignment Research: Studying refusal-direction analysis and behavior modification techniques.
- Model Evaluation: Benchmarking reasoning, instruction-following, and safety-related behaviors.
- Red Teaming: Analyzing model responses under reduced-refusal conditions.
- Local Deployment: Running high-capacity Gemma 4 models in research and experimentation environments.
- Abliteration Studies: Exploring the effects of targeted weight-space modifications on model behavior.
Limitations & Risks
Important Note: This model intentionally reduces built-in refusal mechanisms.
- Sensitive Content Risk: May generate unrestricted, controversial, or unsafe outputs.
- User Responsibility: Requires careful and ethical use.
- Experimental Modifications: Behavior may differ significantly from the original model.
- Alignment Trade-offs: Reduced refusal behavior may impact safety filtering and response constraints.
- Potential Artifacts: Certain prompts may expose unexpected outputs resulting from the abliteration process.
Acknowledgements
Heretic: Fully automatic censorship removal framework for language models. This project was used to perform the refusal-direction analysis and ablation procedures that form the foundation of this model.
Model Trials & Evaluation: Experimental evaluations, refusal measurements, and optimization trials were conducted and documented at: https://huggingface.co/strangeropshf/demo-TERM-hf-job-01
