URL: https://huggingface.co/prithivMLmods/gemma-4-12B-it-heretic_decensored

gemma-4-12B-it-heretic_decensored

gemma-4-12B-it-heretic_decensored is a reasoning-capable language model built on top of google/gemma-4-12B-it and modified using the Heretic abliteration toolkit. The model applies refusal-direction analysis and targeted weight-space interventions to reduce internal refusal behaviors while preserving instruction-following, reasoning capabilities, and general conversational performance.

This model is intended strictly for research and learning purposes. Due to reduced internal refusal mechanisms, it may generate sensitive or unrestricted content. Users assume full responsibility for how the model is used. The authors and hosting platform disclaim any liability for generated outputs.

This model is experimental and may exhibit unexpected behaviors or produce artifacts in certain scenarios.

Key Highlights

  • Heretic-Based Abliteration: Modified using the Heretic toolkit to identify and alter refusal-related representations within the model.
  • Reduced Refusal Behavior: Tuned to minimize refusals while maintaining instruction-following capabilities; reported refusals fall from 99/100 for the original model to 34/100.
  • Gemma 4 Backbone: Built directly on top of google/gemma-4-12B-it.
  • Reasoning-Oriented Performance: Aims to preserve multi-step reasoning and analytical capabilities after abliteration; the reported KL divergence from the original model is 0.0366.
  • Research-Focused Release: Designed for alignment research, model behavior analysis, and evaluation of refusal-direction modifications.
  • 12B Scale Deployment: Suitable for local inference, research environments, and optimized deployment setups.
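The "targeted weight-space intervention" mentioned above is commonly implemented as directional ablation: a unit vector r (the refusal direction) is projected out of a matrix W that writes into the residual stream, giving W' = W - w · r (rᵀ W). The following is a minimal pure-Python sketch of that general technique, not Heretic's actual code; the function names, the toy matrix, and the toy direction are all illustrative assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ablate_direction(W, r, weight=1.0):
    """Return W' = W - weight * r (r^T W).

    W is a list of rows (d_out x d_in) whose outputs live in the
    residual stream; r is a direction of length d_out.
    """
    r = normalize(r)
    d_out, d_in = len(W), len(W[0])
    # r^T W: how strongly each input column writes along r.
    proj = [sum(r[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    return [
        [W[i][j] - weight * r[i] * proj[j] for j in range(d_in)]
        for i in range(d_out)
    ]

def matvec(W, x):
    """Multiply matrix W by vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

# Toy example (made-up numbers): remove the second output axis.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = [0.0, 1.0, 0.0]
W_abl = ablate_direction(W, r)

y = matvec(W_abl, [0.5, -1.0])
component = sum(ri * yi for ri, yi in zip(normalize(r), y))
print(component)  # component of the output along r after full ablation
```

With weight=1.0 the output has no component along r for any input. The parameter table in this card lists weights other than 1, which would scale that component rather than simply remove it.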

Abliteration Parameters

Parameter                            Value
direction_index                      29.56
attn.o_proj.max_weight               1.18
attn.o_proj.max_weight_position      39.94
attn.o_proj.min_weight               0.81
attn.o_proj.min_weight_distance      25.73
mlp.down_proj.max_weight             1.37
mlp.down_proj.max_weight_position    46.27
mlp.down_proj.min_weight             0.97
mlp.down_proj.min_weight_distance    21.63
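The card does not define these parameters. One plausible reading, stated here as an assumption rather than as Heretic's documented behavior, is that each component gets a per-layer ablation weight that peaks at max_weight_position and falls off linearly to min_weight at min_weight_distance layers away, with layers beyond that left untouched. A fractional direction_index would then blend the refusal directions of two adjacent layers. The sketch below illustrates that interpretation; both function names are hypothetical.

```python
import math

def layer_weight(layer_index, max_weight, max_weight_position,
                 min_weight, min_weight_distance):
    """Assumed kernel: linear falloff from max_weight to min_weight."""
    distance = abs(layer_index - max_weight_position)
    if distance > min_weight_distance:
        return 0.0  # outside the kernel: layer is not modified
    slope = (min_weight - max_weight) / min_weight_distance
    return max_weight + distance * slope

def interpolate_direction(directions, direction_index):
    """Assumed meaning of a fractional index: blend two adjacent directions.

    Requires int(direction_index) + 1 to be a valid index into directions.
    """
    frac, whole = math.modf(direction_index)
    lo, hi = directions[int(whole)], directions[int(whole) + 1]
    blended = [(1 - frac) * a + frac * b for a, b in zip(lo, hi)]
    norm = math.sqrt(sum(x * x for x in blended))
    return [x / norm for x in blended]

# Values copied from the table above.
O_PROJ = dict(max_weight=1.18, max_weight_position=39.94,
              min_weight=0.81, min_weight_distance=25.73)

for layer in (10, 20, 30, 40):
    print(layer, round(layer_weight(layer, **O_PROJ), 3))
```

Under this reading, direction_index 29.56 would mix the directions at indices 29 and 30 with weights 0.44 and 0.56 before renormalizing.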

Performance

Metric           This model    Original model (google/gemma-4-12B-it)
KL divergence    0.0366        0 (by definition)
Refusals         34/100        99/100
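To make the two metrics concrete, here is a small sketch of how a KL divergence between two discrete distributions is computed, and why the original model scores exactly 0 against itself. The probability vectors are made up for illustration. The card does not state which prompts, token positions, or KL direction were used for the reported 0.0366, so this shows only the formula.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats for two distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up next-token probabilities over a three-token vocabulary.
original = [0.70, 0.20, 0.10]
modified = [0.60, 0.25, 0.15]

kl_self = kl_divergence(original, original)   # identical distributions
kl_shift = kl_divergence(original, modified)  # small positive value
print(kl_self, kl_shift)

# Refusal counts taken from the table above.
refusals_before, refusals_after, prompts = 99, 34, 100
relative_reduction = (refusals_before - refusals_after) / refusals_before
print(f"{relative_reduction:.1%} fewer refusals")
```

A low KL divergence suggests the modified model's output distribution stays close to the original's on the prompts measured; it does not by itself prove that reasoning quality is unchanged.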

Quick Start with Transformers

pip install transformers
pip install accelerate

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "prithivMLmods/gemma-4-12B-it-heretic_decensored"

# Load the weights in the checkpoint's own dtype and let accelerate
# place them on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {
        "role": "user",
        "content": "Explain how a transformer model processes text.",
    }
]

# Build the chat prompt; return_dict=True returns input_ids together
# with the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
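Before loading the checkpoint, it can help to estimate how much memory the weights alone will need. The page lists the model as 12B parameters stored in BF16 (2 bytes per parameter). The sketch below is a back-of-the-envelope calculation under those two figures; the helper name is hypothetical, and the estimate ignores the KV cache, activations, and framework overhead.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory for model weights only, in GiB."""
    return num_params * bytes_per_param / 2**30

# 12B parameters is the nominal size shown on the page.
bf16_gib = weight_memory_gib(12e9, 2)
print(f"BF16 weights: about {bf16_gib:.1f} GiB")
```

If that exceeds a single device's memory, device_map="auto" can spread the layers across several devices or offload some of them, at the cost of slower inference.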

GGUF Model Files

Intended Use

  • Alignment Research: Studying refusal-direction analysis and behavior modification techniques.
  • Model Evaluation: Benchmarking reasoning, instruction-following, and safety-related behaviors.
  • Red Teaming: Analyzing model responses under reduced-refusal conditions.
  • Local Deployment: Running high-capacity Gemma 4 models in research and experimentation environments.
  • Abliteration Studies: Exploring the effects of targeted weight-space modifications on model behavior.
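For the evaluation and red-teaming uses listed above, a refusal count like the "34/100" figure can be approximated by scanning responses for typical refusal phrases. The sketch below shows that idea only. The marker list and the sample responses are hypothetical, and the card does not say how its own refusal counts were classified.

```python
# Hypothetical refusal markers; a real evaluation would need a vetted list
# or a classifier, since keyword matching misses paraphrased refusals.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")

def is_refusal(response, markers=REFUSAL_MARKERS):
    """Return True if the response contains any refusal marker."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in markers)

def refusal_rate(responses):
    """Return (number of refusals, number of responses)."""
    refused = sum(is_refusal(r) for r in responses)
    return refused, len(responses)

# Made-up responses for illustration.
sample = [
    "I'm sorry, but I can't help with that request.",
    "Sure. A transformer first splits the text into tokens.",
    "I cannot assist with this.",
    "Here is an outline of the main steps.",
]
refused, total = refusal_rate(sample)
print(f"Refusals: {refused}/{total}")
```

Running the same prompts through both the original and the modified model and comparing the two counts gives a before/after figure in the same form as the Performance table.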

Limitations & Risks

Important Note: This model intentionally reduces built-in refusal mechanisms.

  • Sensitive Content Risk: May generate unrestricted, controversial, or unsafe outputs.
  • User Responsibility: Requires careful and ethical use.
  • Experimental Modifications: Behavior may differ significantly from the original model.
  • Alignment Trade-offs: Reduced refusal behavior may impact safety filtering and response constraints.
  • Potential Artifacts: Certain prompts may expose unexpected outputs resulting from the abliteration process.

Acknowledgements

  • Heretic: Fully automatic censorship removal framework for language models. This project was used to perform the refusal-direction analysis and ablation procedures that form the foundation of this model.

  • Model Trials & Evaluation: Experimental evaluations, refusal measurements, and optimization trials were conducted and documented at: https://huggingface.co/strangeropshf/demo-TERM-hf-job-01

Format: Safetensors · Model size: 12B params · Tensor type: BF16