VOOZH about

URL: https://huggingface.co/PishangShedappp/gemma-4-12B-it-claude-opus-4.8-max-thinking-mxfp8-mlx

⇱ PishangShedappp/gemma-4-12B-it-claude-opus-4.8-max-thinking-mxfp8-mlx · Hugging Face


Gemma-4-12B-it Claude Opus Max Thinking (MLX)

This model is a highly optimized, locally-trained version of Google's Gemma-4-12B-it, fine-tuned specifically to emulate the advanced step-by-step reasoning and "Max Thinking" cognitive architecture of Claude Opus.

Because the base Gemma 4 model utilizes the new gemma4_unified multi-modal architecture, this model requires the Apple MLX Vision-Language (mlx-vlm) framework to run, even for purely text-based reasoning tasks. It was fine-tuned entirely on Apple Silicon and is fused natively for macOS environments.

🧠 Model Description

  • Base Model: google/gemma-4-12B-it
  • Dataset: 11-47/claude_opus_4.8_max_thinking_5k_v2
  • Framework: Apple MLX Vision-Language (mlx_vlm)
  • Quantization: 8-bit Microscaling Float (mxfp8)
  • Hardware: Trained natively on Apple Silicon (Unified Memory Architecture)

This model was trained using a custom LoRA adapter focusing purely on the output reasoning structures (--train-on-completions enabled) to prevent prompt-memorization and ensure robust zero-shot logical deduction.

🚀 How to Use (Apple Silicon)

This model is fully fused and ready for immediate deployment on Macs using the mlx-vlm package.

1. Install Dependencies

Make sure you have the multi-modal MLX ecosystem installed:

pip install mlx-vlm

2. Run via Command Line Interface (CLI)

You can prompt the model interactively straight from your terminal:

python3 -m mlx_vlm.generate \
 --model PishangShedappp/gemma-4-12B-it-claude-opus-4.8-max-thinking \
 --max-tokens 1000 \
 --prompt "Explain why quantum computing is different from classical computing step-by-step."

3. Run via Python API

If you want to integrate this reasoning model into your own macOS applications or backend scripts:

from mlx_vlm import load, generate

# Load the fused model and processor
model, processor = load("PishangShedappp/gemma-4-12B-it-claude-opus-4.8-max-thinking")

# Format your prompt
prompt = "Explain why quantum computing is different from classical computing step-by-step."

# Generate the reasoning output
response = generate(
 model, 
 processor, 
 prompt=prompt, 
 max_tokens=1000, 
 verbose=True
)

4. Host as a Local API Server

You can expose this model as a local, OpenAI-compatible API endpoint to connect to your IDEs or applications:

python3 -m mlx_vlm.server --model PishangShedappp/gemma-4-12B-it-claude-opus-4.8-max-thinking --port 8080

🛠️ Training Configuration

The model was fine-tuned with the following hyper-parameters using mlx_vlm.lora:

  • Iterations: 1,500
  • Batch Size: 1
  • Learning Rate: 2e-5
  • Trainable Parameters: 41.41 Million (0.347% of total weights)
  • Total Parameters: 11.95 Billion
  • Prompt Masking: Enabled via --train-on-completions (Loss calculated only on generation targets)
  • Peak Training Memory: 24.726 GB
  • Final Train Loss: 0.00001795

⚠️ Requirements

Because this is a 12B parameter model compressed to mxfp8 utilizing the gemma4_unified architecture, it requires a Mac with at least 32GB of Unified Memory to run comfortably during inference, and 64GB of Unified Memory for local LoRA re-training or handling massive context windows.


Downloads last month
853
Safetensors
Model size
3B params
Tensor type
U8
·
U32
·
BF16
·
F32
·
MLX
Hardware compatibility
Log In to add your hardware

8-bit

Model tree for PishangShedappp/gemma-4-12B-it-claude-opus-4.8-max-thinking-mxfp8-mlx

Quantized
(214)
this model

Dataset used to train PishangShedappp/gemma-4-12B-it-claude-opus-4.8-max-thinking-mxfp8-mlx

Collection including PishangShedappp/gemma-4-12B-it-claude-opus-4.8-max-thinking-mxfp8-mlx