Gemma-4-12B-it Claude Opus Max Thinking (MLX)
This model is a highly optimized, locally-trained version of Google's Gemma-4-12B-it, fine-tuned specifically to emulate the advanced step-by-step reasoning and "Max Thinking" cognitive architecture of Claude Opus.
Because the base Gemma 4 model utilizes the new gemma4_unified multi-modal architecture, this model requires the Apple MLX Vision-Language (mlx-vlm) framework to run, even for purely text-based reasoning tasks. It was fine-tuned entirely on Apple Silicon and is fused natively for macOS environments.
🧠 Model Description
- Base Model:
google/gemma-4-12B-it - Dataset:
11-47/claude_opus_4.8_max_thinking_5k_v2 - Framework: Apple MLX Vision-Language (
mlx_vlm) - Quantization: 8-bit Microscaling Float (
mxfp8) - Hardware: Trained natively on Apple Silicon (Unified Memory Architecture)
This model was trained using a custom LoRA adapter focusing purely on the output reasoning structures (--train-on-completions enabled) to prevent prompt-memorization and ensure robust zero-shot logical deduction.
🚀 How to Use (Apple Silicon)
This model is fully fused and ready for immediate deployment on Macs using the mlx-vlm package.
1. Install Dependencies
Make sure you have the multi-modal MLX ecosystem installed:
pip install mlx-vlm
2. Run via Command Line Interface (CLI)
You can prompt the model interactively straight from your terminal:
python3 -m mlx_vlm.generate \
--model PishangShedappp/gemma-4-12B-it-claude-opus-4.8-max-thinking \
--max-tokens 1000 \
--prompt "Explain why quantum computing is different from classical computing step-by-step."
3. Run via Python API
If you want to integrate this reasoning model into your own macOS applications or backend scripts:
from mlx_vlm import load, generate
# Load the fused model and processor
model, processor = load("PishangShedappp/gemma-4-12B-it-claude-opus-4.8-max-thinking")
# Format your prompt
prompt = "Explain why quantum computing is different from classical computing step-by-step."
# Generate the reasoning output
response = generate(
model,
processor,
prompt=prompt,
max_tokens=1000,
verbose=True
)
4. Host as a Local API Server
You can expose this model as a local, OpenAI-compatible API endpoint to connect to your IDEs or applications:
python3 -m mlx_vlm.server --model PishangShedappp/gemma-4-12B-it-claude-opus-4.8-max-thinking --port 8080
🛠️ Training Configuration
The model was fine-tuned with the following hyper-parameters using mlx_vlm.lora:
- Iterations: 1,500
- Batch Size: 1
- Learning Rate:
2e-5 - Trainable Parameters: 41.41 Million (0.347% of total weights)
- Total Parameters: 11.95 Billion
- Prompt Masking: Enabled via
--train-on-completions(Loss calculated only on generation targets) - Peak Training Memory: 24.726 GB
- Final Train Loss: 0.00001795
⚠️ Requirements
Because this is a 12B parameter model compressed to mxfp8 utilizing the gemma4_unified architecture, it requires a Mac with at least 32GB of Unified Memory to run comfortably during inference, and 64GB of Unified Memory for local LoRA re-training or handling massive context windows.
- Downloads last month
- 853
8-bit
