Gemma-4-31B-it Claude Opus Max Thinking (MLX)

This model is a fine-tuned version of Google's Gemma-4-31B-it, trained locally to emulate the step-by-step reasoning style ("Max Thinking") of Claude Opus.

It was fine-tuned entirely on Apple Silicon with the Apple MLX framework, and the LoRA adapter is fused into the weights for use on macOS.

🧠 Model Description

Base Model: google/gemma-4-31B-it
Dataset: 11-47/claude_opus_4.8_max_thinking_5k_v2
Framework: Apple MLX (mlx_lm)
Quantization: 8-bit Microscaling Float (mxfp8)
Hardware: Trained natively on an Apple M5 Pro (Unified Memory Architecture)

This model was trained with a LoRA adapter and prompt masking (--mask-prompt enabled): the loss is computed only on the response tokens, not on the prompt. The intent is to keep the model from memorizing prompts and to focus training on the reasoning structure of the outputs.
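To make the effect of prompt masking concrete, here is a minimal sketch in plain Python. It is not the mlx_lm implementation; the function name `masked_nll` and the toy log-probabilities are illustrative assumptions. It only shows that tokens belonging to the prompt are dropped before the loss is averaged.

```python
import math

def masked_nll(token_logprobs, prompt_len):
    """Mean negative log-likelihood over completion tokens only.

    token_logprobs: log-probability the model assigned to each target token.
    prompt_len: number of leading tokens that belong to the prompt.
    (Hypothetical helper for illustration, not part of mlx_lm.)
    """
    completion = token_logprobs[prompt_len:]
    if not completion:
        raise ValueError("no completion tokens to train on")
    return -sum(completion) / len(completion)

# Toy example: 3 prompt tokens followed by 2 completion tokens.
logprobs = [math.log(0.1), math.log(0.2), math.log(0.1),
            math.log(0.5), math.log(0.25)]

full_loss = masked_nll(logprobs, prompt_len=0)    # prompt tokens included
masked_loss = masked_nll(logprobs, prompt_len=3)  # prompt tokens masked out

print(f"unmasked: {full_loss:.4f}  masked: {masked_loss:.4f}")
```

With masking, poorly predicted prompt tokens no longer contribute to the gradient, so the adapter is only pushed toward reproducing the response.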

🚀 How to Use (Apple Silicon)

This model is fully fused and ready for immediate deployment on Macs using the MLX framework.

1. Install Dependencies

Make sure you have the MLX ecosystem installed:

pip install mlx-lm

2. Run via Command Line Interface (CLI)

You can chat with the model interactively straight from your terminal:

mlx_lm.chat --model PishangShedappp/gemma-4-31B-it-claude-opus-4.8-max-thinking-mxfp8-mlx

3. Run via Python API

If you want to integrate this reasoning model into your own macOS applications or backend scripts:

from mlx_lm import load, generate

# Load the model and tokenizer
model, tokenizer = load("PishangShedappp/gemma-4-31B-it-claude-opus-4.8-max-thinking-mxfp8-mlx")

# Format your prompt using the Gemma chat template
messages = [{"role": "user", "content": "Explain why quantum computing is different from classical computing step-by-step."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate the reasoning output
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=1000,
    verbose=True,
)

4. Host as a Local API Server

You can expose this model as a local, OpenAI-compatible API endpoint:

mlx_lm.server --model PishangShedappp/gemma-4-31B-it-claude-opus-4.8-max-thinking-mxfp8-mlx --port 8080
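Once the server is running, any OpenAI-style client can talk to it. The following is a sketch using only the Python standard library. It assumes the server listens on 127.0.0.1:8080 and serves the `/v1/chat/completions` route; the helper names `build_chat_request` and `ask` are illustrative, not part of mlx_lm.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://127.0.0.1:8080", max_tokens=512):
    """Build a POST request for an OpenAI-style chat completion."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Requires the local server to be running; assumes the usual
    OpenAI response layout (choices -> message -> content).
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (only works while the server is up):
# print(ask("Summarize LoRA in two sentences."))
```

Separating request construction from sending makes the payload easy to inspect or log before any network call happens.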

🛠️ Training Configuration

The model was fine-tuned with the following hyper-parameters using mlx_lm.lora:

Iterations: 1,500
Batch Size: 1
Learning Rate: 2e-5
Adapter Rank (LoRA): 8 (16.3M Trainable Parameters)
Prompt Masking: Enabled (Loss calculated only on generation targets)
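For readers who want to attempt a similar run, the settings above map onto mlx_lm.lora command-line flags roughly as sketched below. This is a reconstruction, not the author's actual command: the data directory is a hypothetical placeholder, and the adapter rank is left out because it is normally set in a YAML config file rather than on the command line.

```python
import shlex

def lora_command(model, data_dir, iters=1500, batch_size=1, learning_rate=2e-5):
    """Assemble an mlx_lm.lora invocation from the listed hyper-parameters.

    data_dir is a placeholder; the original training data layout is not
    described in this card.
    """
    return [
        "mlx_lm.lora",
        "--model", model,
        "--train",
        "--data", data_dir,
        "--iters", str(iters),
        "--batch-size", str(batch_size),
        "--learning-rate", str(learning_rate),
        "--mask-prompt",  # compute loss on response tokens only
    ]

cmd = lora_command("google/gemma-4-31B-it", "path/to/data")
print(shlex.join(cmd))  # copy-pasteable shell command
```

To actually launch training, the list could be passed to `subprocess.run(cmd, check=True)` on a machine with enough unified memory.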

⚠️ Requirements

Because this is a 31B-parameter model quantized to mxfp8, it needs a Mac with at least 48 GB of unified memory for comfortable inference, and 64 GB for high-context workloads or local re-training.
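A back-of-the-envelope estimate shows where the 48 GB figure comes from. The sketch below assumes the standard microscaling layout (8-bit elements sharing one 8-bit scale per block of 32) and, as a simplification, that every parameter is quantized. In practice some tensors typically stay in higher precision, and the KV cache and activations need additional room.

```python
def mx_weight_bytes(n_params, element_bits=8, block_size=32, scale_bits=8):
    """Approximate storage for microscaling-quantized weights.

    Each block of `block_size` elements shares one `scale_bits` scale,
    so the effective cost is element_bits + scale_bits / block_size
    bits per weight (8.25 bits with the defaults).
    """
    bits_per_weight = element_bits + scale_bits / block_size
    return n_params * bits_per_weight / 8

weights_gib = mx_weight_bytes(31e9) / 2**30
print(f"approx. weight memory: {weights_gib:.1f} GiB")
```

The weights alone come to roughly 30 GiB under these assumptions, which leaves headroom on a 48 GB machine for the operating system, the KV cache, and other applications.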

LM Studio Notes

This model needs the LM Studio per-model default config to stop on: <turn|>, <|turn>, <channel|>, <|channel>, and <pad>.
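Outside LM Studio, the same stop behavior can be approximated by trimming the generated text at the first stop string. The sketch below is a post-processing workaround, not an LM Studio or mlx_lm API; the helper name `truncate_at_stop` is illustrative.

```python
# Stop strings listed in the LM Studio note above.
STOP_STRINGS = ["<turn|>", "<|turn>", "<channel|>", "<|channel>", "<pad>"]

def truncate_at_stop(text, stops=STOP_STRINGS):
    """Return text up to (not including) the earliest stop string."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

raw = "The answer is 42.<turn|>leftover tokens<pad>"
print(truncate_at_stop(raw))
```

Trimming after generation still spends compute on the discarded tokens, so configuring real stop sequences in the runtime is preferable where it is supported.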

The LM Studio prompt template has been simplified to plain Gemma turn blocks so it does not depend on renderer-specific message.get(...) behavior. The thinking channel now only emits when enable_thinking is explicitly enabled.

URL: https://huggingface.co/arpdevgroup/gemma-4-31B-it-claude-opus-4.8-max-thinking-mxfp8-mlx
