URL: https://huggingface.co/arpdevgroup/gemma-4-31B-it-claude-opus-4.8-max-thinking-mxfp8-mlx


Gemma-4-31B-it Claude Opus Max Thinking (MLX)

This model is a locally trained fine-tune of Google's Gemma-4-31B-it, tuned to emulate the step-by-step reasoning style ("Max Thinking") of Claude Opus.

It was fine-tuned entirely on Apple Silicon using the Apple MLX framework, and the adapter weights are fused into the model for direct use on macOS.

🧠 Model Description

  • Base Model: google/gemma-4-31B-it
  • Dataset: 11-47/claude_opus_4.8_max_thinking_5k_v2
  • Framework: Apple MLX (mlx_lm)
  • Quantization: 8-bit Microscaling Float (mxfp8)
  • Hardware: Trained natively on an Apple M5 Pro (Unified Memory Architecture)

This model was trained with a custom LoRA adapter and prompt masking enabled (--mask-prompt), so the loss is computed only on the response tokens that carry the reasoning. This is intended to discourage prompt memorization and improve zero-shot logical deduction.
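To make the effect of prompt masking concrete, here is a minimal sketch in plain Python of a cross-entropy loss that ignores prompt tokens. It is illustrative only and is not the mlx_lm implementation; the function name masked_nll and the toy probabilities are assumptions made for the example.

```python
import math

def masked_nll(token_probs, prompt_len):
    """Mean negative log-likelihood over response tokens only.

    token_probs: probability the model assigned to each correct token,
                 in sequence order (prompt tokens first, then response).
    prompt_len:  number of leading tokens that belong to the prompt.
    """
    response_probs = token_probs[prompt_len:]
    if not response_probs:
        raise ValueError("sequence has no response tokens to train on")
    return -sum(math.log(p) for p in response_probs) / len(response_probs)

# Toy sequence: 3 prompt tokens followed by 2 response tokens.
probs = [0.1, 0.2, 0.3, 0.5, 0.25]
loss_masked = masked_nll(probs, prompt_len=3)    # response tokens only
loss_unmasked = masked_nll(probs, prompt_len=0)  # every token contributes
print(f"masked: {loss_masked:.4f}  unmasked: {loss_unmasked:.4f}")
```

With masking, poorly predicted prompt tokens no longer contribute to the gradient, so the adapter's capacity is spent on the reasoning output rather than on reproducing the prompt.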

🚀 How to Use (Apple Silicon)

This model is fully fused and ready for immediate deployment on Macs using the MLX framework.

1. Install Dependencies

Make sure you have the MLX ecosystem installed:

pip install mlx-lm

2. Run via Command Line Interface (CLI)

You can chat with the model interactively straight from your terminal:

mlx_lm.chat --model PishangShedappp/gemma-4-31B-it-claude-opus-4.8-max-thinking-mxfp8-mlx

3. Run via Python API

If you want to integrate this reasoning model into your own macOS applications or backend scripts:

from mlx_lm import load, generate

# Load the model and tokenizer
model, tokenizer = load("PishangShedappp/gemma-4-31B-it-claude-opus-4.8-max-thinking-mxfp8-mlx")

# Format your prompt using the Gemma chat template
messages = [{"role": "user", "content": "Explain why quantum computing is different from classical computing step-by-step."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate the reasoning output
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=1000,
    verbose=True,
)

4. Host as a Local API Server

You can expose this model as a local, OpenAI-compatible API endpoint:

mlx_lm.server --model PishangShedappp/gemma-4-31B-it-claude-opus-4.8-max-thinking-mxfp8-mlx --port 8080
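
Once the server is running, any OpenAI-style client should be able to talk to it. The following is a minimal sketch using only the Python standard library. It assumes the server listens on localhost:8080 as in the command above and exposes the usual /v1/chat/completions route; the helper names build_chat_request and ask are hypothetical.

```python
import json
import urllib.request

# Assumption: the server was started with --port 8080 on this machine.
BASE_URL = "http://localhost:8080"

def build_chat_request(user_message, max_tokens=512):
    """Build an OpenAI-style chat completion request (not sent yet)."""
    payload = {
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(user_message):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_message)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires the server to be running):
# print(ask("Why is the sky blue? Think step by step."))
```

Keeping request construction separate from sending makes it easy to inspect the payload before any network call is made.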

🛠️ Training Configuration

The model was fine-tuned with the following hyper-parameters using mlx_lm.lora:

  • Iterations: 1,500
  • Batch Size: 1
  • Learning Rate: 2e-5
  • Adapter Rank (LoRA): 8 (16.3M Trainable Parameters)
  • Prompt Masking: Enabled (Loss calculated only on generation targets)
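
As a rough guide to reproducing a run like this, the sketch below assembles an mlx_lm.lora command line from the values listed above. It is an assumption about how the run might have been launched, not the author's actual command: the dataset path is a hypothetical placeholder, and the adapter rank is left out because it is usually set through a YAML config file rather than a command-line flag.

```python
import shlex

# Hyper-parameters taken from the list above.
hparams = {
    "--iters": 1500,
    "--batch-size": 1,
    "--learning-rate": 2e-5,
}

cmd = [
    "mlx_lm.lora",
    "--model", "google/gemma-4-31B-it",
    "--train",
    "--data", "path/to/dataset",  # hypothetical local path
    "--mask-prompt",              # compute loss on response tokens only
]
for flag, value in hparams.items():
    cmd += [flag, str(value)]

# Print a shell-safe command string for copy and paste.
print(shlex.join(cmd))
```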

⚠️ Requirements

Because this is a 31B-parameter model quantized to mxfp8, it needs a Mac with at least 48 GB of unified memory for comfortable inference, and 64 GB for long-context workloads or local re-training.
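
A back-of-the-envelope estimate shows where these figures come from. The sketch assumes the standard Microscaling layout of one shared 8-bit scale per block of 32 elements and treats every weight as quantized; neither detail is stated in this card, and in practice some tensors may be kept in higher precision.

```python
# Assumptions (not from the model card): MX formats store one shared
# 8-bit scale per block of 32 elements, and every weight is quantized.
PARAMS = 31e9
ELEMENT_BITS = 8
SCALE_BITS = 8
BLOCK_SIZE = 32

# Effective storage cost per weight, including the amortized block scale.
bits_per_weight = ELEMENT_BITS + SCALE_BITS / BLOCK_SIZE
weight_bytes = PARAMS * bits_per_weight / 8
weight_gib = weight_bytes / 2**30

print(f"{bits_per_weight} bits/weight, ~{weight_gib:.1f} GiB of weights")
```

Under these assumptions the weights alone take roughly 30 GiB. The KV cache, activations, and macOS itself need headroom on top of that, which is consistent with the 48 GB recommendation.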

LM Studio Notes

This model needs the LM Studio per-model default config to stop on: <turn|>, <|turn>, <channel|>, <|channel>, and <pad>.

The LM Studio prompt template has been simplified to plain Gemma turn blocks so that it does not depend on renderer-specific message.get(...) behavior. The thinking channel is now emitted only when enable_thinking is explicitly enabled.
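
To illustrate what the stop strings listed above accomplish, here is a small sketch that cuts generated text at the earliest stop string. It mimics the general behavior of stop sequences and is not LM Studio's implementation; the function name truncate_at_stop and the sample text are assumptions.

```python
STOP_STRINGS = ["<turn|>", "<|turn>", "<channel|>", "<|channel>", "<pad>"]

def truncate_at_stop(text, stops=STOP_STRINGS):
    """Return text up to (not including) the earliest stop string."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# Without stop strings, the model may run on into a new turn.
raw = "The answer is 42.<turn|><|turn>user\nNext question"
print(truncate_at_stop(raw))
```

If these strings are missing from the configuration, output like the sample above would leak turn markers and spurious follow-up turns into the chat window.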

Repository Metadata

  • Downloads last month: 474
  • Format: Safetensors (MLX)
  • Model size: 31B params
  • Tensor types: U8, U32, BF16
  • Quantization: 8-bit
