Use MLX for running models on Apple Silicon Macs with Metal GPU acceleration.
The mlx-lm package provides a simple interface for loading and serving LLMs.
Installation
Install the MLX language model package:
pip install mlx-lm
Basic Usage
The mlx-lm package provides a simple interface for text generation with MLX models.
See the Models page for all available MLX models, or browse the mlx-community LFM2 models.
from mlx_lm import load, generate
# Load model and tokenizer
model, tokenizer = load("mlx-community/LFM2-1.2B-8bit")
# Generate text
prompt = "What is machine learning?"
# Apply chat template
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
response = generate(model, tokenizer, prompt=prompt, verbose=True)
print(response)
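The single-turn example above extends to a multi-turn conversation by keeping the message list and appending each reply to it. The following is a minimal sketch of that bookkeeping only: `chat_turn`, `generate_fn`, and `echo_backend` are hypothetical names introduced for illustration, and a stand-in backend replaces the model so the snippet runs without MLX installed.

```python
def chat_turn(messages, user_text, generate_fn):
    """Append a user turn, ask generate_fn for a reply, and record it.

    generate_fn is a hypothetical callable that takes the message list
    and returns the assistant's reply as a string.
    """
    messages.append({"role": "user", "content": user_text})
    reply = generate_fn(messages)
    messages.append({"role": "assistant", "content": reply})
    return reply


def echo_backend(messages):
    # Stand-in for a real model so the sketch runs anywhere
    return f"You said: {messages[-1]['content']}"


history = []
chat_turn(history, "What is machine learning?", echo_backend)
chat_turn(history, "Give me an example.", echo_backend)
print([m["role"] for m in history])
```

With mlx-lm, `generate_fn` could plausibly wrap `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` followed by `generate(model, tokenizer, prompt=prompt)`, so that each call sees the full history.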
Generation Parameters
Control text generation behavior using parameters in the generate() function. Key parameters:
- temperature (float, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
- top_p (float, default 1.0): Nucleus sampling; limits sampling to tokens with cumulative probability ≤ top_p. Typical range: 0.1-1.0
- top_k (int, default 50): Limits sampling to the top-k most probable tokens. Typical range: 1-100
- max_tokens (int): Maximum number of tokens to generate
- repetition_penalty (float, default 1.0): Penalty for repeating tokens (>1.0 = discourage repetition). Typical range: 1.0-1.5
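from mlx_lm.sample_utils import make_logits_processors, make_sampler

# Recent mlx-lm releases take sampling settings through a sampler and
# logits processors rather than as direct keyword arguments to generate().
# Note that make_sampler names the temperature argument "temp".
sampler = make_sampler(temp=0.3, min_p=0.15)
logits_processors = make_logits_processors(repetition_penalty=1.05)
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    sampler=sampler,
    logits_processors=logits_processors,
    max_tokens=512,
)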
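To build intuition for how the sampling parameters in this section interact, the sketch below applies temperature, top_k, top_p, and min_p to a toy set of logits in plain Python. It is an illustration under simplifying assumptions, not the mlx-lm implementation: `filtered_distribution` and `toy_logits` are hypothetical names, the filters are applied in a fixed order chosen for readability, and top_p keeps the smallest set of tokens whose cumulative probability reaches the threshold.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0):
    """Return {token: probability} after temperature scaling and filtering."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the most likely token
        best = max(logits, key=logits.get)
        return {best: 1.0}
    # Temperature scaling followed by a numerically stable softmax
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    # top_k: keep only the k most probable tokens (0 disables the filter)
    if top_k > 0:
        ranked = ranked[:top_k]
    # top_p: keep the smallest prefix whose cumulative probability reaches top_p
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    # min_p: drop tokens below min_p times the probability of the top token
    if min_p > 0.0:
        threshold = min_p * ranked[0][1]
        ranked = [(tok, p) for tok, p in ranked if p >= threshold]
    # Renormalize the surviving tokens so they sum to 1
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}


toy_logits = {"learning": 2.0, "models": 1.0, "data": 0.5, "banana": -3.0}
print(filtered_distribution(toy_logits, temperature=0.3, min_p=0.15))
print(filtered_distribution(toy_logits, temperature=1.0, min_p=0.15))
```

On this toy input, the low temperature sharpens the distribution enough that min_p leaves only the most likely token, while at temperature 1.0 the same min_p only removes the very unlikely one. This is why a low temperature combined with min_p tends to give focused output.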
Streaming Generation
Stream responses with stream_generate():
from mlx_lm import load, stream_generate
model, tokenizer = load("mlx-community/LFM2-1.2B-8bit")
messages = [{"role": "user", "content": "Tell me a story about space exploration."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
for response in stream_generate(model, tokenizer, prompt=prompt, max_tokens=512):
    # Recent mlx-lm releases yield response objects; the new text is in .text
    print(response.text, end="", flush=True)
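When the full text is needed after streaming (for logging or for appending to a chat history), the chunks can be accumulated as they arrive. The helper below is a small sketch with hypothetical names (`collect_stream`, `on_text`). It assumes each streamed item is either a plain string or an object exposing a `.text` attribute, which may vary with the installed mlx-lm version, and it is demonstrated with fake chunks so it runs without a model.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_text=None):
    """Consume a stream of chunks, optionally echo each piece, return the full text."""
    parts = []
    for chunk in chunks:
        # Accept plain strings as well as objects that carry the text in .text
        text = chunk if isinstance(chunk, str) else getattr(chunk, "text", "")
        if on_text is not None:
            on_text(text)
        parts.append(text)
    return "".join(parts)


# Fake chunks standing in for the output of stream_generate()
fake_stream = [SimpleNamespace(text="Once "), SimpleNamespace(text="upon "), "a time"]
story = collect_stream(fake_stream, on_text=lambda t: print(t, end="", flush=True))
print()
```

With a real model, the first argument would be the `stream_generate(...)` call from the example above.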
Serving with mlx-lm
MLX can serve models through an OpenAI-compatible API. Start a server with:
mlx_lm.server --model mlx-community/LFM2-1.2B-8bit --port 8080
Using the Server
Once running, use the OpenAI Python client:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)
response = client.chat.completions.create(
    model="mlx-community/LFM2-1.2B-8bit",
    messages=[
        {"role": "user", "content": "Explain quantum computing."}
    ],
    temperature=0.3,
    max_tokens=512,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
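Because the server speaks the OpenAI chat completions protocol, the same request can also be sent without the openai package. The sketch below uses only the Python standard library. `build_chat_request` and `ask` are hypothetical helper names, and the URL and model name are taken from the examples above. Building the request needs no running server; calling `ask` does.

```python
import json
import urllib.request


def build_chat_request(prompt, model="mlx-community/LFM2-1.2B-8bit",
                       base_url="http://localhost:8080/v1", **sampling):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **sampling,  # e.g. temperature, max_tokens, min_p
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **sampling):
    """Send the request and return the assistant text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **sampling)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "Explain quantum computing.", temperature=0.3, max_tokens=512, min_p=0.15
)
print(request.full_url)
```

Extra sampling fields such as min_p go directly into the JSON body here, which is what the `extra_body` argument does in the OpenAI client example.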
Vision Models
LFM2-VL models support both text and image inputs for multimodal inference. Use mlx_vlm to load and generate with vision models:
