Use MLX for running models on Apple Silicon Macs with Metal GPU acceleration.

MLX takes advantage of the unified memory architecture on Apple Silicon, so the CPU and GPU can work on the same data without copying it between them. The mlx-lm package provides a simple interface for loading and serving LLMs.

Installation

Install the MLX language model package:

pip install mlx-lm

Basic Usage

For text generation, mlx-lm exposes two functions: load() and generate(). See the Models page for all available MLX models, or browse the LFM2 models published under mlx-community.

from mlx_lm import load, generate

# Load model and tokenizer
model, tokenizer = load("mlx-community/LFM2-1.2B-8bit")

# Wrap the user prompt in the model's chat template
prompt = "What is machine learning?"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Generate text (verbose=True also prints the output and timing stats as it runs)
response = generate(model, tokenizer, prompt=prompt, verbose=True)
print(response)
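
The messages list can also carry a system prompt and earlier turns. The following is a minimal sketch, assuming a hypothetical helper named build_messages (it is not part of mlx-lm) and the common system/user/assistant role convention. Whether a system role is accepted depends on the model's chat template.

```python
def build_messages(user_prompt, system_prompt=None, history=None):
    """Assemble a chat-style message list (hypothetical helper, not part of mlx-lm)."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    # history holds earlier (user, assistant) exchanges, oldest first
    for user_turn, assistant_turn in history or []:
        messages.append({"role": "user", "content": user_turn})
        messages.append({"role": "assistant", "content": assistant_turn})
    # The new user turn always goes last
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_messages(
    "What is machine learning?",
    system_prompt="You are a concise assistant.",
)
```

The resulting list can be passed to tokenizer.apply_chat_template in place of the single-message list used above.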

Generation Parameters

Control text generation behavior using parameters in the generate() function. Key parameters:

temperature (float, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
top_p (float, default 1.0): Nucleus sampling - limits to tokens with cumulative probability ≤ top_p. Typical range: 0.1-1.0
top_k (int, default 50): Limits to top-k most probable tokens. Typical range: 1-100
min_p (float): Drops tokens whose probability is below min_p times the probability of the most likely token
max_tokens (int): Maximum number of tokens to generate
repetition_penalty (float, default 1.0): Penalty for repeating tokens (>1.0 = discourage repetition). Typical range: 1.0-1.5
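
To make the filtering parameters concrete, here is a minimal pure-Python sketch of how temperature, top_k, top_p, and min_p narrow a next-token distribution. The function name filter_probs is hypothetical and this is not mlx-lm's implementation: real samplers operate on tensors and may apply the filters in a different order. Repetition penalty is left out because it depends on previously generated tokens.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0):
    """Return the renormalized {token_index: probability} left after filtering.

    Illustrative only; top_k=0, top_p=1.0 and min_p=0.0 disable each filter.
    """
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the argmax
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Temperature-scaled softmax (subtract the max for numerical stability)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = {i: e / total for i, e in enumerate(exps)}

    # Rank tokens from most to least probable
    ranked = sorted(probs, key=probs.get, reverse=True)

    if top_k > 0:
        # Keep only the k most probable tokens
        ranked = ranked[:top_k]

    if top_p < 1.0:
        # Keep the smallest prefix whose cumulative probability reaches top_p
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept

    if min_p > 0.0:
        # Drop tokens far less likely than the best remaining token
        threshold = min_p * probs[ranked[0]]
        ranked = [i for i in ranked if probs[i] >= threshold]

    mass = sum(probs[i] for i in ranked)
    return {i: probs[i] / mass for i in ranked}


logits = [2.0, 1.0, 0.5, -1.0]
print(filter_probs(logits, temperature=0.3, min_p=0.15))
```

Lower temperatures sharpen the distribution before any filter runs, which is why a low temperature combined with min_p tends to leave very few candidate tokens.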

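Example with custom parameters. Recent mlx-lm releases take sampling settings through a sampler object and penalties through logits processors, rather than as direct keyword arguments to generate():

from mlx_lm.sample_utils import make_logits_processors, make_sampler

# Sampling settings (the argument is named temp, not temperature)
sampler = make_sampler(temp=0.3, min_p=0.15)
# Penalties are applied as logits processors
logits_processors = make_logits_processors(repetition_penalty=1.05)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    sampler=sampler,
    logits_processors=logits_processors,
    max_tokens=512,
)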

Streaming Generation

Stream responses with stream_generate():

from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/LFM2-1.2B-8bit")

messages = [{"role": "user", "content": "Tell me a story about space exploration."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# In recent mlx-lm releases each item is a response object; .text holds the new text segment
for response in stream_generate(model, tokenizer, prompt=prompt, max_tokens=512):
    print(response.text, end="", flush=True)
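
When streaming, it is often useful to keep the full text as well as print it. The following is a minimal sketch, assuming a hypothetical helper named collect_stream that accepts either plain strings or objects with a .text attribute. A list of strings stands in for the real stream so the example runs without a model.

```python
import time


def collect_stream(chunks):
    """Print chunks as they arrive; return (full_text, chunk_count, seconds)."""
    pieces = []
    start = time.perf_counter()
    for chunk in chunks:
        # Accept plain strings or objects exposing a .text attribute
        text = chunk if isinstance(chunk, str) else chunk.text
        print(text, end="", flush=True)
        pieces.append(text)
    elapsed = time.perf_counter() - start
    print()  # finish the line once the stream ends
    return "".join(pieces), len(pieces), elapsed


# Stand-in for stream_generate(model, tokenizer, prompt=prompt, max_tokens=512)
fake_stream = iter(["Once ", "upon ", "a ", "time."])
text, n_chunks, seconds = collect_stream(fake_stream)
```

With a real model, pass the stream_generate(...) call in place of fake_stream. Note that the chunk count is not necessarily the token count, since a chunk may hold more than one token's worth of text.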

Serving with mlx-lm

MLX can serve models through an OpenAI-compatible API. Start a server with:

mlx_lm.server --model mlx-community/LFM2-1.2B-8bit --port 8080

Using the Server

Once running, use the OpenAI Python client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="mlx-community/LFM2-1.2B-8bit",
    messages=[
        {"role": "user", "content": "Explain quantum computing."}
    ],
    temperature=0.3,
    max_tokens=512,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)

You can also use curl to interact with the server by sending a POST request with a JSON body to the chat completions endpoint under /v1.
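
Because the server speaks an OpenAI-compatible API, any HTTP client can call it. The following is a minimal sketch using only the Python standard library, assuming the server from the previous section is listening on localhost:8080 and that the endpoint path is /v1/chat/completions (the path the OpenAI client appends to base_url). The helper names build_chat_request and send_chat_request are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080/v1",
                       model="mlx-community/LFM2-1.2B-8bit"):
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the reply text; requires a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Explain quantum computing.")
# print(send_chat_request(request))  # uncomment once the server is running
```

Splitting request construction from sending makes the payload easy to inspect before anything goes over the wire.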

Vision Models

LFM2-VL models accept both text and image inputs for multimodal inference. Use the mlx-vlm package (imported as mlx_vlm) to load these models and generate with them.
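
The following is a minimal sketch of image-plus-text generation with mlx_vlm. It assumes the call signatures shown in the mlx-vlm README at the time of writing, which may differ between versions. The wrapper name describe_image is hypothetical, and the model ID in the usage comment is a placeholder, not a confirmed checkpoint name.

```python
def describe_image(model_path, image, prompt="Describe this image."):
    """Run one image+text prompt through a vision model with mlx-vlm.

    Hypothetical wrapper; the mlx_vlm calls follow the package README
    and are an assumption about the installed version.
    """
    # Imported lazily so the function can be defined without mlx-vlm installed
    from mlx_vlm import load, generate
    from mlx_vlm.prompt_utils import apply_chat_template
    from mlx_vlm.utils import load_config

    model, processor = load(model_path)
    config = load_config(model_path)

    images = [image]  # local path or URL
    # Insert the image placeholder(s) and chat markup the model expects
    formatted = apply_chat_template(
        processor, config, prompt, num_images=len(images)
    )
    return generate(model, processor, formatted, images, verbose=False)


# Example call (placeholder model ID; substitute a real LFM2-VL MLX checkpoint):
# print(describe_image("<lfm2-vl-mlx-model-id>", "photo.jpg"))
```

mlx-vlm is installed separately from mlx-lm, with pip install mlx-vlm.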


URL: https://docs.liquid.ai/deployment/on-device/mlx

⇱ MLX - Liquid Docs
