Use llama.cpp for CPU-only environments, local development, edge deployment, and on-device inference.
Installation
- macOS/Linux
- Pre-built Binaries
- Build from Source
Install via Homebrew:
brew install llama.cpp
Downloading GGUF Models
llama.cpp uses the GGUF format, which stores quantized model weights for efficient inference. All LFM models are available in GGUF format on Hugging Face; see the Models page for the full list. To download one, install the Hugging Face CLI and fetch the model file:
uv pip install huggingface-hub
hf download LiquidAI/LFM2.5-1.2B-Instruct-GGUF lfm2.5-1.2b-instruct-q4_k_m.gguf --local-dir .
Basic Usage
llama.cpp offers two main interfaces for running inference: llama-server (OpenAI-compatible server) and llama-cli (interactive CLI).
- llama-server
- llama-cli
llama-server provides an OpenAI-compatible API for serving models locally.
Starting the Server:
llama-server -hf LiquidAI/LFM2.5-1.2B-Instruct-GGUF -c 4096 --port 8080
The -hf flag downloads the model directly from Hugging Face. Alternatively, use a local model file:
llama-server -m lfm2.5-1.2b-instruct-q4_k_m.gguf -c 4096 --port 8080
Key parameters:
- -hf: Hugging Face model ID (downloads automatically)
- -m: Path to local GGUF model file
- -c: Context length (default: 4096)
- --port: Server port (default: 8080)
- -ngl 99: Offload layers to GPU (if available)
Once the server is running at http://localhost:8080, use the OpenAI Python client:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)
response = client.chat.completions.create(
    model="lfm2.5-1.2b-instruct",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.1,
    max_tokens=512,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
Using curl:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "lfm2.5-1.2b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.1,
    "top_k": 50,
    "repetition_penalty": 1.05
  }'
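Both examples above send the same JSON body to the /v1/chat/completions endpoint. As a minimal sketch using only the Python standard library, the request could be assembled and posted roughly as follows. The helper names build_chat_payload and post_chat are hypothetical, and the sketch assumes a llama-server instance is already listening on http://localhost:8080.

```python
import json
import urllib.request


def build_chat_payload(prompt, model="lfm2.5-1.2b-instruct",
                       temperature=0.1, top_k=50, repetition_penalty=1.05):
    """Assemble the JSON body used in the curl example (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
    }


def post_chat(payload, base_url="http://localhost:8080/v1", timeout=60):
    """POST the payload to the chat completions endpoint and return the reply text.

    Assumes a llama-server instance is reachable at base_url.
    """
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # The response follows the OpenAI chat completions schema.
    return body["choices"][0]["message"]["content"]
```

With the server running, post_chat(build_chat_payload("Hello!")) should return the assistant's reply as a string. Keeping payload construction separate from the network call makes it possible to inspect or log the request body without contacting the server.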
Generation Parameters
Control text generation behavior using parameters in the OpenAI-compatible API or command-line flags. Key parameters:
- temperature (float): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
- top_p (float): Nucleus sampling; limits sampling to tokens with cumulative probability ≤ top_p. Typical range: 0.1-1.0
- top_k (int): Limits sampling to the top-k most probable tokens. Typical range: 1-100
- min_p (float): Filters tokens below min_p * max_probability. Typical range: 0.05-0.3
- max_tokens / --n-predict (int): Maximum number of tokens to generate
- repetition_penalty / --repeat-penalty (float): Penalty for repeating tokens (>1.0 = discourage repetition). Typical range: 1.0-1.5
- stop (str or list[str]): Strings that terminate generation when encountered
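To make the filtering parameters concrete, here is a small illustration of top_k, top_p, and min_p applied to a toy next-token distribution. It follows the descriptions above and is not llama.cpp's actual sampler code; the function names and the example probabilities are invented for illustration. One assumption to note: the top_p sketch keeps the token that crosses the threshold, as samplers commonly do.

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    return {tok: probs[tok] for tok in keep}


def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative probability reaches p."""
    kept, total = {}, 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        kept[tok] = probs[tok]
        total += probs[tok]
        if total >= p:  # assumption: the token that crosses p is kept
            break
    return kept


def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p * max_probability."""
    threshold = min_p * max(probs.values())
    return {tok: q for tok, q in probs.items() if q >= threshold}


def renormalize(probs):
    """Rescale the surviving probabilities so they sum to 1 before sampling."""
    total = sum(probs.values())
    return {tok: q / total for tok, q in probs.items()}


# Toy distribution over four candidate tokens (made-up numbers).
toy = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}

print(sorted(top_k_filter(toy, 2)))
print(sorted(top_p_filter(toy, 0.7)))
print(sorted(min_p_filter(toy, 0.2)))
print(renormalize(top_k_filter(toy, 2)))
```

Lower top_k, top_p, or higher min_p values all shrink the candidate set, which makes output more predictable; temperature then controls how sharply the remaining probabilities are weighted.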
For command-line tools (llama-cli), use flags like --temp, --top-p, --top-k, --min-p, --repeat-penalty, and --n-predict.
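As a rough sketch of how the API parameter names line up with the command-line flags, the helper below translates API-style keyword arguments into llama-cli arguments. The function name sampling_flags and the mapping table are illustrative, built only from the parameter and flag names listed in this section; the model filename and prompt are reused from earlier examples.

```python
import shlex

# API parameter name -> llama-cli flag, based on the names listed in this section.
API_TO_CLI_FLAG = {
    "temperature": "--temp",
    "top_p": "--top-p",
    "top_k": "--top-k",
    "min_p": "--min-p",
    "repetition_penalty": "--repeat-penalty",
    "max_tokens": "--n-predict",
}


def sampling_flags(**params):
    """Translate API-style sampling parameters into llama-cli arguments."""
    args = []
    for name, value in params.items():
        if name not in API_TO_CLI_FLAG:
            raise ValueError(f"no known CLI flag for parameter: {name}")
        args += [API_TO_CLI_FLAG[name], str(value)]
    return args


command = ["llama-cli", "-m", "lfm2.5-1.2b-instruct-q4_k_m.gguf",
           "-p", "What is machine learning?"]
command += sampling_flags(temperature=0.1, top_k=50,
                          repetition_penalty=1.05, max_tokens=512)

# shlex.join quotes the prompt so the line can be pasted into a shell.
print(shlex.join(command))
```

The resulting list can also be passed directly to subprocess.run, which avoids shell quoting issues entirely.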
Vision Models
LFM2-VL GGUF models can be used for multimodal inference with llama.cpp.
Quick Start with llama-cli
Download llama.cpp binaries and run vision inference directly:
wget https://github.com/ggml-org/llama.cpp/releases/download/b7633/llama-b7633-bin-ubuntu-x64.tar.gz
tar -xzf llama-b7633-bin-ubuntu-x64.tar.gz
# Download a test image (Python)
import requests
image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
img_data = requests.get(image_url).content
with open("test_image.jpg", "wb") as f:
    f.write(img_data)
llama-b7633/llama-cli \
  -hf LiquidAI/LFM2.5-VL-1.6B-GGUF:Q4_0 \
  --image test_image.jpg \
  --image-max-tokens 64 \
  -p "What's in this image?" \
  -n 128 \
  --temp 0.1 --min-p 0.15 --repeat-penalty 1.05
The -hf flag downloads the model directly from Hugging Face. Use --image-max-tokens to control the image token budget.
Alternative: Manual Model Download
If you prefer to download models manually:
uv pip install huggingface-hub
hf download LiquidAI/LFM2-VL-1.6B-GGUF LFM2-VL-1.6B-Q8_0.gguf --local-dir .
hf download LiquidAI/LFM2-VL-1.6B-GGUF mmproj-LFM2-VL-1.6B-Q8_0.gguf --local-dir .
For a complete working example with step-by-step instructions, see the llama.cpp Vision Model Colab notebook.
Converting Custom Models
If you have a fine-tuned model or need to create a GGUF from a Hugging Face model:
# Clone llama.cpp if you haven't already
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Convert model with quantization
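python convert_hf_to_gguf.py /path/to/your/model --outfile model.gguf --outtype q8_0
Use --outtype to specify the output type (e.g., f32, f16, bf16, q8_0). Other quantization levels (e.g., q4_0, q4_k_m, q5_k_m, q6_k) are not produced by the conversion script; to get those, convert to f16 first and then run the llama-quantize tool on the resulting GGUF file.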
