VOOZH about

URL: https://wiki.archlinux.org/title/Llama.cpp

⇱ Llama.cpp - ArchWiki


Jump to content
From ArchWiki

Large language model (LLM) inference in C/C++.

Installation

llama.cpp is available in the AUR:

Alternatively, follow the instructions on llama.app.

Usage

The primary executor is llama.

llama cli

llama cli returns an interactive prompt in command-line:

$ llama cli -m model.gguf

llama serve

llama serve launches an API server with a built-in WebUI:

$ llama serve --host address --port port -m model.gguf

llama help

llama help returns a list available commands:

$ llama help

Obtaining models

llama.cpp uses models in the GGUF format.

Download from Hugging Face

Download models from Hugging Face using the -hf flag:

$ llama cli -hf org/model
Warning This may overwrite an existing model file without prompting.

Manual download

Models can be downloaded manually using a full URL and a downloader such as wget or curl.

Tips and tricks

Model quantization

Quantization lowers model precision to reduce memory usage.

GGUF models use suffixes to indicate quantization level. Generally, lower numbers (e.g. Q4) use less memory but may reduce quality compared to higher numbers (e.g. Q8).

Knowledge distillation

Knowledge distillation compresses a larger model into a smaller model by training the smaller model to follow the behaviors of the larger model.

Typically, GGUF models indicate knowledge distillation using the student-teacher-distill denotation, where:

  • student represents the smaller model.
  • teacher represents the larger model.

Specifying context size

llama.cpp loads the context size from the model by default, and it allocates memory for the whole context window.

Specify a lower context size in case you run out of memory.

$ llama cli -c 32000 -m model.gguf

Key-value cache quantization

For further memory efficiency, you can quantize the key-value cache.

$ llama cli -ctk q8_0 -ctv q8_0 -m model.gguf

This, combined with a lower context size, can significantly reduce memory usage.

Note
  • Aggressive quantization on keys reduces quality noticeably.
  • Aggressive quantization on values is usually better tolerated, but still risks degradation.

Agent system

While the API server runs a WebUI, the same endpoint also operates as an OpenAI-compatible server. It can be configured to use with a coding agent like opencode.

Alternatively, the WebUI has introduced built-in agent capabilities.

Built-in tools

To enable built-in tools for filesystem operations and shell access, start llama-server with:

$ llama serve --tools all -m model.gguf

This, combined with a reasonably strong reasoning model, can be considered as a minimal coding agent running in web browser.

Warning

Be very aware, that all interactions are submitted to the operating system on the behalf of whoever is running llama-server. At no time should the API server be exposed to the network and/or running as root with built-in tools enabled!

Model Context Protocol servers

Other tools (e.g. web_search, fetch) can be integrated to the WebUI, given that the tools are served as MCP endpoints.

Monitoring GPU utilization

See Graphics processing unit#Monitoring.

Troubleshooting

MCP requests denied by CORS policy

To use the WebUI with an MCP endpoint hosted online, enable MCP CORS proxy:

$ llama serve --ui-mcp-proxy -m model.gguf

See also