llama-cpp-python 0.3.30

pip install llama-cpp-python

Latest release

Released: Jun 16, 2026

Python bindings for the llama.cpp library

Maintainers

abetlen

Python Bindings for `llama.cpp`

Simple Python bindings for @ggerganov's llama.cpp library. This package provides:

- Low-level access to C API via ctypes interface.
- High-level Python API for text completion
  - OpenAI-like API
  - LangChain compatibility
  - LlamaIndex compatibility
- OpenAI compatible web server

Documentation is available at https://llama-cpp-python.readthedocs.io/en/latest.

Installation

Requirements:

- Python 3.8+
- C compiler
  - Linux: gcc or clang
  - Windows: Visual Studio or MinGW
  - MacOS: Xcode

To install the package, run:

pip install llama-cpp-python

This will also build llama.cpp from source and install it alongside this python package.

If this fails, add --verbose to the pip install command to see the full cmake build log.

Pre-built Wheel (New)

It is also possible to install a pre-built wheel with basic CPU support.

pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

Installation Configuration

llama.cpp supports a number of hardware acceleration backends to speed up inference as well as backend specific options. See the llama.cpp README for a full list.

All llama.cpp cmake build options can be set via the CMAKE_ARGS environment variable or via the --config-settings / -C cli flag during installation.

Supported Backends

Below are some common backends, their build commands and any additional environment variables required.

Windows Notes

MacOS Notes

Detailed MacOS Metal GPU install documentation is available at docs/install/macos.md

Upgrading and Reinstalling

To upgrade and rebuild llama-cpp-python, add the --upgrade --force-reinstall --no-cache-dir flags to the pip install command so that the package is rebuilt from source.

High-level API

API Reference

The high-level API provides a simple managed interface through the Llama class.

Below is a short example demonstrating how to use the high-level API for basic text completion:

from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/llama-model.gguf",
    # n_gpu_layers=-1,  # Uncomment to use GPU acceleration
    # seed=1337,  # Uncomment to set a specific seed
    # n_ctx=2048,  # Uncomment to increase the context window
)
output = llm(
    "Q: Name the planets in the solar system? A: ",  # Prompt
    max_tokens=32,  # Generate up to 32 tokens, set to None to generate up to the end of the context window
    stop=["Q:", "\n"],  # Stop generating just before the model would generate a new question
    echo=True,  # Echo the prompt back in the output
)  # Generate a completion, can also call create_completion
print(output)

By default llama-cpp-python generates completions in an OpenAI compatible format:

{
  "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "object": "text_completion",
  "created": 1679561337,
  "model": "./models/7B/llama-model.gguf",
  "choices": [
    {
      "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",
      "index": 0,
      "logprobs": None,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 28,
    "total_tokens": 42
  }
}

Text completion is available through the __call__ and create_completion methods of the Llama class.
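
Because the completion comes back as a plain dict in the format shown above, the generated text and token counts can be read with ordinary indexing. The sketch below uses a hand-written sample in place of a real model call, and `completion_text` is a hypothetical helper name, not part of llama-cpp-python.

```python
# Hand-written sample mirroring the documented completion format (not real model output).
sample_output = {
    "id": "cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "object": "text_completion",
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A: Mercury, Venus",
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 14, "completion_tokens": 4, "total_tokens": 18},
}


def completion_text(output, prompt=""):
    """Return the text of the first choice, dropping an echoed prompt if present."""
    text = output["choices"][0]["text"]
    # With echo=True the prompt is repeated at the start of the text.
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


answer = completion_text(sample_output, prompt="Q: Name the planets in the solar system? A: ")
print(answer)
print(sample_output["usage"]["total_tokens"])
```

With a real model, `sample_output` would simply be the value returned by `llm(...)` or `llm.create_completion(...)`.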

Pulling models from Hugging Face Hub

You can download Llama models in gguf format directly from Hugging Face using the from_pretrained method. You'll need to install the huggingface-hub package to use this feature (pip install huggingface-hub).

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="lmstudio-community/Qwen3.5-0.8B-GGUF",
    filename="*Q8_0.gguf",
    verbose=False,
)

By default, from_pretrained downloads the model to the Hugging Face cache directory. You can then manage the installed model files with the hf tool.

Chat Completion

The high-level API also provides a simple interface for chat completion.

Chat completion requires that the model knows how to format the messages into a single prompt. The Llama class does this using a pre-registered chat format (e.g. chatml, llama-2, gemma) or a custom chat handler object that you provide.

The model will format the messages into a single prompt using the following order of precedence:

1. Use the chat_handler if provided
2. Use the chat_format if provided
3. Use the tokenizer.chat_template from the gguf model's metadata (should work for most new models, older models may not have this)
4. Otherwise, fall back to the llama-2 chat format

Set verbose=True to see the selected chat format.
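
The order of precedence above can be sketched as a small selection function. This is only an illustration of the documented rules: `select_chat_format` and its parameter names are hypothetical and do not reflect how the Llama class is implemented internally.

```python
def select_chat_format(chat_handler=None, chat_format=None, gguf_chat_template=None):
    """Return a (source, value) pair following the documented order of precedence."""
    if chat_handler is not None:
        # 1. An explicit handler object always wins.
        return ("chat_handler", chat_handler)
    if chat_format is not None:
        # 2. Next, a named pre-registered format such as "chatml".
        return ("chat_format", chat_format)
    if gguf_chat_template is not None:
        # 3. Then the template stored in the gguf metadata, if the model has one.
        return ("tokenizer.chat_template", gguf_chat_template)
    # 4. Otherwise fall back to llama-2.
    return ("fallback", "llama-2")


print(select_chat_format(chat_format="chatml", gguf_chat_template="<template>"))
print(select_chat_format())
```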

from llama_cpp import Llama

llm = Llama(
    model_path="path/to/llama-2/llama-model.gguf",
    chat_format="llama-2",
)
llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": "Describe this image in detail please.",
        },
    ]
)

Chat completion is available through the create_chat_completion method of the Llama class.

For OpenAI API v1 compatibility, you can use the create_chat_completion_openai_v1 method, which returns pydantic models instead of dicts.
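
Since create_chat_completion returns a dict in the OpenAI chat format, a multi-turn conversation can be kept by appending each reply to the message list. The sketch below assumes that response shape and uses a hand-written sample response rather than real model output; `append_reply` is a hypothetical helper.

```python
# Hand-written sample in the OpenAI chat completion shape (not real model output).
sample_response = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello! How can I help?"},
            "finish_reason": "stop",
        }
    ],
}


def append_reply(messages, response):
    """Return a new message list with the assistant's reply added at the end."""
    message = response["choices"][0]["message"]
    return messages + [{"role": message["role"], "content": message["content"]}]


history = [{"role": "user", "content": "Hi"}]
history = append_reply(history, sample_response)
print(history[-1]["content"])
```

The updated `history` can then be passed as `messages` in the next call.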

JSON and JSON Schema Mode

To constrain chat responses to only valid JSON or a specific JSON Schema use the response_format argument in create_chat_completion.

JSON Mode

The following example will constrain the response to valid JSON strings only.

from llama_cpp import Llama

llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
    },
    temperature=0.7,
)

JSON Schema Mode

To constrain the response further to a specific JSON Schema add the schema to the schema property of the response_format argument.

from llama_cpp import Llama

llm = Llama(model_path="path/to/model.gguf", chat_format="chatml")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant that outputs in JSON.",
        },
        {"role": "user", "content": "Who won the world series in 2020"},
    ],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {"team_name": {"type": "string"}},
            "required": ["team_name"],
        },
    },
    temperature=0.7,
)
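
The constrained reply still arrives as a string in the message content, so it needs to be parsed before use. A minimal sketch, assuming the reply text has already been pulled out of the response: `sample_content` is a hand-written stand-in, `parse_reply` is a hypothetical helper, and the check only covers the `required` keys rather than full JSON Schema validation.

```python
import json

schema = {
    "type": "object",
    "properties": {"team_name": {"type": "string"}},
    "required": ["team_name"],
}

# Hand-written stand-in for response["choices"][0]["message"]["content"].
sample_content = '{"team_name": "Los Angeles Dodgers"}'


def parse_reply(content, schema):
    """Parse the JSON reply and confirm that every required key is present."""
    data = json.loads(content)
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return data


reply = parse_reply(sample_content, schema)
print(reply["team_name"])
```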

Function Calling

The high-level API supports OpenAI compatible function and tool calling. This is possible through the functionary pre-trained models chat format or through the generic chatml-function-calling chat format.

from llama_cpp import Llama

llm = Llama(model_path="path/to/chatml/llama-model.gguf", chat_format="chatml-function-calling")
llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant calls functions with appropriate input when necessary",
        },
        {
            "role": "user",
            "content": "Extract Jason is 25 years old",
        },
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "UserDetail",
            "parameters": {
                "type": "object",
                "title": "UserDetail",
                "properties": {
                    "name": {
                        "title": "Name",
                        "type": "string",
                    },
                    "age": {
                        "title": "Age",
                        "type": "integer",
                    },
                },
                "required": ["name", "age"],
            },
        },
    }],
    tool_choice={
        "type": "function",
        "function": {
            "name": "UserDetail",
        },
    },
)
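
Once the model has chosen a tool, the caller still has to run it. The sketch below assumes the reply follows the OpenAI tool-calling shape, with a `tool_calls` list whose `arguments` field is a JSON string. The sample message is hand-written, and `user_detail` and `dispatch_tool_calls` are hypothetical names introduced for illustration.

```python
import json

# Hand-written sample of an assistant message in the OpenAI tool-calling shape.
sample_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "UserDetail",
                "arguments": '{"name": "Jason", "age": 25}',
            },
        }
    ],
}


def user_detail(name, age):
    """Example handler for the UserDetail tool."""
    return f"{name} is {age} years old"


def dispatch_tool_calls(message, handlers):
    """Decode each tool call's JSON arguments and invoke the matching handler."""
    results = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        arguments = json.loads(function["arguments"])
        results.append(handlers[function["name"]](**arguments))
    return results


results = dispatch_tool_calls(sample_message, {"UserDetail": user_detail})
print(results)
```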

Multi-modal Models

llama-cpp-python supports multi-modal models such as llava1.5, which allow the language model to read information from both text and images.

Below are the supported multi-modal models and their respective chat handlers (Python API) and chat formats (Server API).

Model	`LlamaChatHandler`	`chat_format`
llava-v1.5-7b	`Llava15ChatHandler`	`llava-1-5`
llava-v1.5-13b	`Llava15ChatHandler`	`llava-1-5`
llava-v1.6-34b	`Llava16ChatHandler`	`llava-1-6`
moondream2	`MoondreamChatHandler`	`moondream2`
nanollava	`NanoLlavaChatHandler`	`nanollava`
llama-3-vision-alpha	`Llama3VisionAlphaChatHandler`	`llama-3-vision-alpha`
minicpm-v-2.6	`MiniCPMv26ChatHandler`	`minicpm-v-2.6`
qwen2.5-vl	`Qwen25VLChatHandler`	`qwen2.5-vl`
gemma-4	`Gemma4ChatHandler`	`gemma4`
GGUF models with an mtmd projector and embedded chat template	`MTMDChatHandler`	`mtmd`

You'll need to use a custom chat handler to load the clip model and process the chat messages and images.

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="path/to/llava/mmproj.bin")
llm = Llama(
    model_path="./path/to/llava/llama-model.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
)
llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an assistant who perfectly describes images."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}},
            ],
        },
    ]
)
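
For an image stored on disk rather than at a public URL, one option is to embed it as a base64 data URI in the `image_url` field. This is a sketch under the assumption that your chat handler accepts data URIs (worth verifying against the documentation for your version). `image_to_data_uri` is a hypothetical helper, and the demo file is a few placeholder bytes rather than a real image.

```python
import base64
import mimetypes
import os
import tempfile


def image_to_data_uri(path):
    """Encode a local file as a data URI, guessing the MIME type from the file name."""
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "application/octet-stream"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


# Demo with a placeholder file that holds only the 8-byte PNG signature.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.png")
    with open(demo_path, "wb") as f:
        f.write(b"\x89PNG\r\n\x1a\n")
    uri = image_to_data_uri(demo_path)

print(uri.split(",")[0])
```

The resulting string would take the place of the URL, for example `{"type": "image_url", "image_url": {"url": uri}}`.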

You can also pull the model from the Hugging Face Hub using the from_pretrained method.

from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

chat_handler = MoondreamChatHandler.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*mmproj*",
)

llm = Llama.from_pretrained(
    repo_id="vikhyatk/moondream2",
    filename="*text-model*",
    chat_handler=chat_handler,
    n_ctx=2048,  # n_ctx should be increased to accommodate the image embedding
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}},
            ],
        }
    ]
)
# Chat completions return the reply under "message", not "text"
print(response["choices"][0]["message"]["content"])

Note: Multi-modal models also support tool calling and JSON mode.

Speculative Decoding

llama-cpp-python supports speculative decoding which allows the model to generate completions based on a draft model.

The fastest way to use speculative decoding is through the LlamaPromptLookupDecoding class.

Just pass this as a draft model to the Llama class during initialization.

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    # num_pred_tokens is the number of tokens to predict; 10 is the default and
    # generally good for GPU, 2 performs better for CPU-only machines.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),
)
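
To give a feel for what a prompt-lookup draft model does, the sketch below proposes draft tokens by finding an earlier occurrence of the sequence's trailing n-gram and copying whatever followed it. It is a simplified illustration of the general idea, not the code used by LlamaPromptLookupDecoding; the function name and the `max_ngram_size` parameter are assumptions made for this example.

```python
def prompt_lookup_draft(tokens, max_ngram_size=2, num_pred_tokens=10):
    """Propose draft tokens by matching the trailing n-gram against earlier tokens."""
    for n in range(max_ngram_size, 0, -1):  # prefer longer matches
        if len(tokens) <= n:
            continue
        tail = tokens[-n:]
        # Scan backwards so the most recent earlier match is used first.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == tail:
                follow = tokens[start + n:start + n + num_pred_tokens]
                if follow:
                    return follow
    return []  # no match: nothing to draft


draft = prompt_lookup_draft([1, 2, 3, 4, 1, 2], num_pred_tokens=2)
print(draft)
```

The main model then verifies the drafted tokens, so a poor guess costs some wasted work but does not change the output.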

Embeddings

To generate text embeddings use create_embedding or embed. Note that you must pass embedding=True to the constructor upon model creation for these to work properly.

import llama_cpp

llm = llama_cpp.Llama(model_path="path/to/model.gguf", embedding=True)

embeddings = llm.create_embedding("Hello, world!")

# or create multiple embeddings at once

embeddings = llm.create_embedding(["Hello, world!", "Goodbye, world!"])

There are two primary notions of embeddings in a Transformer-style model: token level and sequence level. Sequence level embeddings are produced by "pooling" token level embeddings together, usually by averaging them or using the first token.

Models that are explicitly geared towards embeddings will usually return sequence level embeddings by default, one for each input string. Non-embedding models such as those designed for text generation will typically return only token level embeddings, one for each token in each sequence. Thus the dimensionality of the return type will be one higher for token level embeddings.

It is possible to control pooling behavior in some cases using the pooling_type flag on model creation. You can ensure token level embeddings from any model using LLAMA_POOLING_TYPE_NONE. The reverse, getting a generation-oriented model to yield sequence level embeddings, is currently not possible, but you can always do the pooling manually.
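
Manual pooling can be as simple as averaging the per-token vectors. A minimal sketch of mean pooling in plain Python, assuming the token level embeddings for one input are available as a list of equal-length float lists; the values below are made up for illustration and `mean_pool` is a hypothetical helper.

```python
def mean_pool(token_embeddings):
    """Average a list of per-token vectors into one sequence-level vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in token_embeddings) / count for i in range(dim)]


# Three tokens with two dimensions each (made-up values).
token_level = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(token_level)
print(pooled)
```

Taking the first token's vector instead would be the other common choice mentioned above.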

Adjusting the Context Window

The context window of the Llama models determines the maximum number of tokens that can be processed at once. By default, this is set to 512 tokens, but can be adjusted based on your requirements.

For instance, if you want to work with larger contexts, you can expand the context window by setting the n_ctx parameter when initializing the Llama object:

llm = Llama(model_path="./models/7B/llama-model.gguf", n_ctx=2048)
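
The prompt and the generated tokens share the same window, so it can help to check the budget before generating. The helper below is a hypothetical sketch of that arithmetic. In practice the prompt length might come from something like `len(llm.tokenize(prompt.encode("utf-8")))`, which is an assumption to check against your installed version.

```python
def max_new_tokens(n_prompt_tokens, n_ctx=512, requested=None):
    """Largest completion length that still fits in the context window."""
    remaining = n_ctx - n_prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt does not fit in the context window")
    # Cap the requested length at whatever room is left.
    return remaining if requested is None else min(requested, remaining)


print(max_new_tokens(14))
print(max_new_tokens(500, requested=32))
print(max_new_tokens(100, n_ctx=2048, requested=32))
```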

OpenAI Compatible Web Server

llama-cpp-python offers a web server which aims to act as a drop-in replacement for the OpenAI API. This allows you to use llama.cpp compatible models with any OpenAI compatible client (language libraries, services, etc).

To install the server package and get started:

pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf

As with the backends described in the Installation Configuration section above, you can also install with GPU (CUDA) support like this:

CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35

Navigate to http://localhost:8000/docs to see the OpenAPI documentation.

To bind to 0.0.0.0 to enable remote connections, use python3 -m llama_cpp.server --host 0.0.0.0. Similarly, to change the port (default is 8000), use --port.
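
Once the server is running, any HTTP client can talk to it. The sketch below builds a chat request with only the standard library, assuming the server exposes the OpenAI-style `/v1/chat/completions` route on the default host and port. The `model` value is a placeholder, and both function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(messages, base_url="http://localhost:8000", model="local-model"):
    """Build (but do not send) a POST request for the chat completions route."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON reply; requires a running server."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_chat_request([{"role": "user", "content": "Hello"}])
print(request.full_url)
```

Calling `send(request)` against a live server should return a dict in the same chat completion format as the Python API.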

You probably also want to set the prompt format. For chatml, use

python3 -m llama_cpp.server --model models/7B/llama-model.gguf --chat_format chatml

That will format the prompt the way the model expects it. You can find the prompt format in the model card. For possible options, see llama_cpp/llama_chat_format.py and look for lines starting with "@register_chat_format".

If you have huggingface-hub installed, you can also use the --hf_model_repo_id flag to load a model from the Hugging Face Hub.

python3 -m llama_cpp.server --hf_model_repo_id lmstudio-community/Qwen3.5-0.8B-GGUF --model '*Q8_0.gguf'

Web Server Features

Docker image

A Docker image is available on GHCR. To run the server:

docker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest

Docker on termux (requires root) is currently the only known way to run this on phones; see the termux support issue.

Low-level API

API Reference

The low-level API is a direct ctypes binding to the C API provided by llama.cpp. The entire low-level API can be found in llama_cpp/llama_cpp.py and directly mirrors the C API in llama.h.

Below is a short example demonstrating how to use the low-level API to tokenize a prompt:

import llama_cpp
import ctypes

llama_cpp.llama_backend_init()  # Must be called once at the start of each program
model_params = llama_cpp.llama_model_default_params()
ctx_params = llama_cpp.llama_context_default_params()
prompt = b"Q: Name the planets in the solar system? A: "
# use bytes for char * params
model = llama_cpp.llama_model_load_from_file(b"./models/7b/llama-model.gguf", model_params)
ctx = llama_cpp.llama_init_from_model(model, ctx_params)
vocab = llama_cpp.llama_model_get_vocab(model)
max_tokens = ctx_params.n_ctx
# use ctypes arrays for array params
tokens = (llama_cpp.llama_token * int(max_tokens))()
n_tokens = llama_cpp.llama_tokenize(vocab, prompt, len(prompt), tokens, max_tokens, True, False)
llama_cpp.llama_free(ctx)
llama_cpp.llama_model_free(model)

Check out the examples folder for more examples of using the low-level API.
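
The `(llama_cpp.llama_token * n)()` idiom in the example above allocates a fixed-size ctypes array that the C function fills in place. The sketch below shows how such a buffer can be read back as a Python list, using `ctypes.c_int32` as a stand-in for `llama_cpp.llama_token` (an assumption) and made-up token ids. It also assumes a negative count means the buffer was too small, which should be verified against llama.h for your version.

```python
import ctypes

TokenType = ctypes.c_int32  # assumption: stands in for llama_cpp.llama_token


def tokens_from_buffer(buffer, n_tokens):
    """Copy the first n_tokens entries of a ctypes array into a Python list."""
    if n_tokens < 0:
        # Assumed convention: the magnitude is the number of tokens needed.
        raise ValueError(f"buffer too small, {-n_tokens} tokens needed")
    return list(buffer[:n_tokens])


# Allocate room for 8 tokens and fill the first 3 by hand (made-up ids),
# standing in for what the C tokenizer would write.
buffer = (TokenType * 8)()
for i, token in enumerate([1, 450, 9914]):
    buffer[i] = token

tokens = tokens_from_buffer(buffer, 3)
print(tokens)
```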

Documentation

Documentation is available via https://llama-cpp-python.readthedocs.io/. If you find any issues with the documentation, please open an issue or submit a PR.

Development

This package is under active development and I welcome any contributions. See CONTRIBUTING.md for contribution workflow, PR title, changelog, testing, and style guidelines.

To get started, clone the repository and install the package in editable / development mode:

git clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python

# Upgrade pip (required for editable mode)
pip install --upgrade pip

# Install with pip
pip install -e .

# install development tooling (tests, docs, ruff)
pip install -e '.[dev]'

# if you want to use the fastapi / openapi server
pip install -e '.[server]'

# to install all optional dependencies
pip install -e '.[all]'

# to clear the local build cache
make clean

Now try running the tests

pytest

And check formatting / linting before opening a PR:

python -m ruff check llama_cpp tests
python -m ruff format --check llama_cpp tests

# or use the Makefile targets
make lint
make format

There's a Makefile available with useful targets. A typical workflow would look like this:

make build
make test

You can also test out specific commits of llama.cpp by checking out the desired commit in the vendor/llama.cpp submodule and then running make clean and pip install -e . again. Any changes in the llama.h API will require changes to the llama_cpp/llama_cpp.py file to match the new API (additional changes may be required elsewhere).

FAQ

Are there pre-built binaries / binary wheels available?

The recommended installation method is to install from source as described above. The reason for this is that llama.cpp is built with compiler optimizations that are specific to your system. Using pre-built binaries would require disabling these optimizations or supporting a large number of pre-built binaries for each platform.

That being said there are some pre-built binaries available through the Releases as well as some community provided wheels.

In the future, I would like to provide pre-built binaries and wheels for common platforms, and I'm happy to accept any useful contributions in this area. This is currently being tracked in #741.

How does this compare to other Python bindings of `llama.cpp`?

I originally wrote this package for my own use with two goals in mind:

- Provide a simple process to install llama.cpp and access the full C API in llama.h from Python
- Provide a high-level Python API that can be used as a drop-in replacement for the OpenAI API so existing apps can be easily ported to use llama.cpp

Any contributions and changes to this package will be made with these goals in mind.

License

This project is licensed under the terms of the MIT license.

Release history

Version	Released
0.3.30 (this version)	Jun 16, 2026
0.3.29	Jun 13, 2026
0.3.28	Jun 7, 2026
0.3.27	Jun 7, 2026
0.3.26	Jun 5, 2026
0.3.25	Jun 2, 2026
0.3.24	Jun 1, 2026
0.3.23	May 11, 2026
0.3.22	May 2, 2026
0.3.21	Apr 27, 2026
0.3.20	Apr 3, 2026
0.3.19	Mar 25, 2026
0.3.18	Mar 24, 2026
0.3.17	Mar 23, 2026
0.3.16	Aug 15, 2025
0.3.15	Aug 7, 2025
0.3.14	Jul 18, 2025
0.3.13	Jul 15, 2025
0.3.12	Jul 6, 2025
0.3.11	Jul 5, 2025
0.3.10	Jul 3, 2025
0.3.9	May 8, 2025
0.3.8	Mar 12, 2025
0.3.7	Jan 29, 2025
0.3.6	Jan 8, 2025
0.3.5	Dec 10, 2024
0.3.4	Dec 9, 2024
0.3.3	Dec 9, 2024
0.3.2	Nov 16, 2024
0.3.1	Sep 29, 2024
0.3.0	Sep 25, 2024
0.2.90	Aug 29, 2024
0.2.89	Aug 21, 2024
0.2.88	Aug 13, 2024
0.2.87	Aug 7, 2024
0.2.86	Aug 7, 2024
0.2.85	Jul 31, 2024
0.2.84	Jul 28, 2024
0.2.83	Jul 22, 2024
0.2.82	Jul 9, 2024
0.2.81	Jul 2, 2024
0.2.80	Jul 2, 2024
0.2.79	Jun 19, 2024
0.2.78	Jun 10, 2024
0.2.77	Jun 4, 2024
0.2.76	May 24, 2024
0.2.75	May 16, 2024
0.2.74	May 12, 2024
0.2.73	May 10, 2024
0.2.72	May 10, 2024
0.2.71	May 9, 2024
0.2.70	May 8, 2024
0.2.69	May 2, 2024
0.2.68	Apr 30, 2024
0.2.67	Apr 30, 2024
0.2.66	Apr 30, 2024
0.2.65	Apr 26, 2024
0.2.64	Apr 23, 2024
0.2.63	Apr 20, 2024
0.2.62	Apr 18, 2024
0.2.61	Apr 10, 2024
0.2.60	Apr 6, 2024
0.2.59	Apr 3, 2024
0.2.58	Apr 1, 2024
0.2.57	Mar 18, 2024
0.2.56	Mar 9, 2024
0.2.55	Mar 3, 2024
0.2.54	Mar 1, 2024
0.2.53	Feb 28, 2024
0.2.52	Feb 26, 2024
0.2.51	Feb 26, 2024
0.2.50	Feb 23, 2024
0.2.49	Feb 23, 2024
0.2.48	Feb 23, 2024
0.2.47	Feb 22, 2024
0.2.46	Feb 21, 2024
0.2.45	Feb 21, 2024
0.2.44	Feb 16, 2024
0.2.43	Feb 14, 2024
0.2.42	Feb 13, 2024
0.2.41	Feb 13, 2024
0.2.40	Feb 12, 2024
0.2.39	Feb 6, 2024
0.2.38	Jan 31, 2024
0.2.37	Jan 30, 2024
0.2.36	Jan 29, 2024
0.2.35	Jan 29, 2024
0.2.34	Jan 27, 2024
0.2.33	Jan 25, 2024
0.2.32	Jan 22, 2024
0.2.31	Jan 19, 2024
0.2.30	Jan 19, 2024
0.2.29	Jan 15, 2024
0.2.28	Jan 10, 2024
0.2.27	Jan 4, 2024
0.2.26	Dec 27, 2023
0.2.25	Dec 22, 2023
0.2.24	Dec 18, 2023
0.2.23	Dec 14, 2023
0.2.22	Dec 11, 2023
0.2.20	Nov 28, 2023
0.2.19	Nov 21, 2023
0.2.18	Nov 14, 2023
0.2.17	Nov 10, 2023
0.2.16	Nov 10, 2023
0.2.15	Nov 8, 2023
0.2.14	Nov 6, 2023
0.2.13	Nov 2, 2023
0.2.12	Nov 1, 2023
0.2.11	Sep 30, 2023
0.2.10	Sep 30, 2023
0.2.9	Sep 30, 2023
0.2.8 (yanked: broken build)	Sep 30, 2023
0.2.7	Sep 25, 2023
0.2.6	Sep 15, 2023
0.2.5	Sep 14, 2023
0.2.4	Sep 14, 2023
0.2.3	Sep 13, 2023
0.2.2	Sep 13, 2023
0.2.1	Sep 13, 2023
0.2.0	Sep 12, 2023
0.1.85	Sep 12, 2023
0.1.84	Sep 9, 2023
0.1.83	Aug 29, 2023
0.1.82	Aug 28, 2023
0.1.81	Aug 27, 2023
0.1.80	Aug 27, 2023
0.1.79	Aug 25, 2023
0.1.78	Aug 18, 2023
0.1.77	Jul 24, 2023
0.1.76	Jul 24, 2023
0.1.74	Jul 20, 2023
0.1.73	Jul 18, 2023
0.1.72	Jul 15, 2023
0.1.71	Jul 14, 2023
0.1.70	Jul 9, 2023
0.1.69	Jul 9, 2023
0.1.68	Jul 5, 2023
0.1.67	Jun 29, 2023
0.1.66	Jun 26, 2023
0.1.65	Jun 20, 2023
0.1.64	Jun 18, 2023
0.1.63	Jun 15, 2023
0.1.62	Jun 10, 2023
0.1.61	Jun 10, 2023
0.1.59	Jun 8, 2023
0.1.57	Jun 1, 2023
0.1.56	May 30, 2023
0.1.55	May 26, 2023
0.1.54	May 23, 2023
0.1.53	May 21, 2023
0.1.52	May 20, 2023
0.1.51	May 19, 2023
0.1.50	May 14, 2023
0.1.49	May 12, 2023
0.1.48	May 8, 2023
0.1.47	May 8, 2023
0.1.46	May 8, 2023
0.1.45	May 8, 2023
0.1.44	May 7, 2023
0.1.43	May 5, 2023
0.1.42	May 4, 2023
0.1.41	May 2, 2023
0.1.40	May 1, 2023
0.1.39	Apr 28, 2023
0.1.38	Apr 25, 2023
0.1.37	Apr 25, 2023
0.1.36	Apr 22, 2023
0.1.35	Apr 20, 2023
0.1.34	Apr 16, 2023
0.1.33	Apr 13, 2023
0.1.32	Apr 10, 2023
0.1.31	Apr 10, 2023
0.1.30	Apr 10, 2023
0.1.29	Apr 10, 2023
0.1.28	Apr 10, 2023
0.1.27	Apr 8, 2023
0.1.26	Apr 8, 2023
0.1.25	Apr 7, 2023
0.1.24	Apr 7, 2023
0.1.23	Apr 5, 2023
0.1.22	Apr 5, 2023
0.1.21	Apr 5, 2023
0.1.20	Apr 4, 2023
0.1.19	Apr 4, 2023
0.1.18	Apr 3, 2023
0.1.17	Apr 3, 2023
0.1.16	Apr 2, 2023
0.1.15	Apr 2, 2023
0.1.14	Apr 2, 2023
0.1.13	Apr 1, 2023
0.1.12	Apr 1, 2023
0.1.11	Apr 1, 2023
0.1.10	Mar 29, 2023
0.1.9	Mar 28, 2023
0.1.8	Mar 28, 2023
0.1.7	Mar 26, 2023
0.1.6	Mar 25, 2023
0.1.5	Mar 25, 2023
0.1.4	Mar 24, 2023
0.1.3	Mar 24, 2023
0.1.2	Mar 24, 2023
0.1.1	Mar 23, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llama_cpp_python-0.3.30.tar.gz (70.2 MB)

Uploaded Jun 16, 2026 Source

File details

Details for the file llama_cpp_python-0.3.30.tar.gz.

File metadata

Download URL: llama_cpp_python-0.3.30.tar.gz
Upload date: Jun 16, 2026
Size: 70.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llama_cpp_python-0.3.30.tar.gz
Algorithm	Hash digest
SHA256	`798c9b42652d2e0bff5fe81e7e762089f425a99e67f66ffe5ae156957876e0d1`
MD5	`4b3066f77c01e2b29229094f209552db`
BLAKE2b-256	`1e452bca210df9ab2403efbd20316b6f53a2c322fd71a313d8bf090e2ee148e4`

URL: https://pypi.org/project/llama-cpp-python/
