URL: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF

TheBloke's LLM work is generously supported by a grant from Andreessen Horowitz (a16z)


Mixtral 8X7B v0.1 - GGUF

Description

This repo contains GGUF format model files for Mistral AI's Mixtral 8X7B v0.1.

About GGUF

GGUF is a new format introduced by the llama.cpp team on August 21st 2023. It is a replacement for GGML, which is no longer supported by llama.cpp.

Mixtral GGUF

Support for Mixtral was merged into llama.cpp on December 13th 2023.

These Mixtral GGUFs are known to work in:

  • llama.cpp as of December 13th
  • KoboldCpp 1.52 and later
  • LM Studio 0.2.9 and later
  • llama-cpp-python 0.2.23 and later

Other clients/libraries, not listed above, may not yet work.

Repositories available

Prompt template: None

{prompt}

Compatibility

These Mixtral GGUFs are compatible with llama.cpp from December 13th onwards. Other clients/libraries may not work yet.

Explanation of quantisation methods

Provided files

Name | Quant method | Bits | Size | Max RAM required | Use case
mixtral-8x7b-v0.1.Q2_K.gguf | Q2_K | 2 | 15.64 GB | 18.14 GB | smallest, significant quality loss - not recommended for most purposes
mixtral-8x7b-v0.1.Q3_K_M.gguf | Q3_K_M | 3 | 20.36 GB | 22.86 GB | very small, high quality loss
mixtral-8x7b-v0.1.Q4_0.gguf | Q4_0 | 4 | 26.44 GB | 28.94 GB | legacy; small, very high quality loss - prefer using Q3_K_M
mixtral-8x7b-v0.1.Q4_K_M.gguf | Q4_K_M | 4 | 26.44 GB | 28.94 GB | medium, balanced quality - recommended
mixtral-8x7b-v0.1.Q5_0.gguf | Q5_0 | 5 | 32.23 GB | 34.73 GB | legacy; medium, balanced quality - prefer using Q4_K_M
mixtral-8x7b-v0.1.Q5_K_M.gguf | Q5_K_M | 5 | 32.23 GB | 34.73 GB | large, very low quality loss - recommended
mixtral-8x7b-v0.1.Q6_K.gguf | Q6_K | 6 | 38.38 GB | 40.88 GB | very large, extremely low quality loss
mixtral-8x7b-v0.1.Q8_0.gguf | Q8_0 | 8 | 49.62 GB | 52.12 GB | very large, extremely low quality loss - not recommended

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
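
In every row of the table, the "Max RAM required" figure is the file size plus 2.5 GB. The sketch below uses that pattern to shortlist the quants that fit into a given amount of RAM with no GPU offloading. The 2.5 GB overhead is an assumption inferred from the table, and the function names are hypothetical helpers, not part of llama.cpp.

```python
# File sizes in GB, copied from the "Provided files" table.
QUANT_SIZES_GB = {
    "Q2_K": 15.64,
    "Q3_K_M": 20.36,
    "Q4_0": 26.44,
    "Q4_K_M": 26.44,
    "Q5_0": 32.23,
    "Q5_K_M": 32.23,
    "Q6_K": 38.38,
    "Q8_0": 49.62,
}

# Assumption: overhead inferred from the table (Max RAM minus Size is 2.5 GB in each row).
OVERHEAD_GB = 2.5


def max_ram_gb(quant):
    """Estimated RAM needed to run a quant entirely on the CPU."""
    return round(QUANT_SIZES_GB[quant] + OVERHEAD_GB, 2)


def quants_that_fit(ram_gb):
    """Return the quant methods whose estimated RAM need fits in ram_gb."""
    return [quant for quant in QUANT_SIZES_GB if max_ram_gb(quant) <= ram_gb]


if __name__ == "__main__":
    for ram in (16, 32, 64):
        print(ram, "GB:", quants_that_fit(ram))
```

Offloading layers to the GPU moves part of this requirement into VRAM, so the estimate is an upper bound for CPU-only use.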

How to download GGUF files

Note for manual downloaders: You almost never want to clone the entire repo! Multiple different quantisation formats are provided, and most users only want to pick and download a single file.

The following clients/libraries will automatically download models for you, providing a list of available models to choose from:

  • LM Studio
  • LoLLMS Web UI
  • Faraday.dev

In text-generation-webui

Under Download Model, you can enter the model repo: TheBloke/Mixtral-8x7B-v0.1-GGUF and below it, a specific filename to download, such as: mixtral-8x7b-v0.1.Q4_K_M.gguf.

Then click Download.

On the command line, including multiple files at once

I recommend using the huggingface-hub Python library:

pip3 install huggingface-hub

Then you can download any individual model file to the current directory, at high speed, with a command like this:

huggingface-cli download TheBloke/Mixtral-8x7B-v0.1-GGUF mixtral-8x7b-v0.1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
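
The same download can be scripted from Python with the huggingface-hub library. This is a minimal sketch: `quant_filename`, `download_quant` and `download_matching` are hypothetical helper names, and the glob pattern is only an example. `hf_hub_download` and `snapshot_download` are the library calls doing the work.

```python
REPO_ID = "TheBloke/Mixtral-8x7B-v0.1-GGUF"


def quant_filename(quant):
    """Build the file name this repo uses for a given quant method."""
    return f"mixtral-8x7b-v0.1.{quant}.gguf"


def download_quant(quant="Q4_K_M", local_dir="."):
    """Download a single GGUF file and return its local path."""
    # Imported lazily so quant_filename works without huggingface-hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=quant_filename(quant),
        local_dir=local_dir,
    )


def download_matching(pattern="*Q4_K*gguf", local_dir="."):
    """Download every file in the repo whose name matches a glob pattern."""
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[pattern],
        local_dir=local_dir,
    )


if __name__ == "__main__":
    # Roughly 26 GB for Q4_K_M, so check free disk space first.
    print(download_quant("Q4_K_M"))
```

Passing a pattern to `download_matching` is one way to fetch several files at once without cloning the whole repo.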

Example llama.cpp command

Make sure you are using llama.cpp from commit d0cee0d or later.

./main -ngl 35 -m mixtral-8x7b-v0.1.Q4_K_M.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "{prompt}"

Change -ngl 35 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.

Change -c 2048 to the desired sequence length. For extended sequence models - eg 8K, 16K, 32K - the necessary RoPE scaling parameters are read from the GGUF file and set by llama.cpp automatically. Note that longer sequence lengths require much more resources, so you may need to reduce this value.

If you want to have a chat-style conversation, replace the -p <PROMPT> argument with -i -ins.

For other parameters and how to use them, please refer to the llama.cpp documentation.
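
The rules above for adjusting the command can be captured in a small helper that assembles the argument list. This is only a sketch: `build_main_command` is a hypothetical name, and it uses only the flags shown in the example command.

```python
import shlex


def build_main_command(model_path, prompt=None, n_gpu_layers=0, n_ctx=2048,
                       interactive=False):
    """Assemble the argument list for llama.cpp's ./main binary."""
    cmd = ["./main"]
    # Only pass -ngl when layers are actually offloaded to a GPU.
    if n_gpu_layers > 0:
        cmd += ["-ngl", str(n_gpu_layers)]
    cmd += [
        "-m", model_path,
        "--color",
        "-c", str(n_ctx),
        "--temp", "0.7",
        "--repeat_penalty", "1.1",
        "-n", "-1",
    ]
    if interactive:
        # Chat-style conversation replaces the -p argument.
        cmd += ["-i", "-ins"]
    else:
        if prompt is None:
            raise ValueError("prompt is required unless interactive=True")
        cmd += ["-p", prompt]
    return cmd


if __name__ == "__main__":
    args = build_main_command("mixtral-8x7b-v0.1.Q4_K_M.gguf",
                              prompt="Hello", n_gpu_layers=35)
    print(shlex.join(args))
```

The returned list can be handed to `subprocess.run` as-is, which avoids shell quoting problems with prompts that contain spaces or quotes.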

How to run in text-generation-webui

Note that text-generation-webui may not yet be compatible with Mixtral GGUFs. Please check compatibility first.

Further instructions can be found in the text-generation-webui documentation, here: text-generation-webui/docs/04 ‐ Model Tab.md.

How to run from Python code

You can use GGUF models from Python using llama-cpp-python version 0.2.23 or later.

How to load this model in Python code, using llama-cpp-python

For full documentation, please see: llama-cpp-python docs.

First install the package

Run one of the following commands, according to your system:

# Base llama-cpp-python with no GPU acceleration
pip install llama-cpp-python
# With NVidia CUDA acceleration
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
# Or with OpenBLAS acceleration
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
# Or with CLBLast acceleration
CMAKE_ARGS="-DLLAMA_CLBLAST=on" pip install llama-cpp-python
# Or with AMD ROCm GPU acceleration (Linux only)
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
# Or with Metal GPU acceleration for macOS systems only
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python

# On Windows, to set the CMAKE_ARGS variable in PowerShell, follow this format; e.g. for NVidia CUDA:
$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
pip install llama-cpp-python

Simple llama-cpp-python example code

from llama_cpp import Llama

# Set n_gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
llm = Llama(
    model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf",  # Download the model file first
    n_ctx=2048,  # The max sequence length to use - note that longer sequence lengths require much more resources
    n_threads=8,  # The number of CPU threads to use, tailor to your system and the resulting performance
    n_gpu_layers=35  # The number of layers to offload to GPU, if you have GPU acceleration available
)

# Simple inference example
output = llm(
    "{prompt}",  # Prompt
    max_tokens=512,  # Generate up to 512 tokens
    stop=["</s>"],  # Example stop token - not necessarily correct for this specific model! Please check before using.
    echo=True  # Whether to echo the prompt
)

# Chat Completion API

llm = Llama(model_path="./mixtral-8x7b-v0.1.Q4_K_M.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a story writing assistant."},
        {
            "role": "user",
            "content": "Write a story about llamas."
        }
    ]
)
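
Both calls above return OpenAI-style dictionaries rather than plain strings. The sketch below shows one way to pull out the generated text. The helper names are hypothetical, and the sample dictionaries are hand-written stand-ins for real responses, trimmed to the fields being read.

```python
def completion_text(output):
    """Extract the generated text from a plain completion response."""
    return output["choices"][0]["text"]


def chat_reply(response):
    """Extract the assistant message from a chat completion response."""
    return response["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Stand-in for the value returned by llm(...); with echo=True the text
    # starts with the prompt itself.
    sample_completion = {
        "choices": [{"text": "Hello my name is Ada", "finish_reason": "stop"}]
    }
    # Stand-in for the value returned by llm.create_chat_completion(...).
    sample_chat = {
        "choices": [
            {"message": {"role": "assistant", "content": "Once upon a time"}}
        ]
    }
    print(completion_text(sample_completion))
    print(chat_reply(sample_chat))
```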

How to use with LangChain

Here are guides on using llama-cpp-python and ctransformers with LangChain:

Discord

For further support, and discussions on these models and AI in general, join us at:

TheBloke AI's Discord server

Thanks, and how to contribute

Thanks to the chirper.ai team!

Thanks to Clay from gpus.llm-utils.org!

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.

If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.

Donors will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Special thanks to: Aemon Algiz.

Patreon special mentions: Michael Levine, 阿明, Trailburnt, Nikolai Manek, John Detwiler, Randy H, Will Dee, Sebastain Graf, NimbleBox.ai, Eugene Pentland, Emad Mostaque, Ai Maven, Jim Angel, Jeff Scroggin, Michael Davis, Manuel Alberto Morcote, Stephen Murray, Robert, Justin Joy, Luke @flexchar, Brandon Frisco, Elijah Stavena, S_X, Dan Guido, Undi ., Komninos Chatzipapas, Shadi, theTransient, Lone Striker, Raven Klaugh, jjj, Cap'n Zoog, Michel-Marie MAUDET (LINAGORA), Matthew Berman, David, Fen Risland, Omer Bin Jawed, Luke Pendergrass, Kalila, OG, Erik Bjäreholt, Rooh Singh, Joseph William Delisle, Dan Lewis, TL, John Villwock, AzureBlack, Brad, Pedro Madruga, Caitlyn Gatomon, K, jinyuan sun, Mano Prime, Alex, Jeffrey Morgan, Alicia Loh, Illia Dulskyi, Chadd, transmissions 11, fincy, Rainer Wilmers, ReadyPlayerEmma, knownsqashed, Mandus, biorpg, Deo Leter, Brandon Phillips, SuperWojo, Sean Connelly, Iucharbius, Jack West, Harry Royden McLaughlin, Nicholas, terasurfer, Vitor Caleffi, Duane Dunston, Johann-Peter Hartmann, David Ziegler, Olakabola, Ken Nordquist, Trenton Dambrowitz, Tom X Nguyen, Vadim, Ajan Kanaga, Leonard Tan, Clay Pascal, Alexandros Triantafyllidis, JM33133, Xule, vamX, ya boyyy, subjectnull, Talal Aujan, Alps Aficionado, wassieverse, Ari Malik, James Bentley, Woland, Spencer Kim, Michael Dempsey, Fred von Graf, Elle, zynix, William Richards, Stanislav Ovsiannikov, Edmond Seymore, Jonathan Leane, Martin Kemka, usrbinkat, Enrico Ros

Thank you to all my generous patrons and donors!

And thank you again to a16z for their generous grant.

Original model card: Mistral AI's Mixtral 8X7B v0.1

Model Card for Mixtral-8x7B

The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts. Mixtral-8x7B outperforms Llama 2 70B on most benchmarks we tested.

For full details of this model please read our release blog post.

Warning

This repo contains weights that are compatible with vLLM serving of the model as well as the Hugging Face transformers library. It is based on the original Mixtral, but the file format and parameter names are different. Please note that the model cannot (yet) be instantiated with HF.

Run the model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id)

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

By default, transformers will load the model in full precision. You may therefore want to reduce the memory required to run the model by using the optimizations offered in the HF ecosystem:

In half-precision

Note: float16 precision only works on GPU devices.
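
A rough sketch of why half precision helps and how it might be requested. It assumes about 47 billion parameters, counts weights only (no activations or cache), and uses the `torch_dtype` keyword accepted by `from_pretrained`. The function names are hypothetical.

```python
def approx_weights_gb(n_params, bytes_per_param):
    """Approximate memory for the weights alone, in GB."""
    return n_params * bytes_per_param / 1e9


def load_half_precision(model_id="mistralai/Mixtral-8x7B-v0.1", device=0):
    """Load the model in float16 and move it to a GPU (assumed device index)."""
    # Imported lazily: torch and transformers are only needed to load the model.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    return model.to(device)


if __name__ == "__main__":
    n_params = 47e9  # Assumption: approximate parameter count.
    print("float32:", approx_weights_gb(n_params, 4), "GB")
    print("float16:", approx_weights_gb(n_params, 2), "GB")
```

Under these assumptions the weights alone drop from about 188 GB to about 94 GB, which still exceeds a single consumer GPU.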

Lower precision (8-bit & 4-bit) using bitsandbytes
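
One possible way to request 8-bit or 4-bit loading is through `BitsAndBytesConfig` in transformers. This sketch assumes the bitsandbytes and accelerate packages are installed and a CUDA GPU is available. `bnb_config_kwargs` and `load_quantized` are hypothetical helper names.

```python
def bnb_config_kwargs(bits):
    """Map a bit width to the matching BitsAndBytesConfig keyword."""
    if bits == 8:
        return {"load_in_8bit": True}
    if bits == 4:
        return {"load_in_4bit": True}
    raise ValueError("bits must be 4 or 8")


def load_quantized(model_id="mistralai/Mixtral-8x7B-v0.1", bits=4):
    """Load the model with bitsandbytes quantisation applied at load time."""
    # Imported lazily: transformers is only needed to load the model.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(**bnb_config_kwargs(bits))
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config,
        device_map="auto",  # Let accelerate place the layers on available devices.
    )


if __name__ == "__main__":
    print(bnb_config_kwargs(4))
```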

Load the model with Flash Attention 2
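
A sketch of requesting Flash Attention 2 at load time. It assumes a transformers release that accepts the `attn_implementation` keyword, the flash-attn package installed, and a supported CUDA GPU. Flash Attention 2 runs in half precision, so float16 is requested here as well. The names below are hypothetical.

```python
# Keyword selecting the attention backend in from_pretrained.
FLASH_ATTENTION_KWARGS = {"attn_implementation": "flash_attention_2"}


def load_with_flash_attention(model_id="mistralai/Mixtral-8x7B-v0.1", device=0):
    """Load the model in float16 with Flash Attention 2 enabled."""
    # Imported lazily: torch and transformers are only needed to load the model.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        **FLASH_ATTENTION_KWARGS,
    )
    return model.to(device)


if __name__ == "__main__":
    print(FLASH_ATTENTION_KWARGS)
```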

Notice

Mixtral-8x7B is a pretrained base model and therefore does not have any moderation mechanisms.

The Mistral AI Team

Albert Jiang, Alexandre Sablayrolles, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Louis Ternon, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed.

Format: GGUF
Model size: 47B params
Architecture: llama