URL: https://huggingface.co/TheBloke/airoboros-7b-gpt4-GGML


Jon Durbin's Airoboros 7B GPT4 GGML

These files are GGML format model files for Jon Durbin's Airoboros 7B GPT4.

GGML files are for CPU + GPU inference using llama.cpp and libraries and UIs which support this format, such as:

Repositories available

Compatibility

Original llama.cpp quant methods: q4_0, q4_1, q5_0, q5_1, q8_0

I have quantized the files for these 'original' quantisation methods using an older version of llama.cpp so that they remain compatible with llama.cpp as of May 19th, commit 2d5db48.

They should be compatible with all current UIs and libraries that use llama.cpp, such as those listed at the top of this README.

New k-quant methods: q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q5_K_M, q6_K

These new quantisation methods are only compatible with llama.cpp as of June 6th, commit 2d43387.

They will NOT be compatible with koboldcpp, text-generation-webui, and other UIs and libraries yet. Support is expected to come over the next few days.
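
As a rough illustration of the split described above, the helper below maps a quant method name to the llama.cpp commit this README associates with it. The function name and the rule "a name containing _K is a k-quant" are assumptions made for this sketch; they are not part of llama.cpp.

```python
# The five 'original' quant methods named in this README.
ORIGINAL_METHODS = {"q4_0", "q4_1", "q5_0", "q5_1", "q8_0"}


def llama_cpp_commit_for(quant_method: str) -> str:
    """Return the short commit hash this README quotes for a quant method."""
    method = quant_method.lower()
    if method in ORIGINAL_METHODS:
        return "2d5db48"  # May 19th commit quoted for the original methods
    if "_k" in method:  # assumption: k-quant names contain "_K"
        return "2d43387"  # June 6th commit quoted for the k-quant methods
    raise ValueError(f"unknown quant method: {quant_method}")


print(llama_cpp_commit_for("q4_0"))
print(llama_cpp_commit_for("q4_K_M"))
```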

Explanation of the new k-quant methods

The new methods available are:

  • GGML_TYPE_Q2_K - "type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw).
  • GGML_TYPE_Q3_K - "type-0" 3-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Scales are quantized with 6 bits. This ends up using 3.4375 bpw.
  • GGML_TYPE_Q4_K - "type-1" 4-bit quantization in super-blocks containing 8 blocks, each block having 32 weights. Scales and mins are quantized with 6 bits. This ends up using 4.5 bpw.
  • GGML_TYPE_Q5_K - "type-1" 5-bit quantization. Same super-block structure as GGML_TYPE_Q4_K, resulting in 5.5 bpw.
  • GGML_TYPE_Q6_K - "type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw.
  • GGML_TYPE_Q8_K - "type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference from the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type.
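
The bpw figures in the list above can be reproduced with some simple arithmetic over one super-block of 256 weights. The sketch below is illustrative only: the function name is hypothetical, and the number of 16-bit super-block scale values per type (`fp16_values`) is not stated in the list, so the counts used here are assumptions chosen so that the arithmetic matches the quoted figures.

```python
def bits_per_weight(weight_bits, blocks, weights_per_block,
                    scale_bits, has_mins, fp16_values):
    """Average storage cost per weight for one super-block."""
    n_weights = blocks * weights_per_block
    # "type-1" formats store a scale and a min per block, "type-0" only a scale.
    per_block_bits = scale_bits * (2 if has_mins else 1)
    total_bits = (n_weights * weight_bits      # the quantized weights
                  + blocks * per_block_bits    # per-block scales (and mins)
                  + 16 * fp16_values)          # assumed 16-bit super-block values
    return total_bits / n_weights


K_QUANT_BPW = {
    "Q2_K": bits_per_weight(2, 16, 16, 4, True, 1),
    "Q3_K": bits_per_weight(3, 16, 16, 6, False, 1),
    "Q4_K": bits_per_weight(4, 8, 32, 6, True, 2),
    "Q5_K": bits_per_weight(5, 8, 32, 6, True, 2),
    "Q6_K": bits_per_weight(6, 16, 16, 8, False, 1),
}

for name, bpw in K_QUANT_BPW.items():
    print(f"{name}: {bpw} bpw")
```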

Refer to the Provided Files table below to see what files use which methods, and how.

Provided files

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| --- | --- | --- | --- | --- | --- |
| airoboros-7B.ggmlv3.q2_K.bin | q2_K | 2 | 2.80 GB | 5.30 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv and feed_forward.w2 tensors, GGML_TYPE_Q2_K for the other tensors. |
| airoboros-7B.ggmlv3.q3_K_L.bin | q3_K_L | 3 | 3.55 GB | 6.05 GB | New k-quant method. Uses GGML_TYPE_Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
| airoboros-7B.ggmlv3.q3_K_M.bin | q3_K_M | 3 | 3.23 GB | 5.73 GB | New k-quant method. Uses GGML_TYPE_Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else GGML_TYPE_Q3_K. |
| airoboros-7B.ggmlv3.q3_K_S.bin | q3_K_S | 3 | 2.90 GB | 5.40 GB | New k-quant method. Uses GGML_TYPE_Q3_K for all tensors. |
| airoboros-7B.ggmlv3.q4_K_M.bin | q4_K_M | 4 | 4.05 GB | 6.55 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q4_K. |
| airoboros-7B.ggmlv3.q4_K_S.bin | q4_K_S | 4 | 3.79 GB | 6.29 GB | New k-quant method. Uses GGML_TYPE_Q4_K for all tensors. |
| airoboros-7B.ggmlv3.q5_K_M.bin | q5_K_M | 5 | 4.77 GB | 7.27 GB | New k-quant method. Uses GGML_TYPE_Q6_K for half of the attention.wv and feed_forward.w2 tensors, else GGML_TYPE_Q5_K. |
| airoboros-7B.ggmlv3.q5_K_S.bin | q5_K_S | 5 | 4.63 GB | 7.13 GB | New k-quant method. Uses GGML_TYPE_Q5_K for all tensors. |
| airoboros-7B.ggmlv3.q6_K.bin | q6_K | 6 | 5.53 GB | 8.03 GB | New k-quant method. Uses GGML_TYPE_Q6_K (6-bit quantization) for all tensors. |
| airoboros-7b-gpt4.ggmlv3.q4_0.bin | q4_0 | 4 | 3.79 GB | 6.29 GB | Original llama.cpp quant method, 4-bit. |
| airoboros-7b-gpt4.ggmlv3.q4_1.bin | q4_1 | 4 | 4.21 GB | 6.71 GB | Original llama.cpp quant method, 4-bit. Higher accuracy than q4_0 but not as high as q5_0. However, it has quicker inference than the q5 models. |
| airoboros-7b-gpt4.ggmlv3.q5_0.bin | q5_0 | 5 | 4.63 GB | 7.13 GB | Original llama.cpp quant method, 5-bit. Higher accuracy, higher resource usage and slower inference. |
| airoboros-7b-gpt4.ggmlv3.q5_1.bin | q5_1 | 5 | 5.06 GB | 7.56 GB | Original llama.cpp quant method, 5-bit. Even higher accuracy, resource usage and slower inference. |
| airoboros-7b-gpt4.ggmlv3.q8_0.bin | q8_0 | 8 | 7.16 GB | 9.66 GB | Original llama.cpp quant method, 8-bit. Almost indistinguishable from float16. High resource use and slow. Not recommended for most users. |

Note: the above RAM figures assume no GPU offloading. If layers are offloaded to the GPU, this will reduce RAM usage and use VRAM instead.
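
In the Provided files table, every "Max RAM required" value is the file size plus 2.50 GB. The sketch below uses that observation to pick the largest file that fits a RAM budget without GPU offloading. The function name and the budget values are hypothetical, and only a few rows of the table are included for brevity.

```python
from typing import Optional

# Difference between "Max RAM required" and "Size" in the Provided files table.
RAM_OVERHEAD_GB = 2.50

# File sizes in GB, copied from a subset of the table rows.
FILE_SIZES_GB = {
    "q2_K": 2.80,
    "q4_K_M": 4.05,
    "q5_K_M": 4.77,
    "q8_0": 7.16,
}


def largest_quant_that_fits(ram_budget_gb: float) -> Optional[str]:
    """Return the largest listed quant whose estimated RAM use fits the budget."""
    fitting = {
        method: size
        for method, size in FILE_SIZES_GB.items()
        if size + RAM_OVERHEAD_GB <= ram_budget_gb
    }
    if not fitting:
        return None  # nothing in the subset fits this budget
    return max(fitting, key=fitting.get)


print(largest_quant_that_fits(8.0))
```

Offloading layers to the GPU lowers the RAM side of this estimate, so the result is best read as a conservative starting point.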

How to run in llama.cpp

I use the following command line; adjust for your tastes and needs:

./main -t 10 -ngl 32 -m airoboros-7b-gpt4.ggmlv3.q5_0.bin --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\n### Response:"

Change -t 10 to the number of physical CPU cores you have. For example if your system has 8 cores/16 threads, use -t 8.

Change -ngl 32 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.

If you want to have a chat-style conversation, replace the -p <PROMPT> argument with -i -ins.
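
The adjustments described above can be sketched as a small helper that assembles the argument list. The function name and its parameters are hypothetical; the flags themselves are only those shown in the command line above.

```python
def build_main_command(model_path, prompt=None, physical_cores=10, gpu_layers=None):
    """Assemble the ./main argument list used in this README."""
    cmd = ["./main", "-t", str(physical_cores)]
    if gpu_layers is not None:
        # Omit -ngl entirely when there is no GPU acceleration.
        cmd += ["-ngl", str(gpu_layers)]
    cmd += ["-m", model_path, "--color", "-c", "2048",
            "--temp", "0.7", "--repeat_penalty", "1.1", "-n", "-1"]
    if prompt is None:
        cmd += ["-i", "-ins"]  # chat-style conversation instead of a fixed prompt
    else:
        cmd += ["-p", prompt]
    return cmd


# Example: 8 physical cores, 32 layers offloaded to the GPU.
print(build_main_command(
    "airoboros-7b-gpt4.ggmlv3.q5_0.bin",
    prompt="### Instruction: Write a story about llamas\n### Response:",
    physical_cores=8,
    gpu_layers=32,
))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the llama.cpp directory.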

How to run in text-generation-webui

Further instructions here: text-generation-webui/docs/llama.cpp-models.md.

Discord

For further support, and discussions on these models and AI in general, join us at:

TheBloke AI's Discord server

Thanks, and how to contribute.

Thanks to the chirper.ai team!

I've had a lot of people ask if they can contribute. I enjoy providing models and helping people, and would love to be able to spend even more time doing it, as well as expanding into new projects like fine tuning/training.

If you're able and willing to contribute it will be most gratefully received and will help me to keep providing more models, and to start work on new AI projects.

Donors will get priority support on any and all AI/LLM/model questions and requests, access to a private Discord room, plus other benefits.

Special thanks to: Luke from CarbonQuill, Aemon Algiz, Dmitriy Samsonov.

Patreon special mentions: Ajan Kanaga, Kalila, Derek Yates, Sean Connelly, Luke, Nathan LeClaire, Trenton Dambrowitz, Mano Prime, David Flickinger, vamX, Nikolai Manek, senxiiz, Khalefa Al-Ahmad, Illia Dulskyi, trip7s trip, Jonathan Leane, Talal Aujan, Artur Olbinski, Cory Kujawski, Joseph William Delisle, Pyrater, Oscar Rangel, Lone Striker, Luke Pendergrass, Eugene Pentland, Johann-Peter Hartmann.

Thank you to all my generous patrons and donors!

Original model card: Jon Durbin's Airoboros 7B GPT4

Overview

This is a fine-tuned 7b parameter LLaMA model, using completely synthetic training data created by gpt4 via https://github.com/jondurbin/airoboros

The dataset used to fine-tune this model is available here, with a specific focus on:

  • trivia
  • math/reasoning (although it still sucks)
  • coding
  • multiple choice and fill-in-the-blank
  • context-obedient question answering
  • theory of mind
  • misc/general

This model was fine-tuned with a fork of FastChat, and therefore uses the standard vicuna template:

USER:
[prompt]

</s>
ASSISTANT:
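
A minimal sketch of filling in this template programmatically is shown below. The function name is hypothetical, and the end-of-sequence marker is passed as a parameter because its exact spelling is an assumption here (the usual LLaMA marker `</s>` is used as the default).

```python
def build_vicuna_prompt(user_prompt: str, eos: str = "</s>") -> str:
    """Wrap a user prompt in the USER/ASSISTANT layout shown above."""
    # eos defaults to "</s>"; this spelling is an assumption, adjust if needed.
    return f"USER:\n{user_prompt}\n\n{eos}\nASSISTANT:"


print(build_vicuna_prompt("Give me a list of 7 words that start with EN"))
```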

The most important bit, to me, is the context-obedient question answering support, without extensive prompt engineering.

Note: the example prompt/response pairs below are from the 13b model; YMMV with the 7b.

Usage

The easiest way to get started is to use my fork of FastChat, which is mostly the same but allows for the increased context length and adds support for multi-line inputs:

pip install git+https://github.com/jondurbin/FastChat

Then, you can invoke it like so (after downloading the model):

python -m fastchat.serve.cli \
 --model-path airoboros-7b-gpt4 \
 --temperature 0.5 \
 --max-new-tokens 4096 \
 --context-length 4096 \
 --conv-template vicuna_v1.1 \
 --no-history

Context obedient question answering

By obedient, I mean the model was trained to ignore what it thinks it knows, and to use the context to answer the question. The model was also tuned to limit its answers to the provided context as much as possible to reduce hallucinations.

The format for a closed-context prompt is as follows:

BEGININPUT
BEGINCONTEXT
url: https://some.web.site/123
date: 2023-06-01
... other metadata ...
ENDCONTEXT
[insert your text blocks here]
ENDINPUT
[add as many other blocks, in the exact same format]
BEGININSTRUCTION
[insert your instruction(s). The model was tuned with single questions, paragraph format, lists, etc.]
ENDINSTRUCTION
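
The format above can be assembled with a couple of small helpers. This is only a sketch: the function names and the sample values in the usage are hypothetical, and the delimiters are copied from the format as written.

```python
def build_context_block(text, metadata):
    """Build one BEGININPUT ... ENDINPUT block from text and a metadata dict."""
    lines = ["BEGININPUT", "BEGINCONTEXT"]
    lines += [f"{key}: {value}" for key, value in metadata.items()]
    lines += ["ENDCONTEXT", text, "ENDINPUT"]
    return "\n".join(lines)


def build_closed_context_prompt(blocks, instruction):
    """Join any number of (text, metadata) blocks, then append the instruction."""
    parts = [build_context_block(text, metadata) for text, metadata in blocks]
    parts += ["BEGININSTRUCTION", instruction, "ENDINSTRUCTION"]
    return "\n".join(parts)


# Hypothetical usage with a single text block.
print(build_closed_context_prompt(
    [("Some text to answer from.",
      {"url": "https://some.web.site/123", "date": "2023-06-01"})],
    "Summarize the text. Source?",
))
```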

Here's a trivial, but important example to prove the point:

BEGININPUT
BEGINCONTEXT
date: 2021-01-01
url: https://web.site/123
ENDCONTEXT
In a shocking turn of events, blueberries are now green, but will be sticking with the same name.
ENDINPUT
BEGININSTRUCTION
What color are blueberries? Source?
ENDINSTRUCTION

And the response:

Blueberries are now green.
Source:
date: 2021-01-01
url: https://web.site/123

Other stuff

Better coding

Example:

Create a python application with the following requirements:
- Asyncio FastAPI webserver
- ping endpoint that returns the current date in JSON format
- file upload endpoint, which calculates the file's sha256 checksum, and checks postgres to deduplicate

Word games / trivia

USER: This anagram of "silent" is what you do when paying attention and hearing someone.
ASSISTANT: listen
USER: Give me a list of 7 words that start with EN
ASSISTANT: 1. Endlessly
2. Enthusiastically
3. Enormously
4. Enveloping
5. Enduring
6. Enlightening
7. Encouraging

Multiple choice

USER: Which of the following astronomical objects is composed primarily of ice and rock, and is located in the Kuiper Belt?

A) Mercury
B) Venus
C) Pluto
D) Jupiter
ASSISTANT: [C, Pluto]

Writing
