A continuation of my previous work on Magnum 4B: a small model made for creative writing and general assistant tasks, finetuned on top of IntervitensInc/Llama-3.1-Minitron-4B-Width-Base-chatml. This model is meant to be more coherent and generally better than the 4B at both writing and assistant tasks.
EXL2 quants of Holland 4B. Original weights can be found here.
Prompting
The model has been instruct-tuned with ChatML formatting. A typical input would look like this:
"""<|im_start|>system
system prompt<|im_end|>
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
<|im_start|>user
Can I ask a question?<|im_end|>
<|im_start|>assistant
"""
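As a rough illustration of the format above, the following Python sketch assembles a ChatML prompt from a list of messages. The function name `build_chatml_prompt` and the message dict layout are assumptions made for this example and are not part of the model's tooling; in practice, a tokenizer's chat template would normally handle this step.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string.

    Hypothetical helper for illustration only.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>ROLE ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
    {"role": "user", "content": "Can I ask a question?"},
]

prompt = build_chatml_prompt(messages)
print(prompt)
```

The printed string should match the example input shown above, ending with an open assistant turn.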
Support
No longer needed - llama.cpp (LCPP) has merged support, just update.
To run inference on this model, you'll need to use Aphrodite, vLLM or EXL2/tabbyAPI, as llama.cpp hasn't yet merged the required pull request to fix the Llama 3.1 rope_freqs issue with custom head dimensions.
However, you can work around this by quantizing the model yourself to create a functional GGUF file. Note that until this PR is merged, the context will be limited to 8k tokens.
To create a working GGUF file, make the following adjustments:
- Remove the "rope_scaling": {} entry from config.json
- Change "max_position_embeddings" to 8192 in config.json
These modifications should allow you to use the model with llama.cpp, albeit with the mentioned context limitation.
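For anyone still on an older llama.cpp build, the two config.json adjustments listed above could be applied with a small script along these lines. This is a minimal sketch: the function name `patch_config_for_gguf` is made up for this example, and it assumes config.json is plain JSON sitting in your local copy of the model directory.

```python
import json
from pathlib import Path


def patch_config_for_gguf(config_path):
    """Apply the two workaround edits to a local config.json.

    Hypothetical helper for illustration only. Overwrites the file in
    place, so keep a backup of the original if you need it.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    # Adjustment 1: drop the "rope_scaling" entry if it is present.
    config.pop("rope_scaling", None)

    # Adjustment 2: cap the context length at 8192 tokens.
    config["max_position_embeddings"] = 8192

    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Call it with the path to your downloaded copy, for example `patch_config_for_gguf("path/to/model/config.json")`, and then run your usual GGUF conversion and quantization steps on that directory.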
Axolotl config
Credits
- anthracite-org/kalo-opus-instruct-22k-no-refusal
- NewEden/Gryphe-3.5-16k-Subset
- Epiculous/Synthstruct-Gens-v1.1-Filtered-n-Cleaned
- lodrick-the-lafted/OpusStories
Training
The training was done for 2 epochs. We used 2 x RTX 6000 GPUs, graciously provided by Kubernetes_Bad, for the full-parameter fine-tuning of the model.
Safety
...
