Magistrate 3.2 3B
Continued pretraining of meta-llama/Llama-3.2-3B on roughly 250M tokens of legal text, none of it synthetic.
The model achieves the following results on the evaluation set:
- Loss: 0.6802
An instruct version is also available.
Model description
This is a base model trained on US Supreme Court proceedings and on the US federal code and regulations.
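Because this is a base model, it is prompted with plain text to continue rather than with a chat template. The following is a minimal loading sketch, not an official usage recipe: the repository id is taken from the model tree line of this card, while the helper name `complete`, the bfloat16 dtype, greedy decoding, and the token budget are all illustrative assumptions.

```python
MODEL_ID = "macadeliccc/magistrate-3.2-3b-base"  # repo id as listed in the model tree


def complete(prompt: str, max_new_tokens: int = 64) -> str:
    """Continue `prompt` with the base model (hypothetical helper)."""
    # Imported lazily so that defining the helper does not load the libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # bfloat16 is an assumption to reduce memory; drop it to use the default dtype.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Greedy decoding (do_sample=False) keeps the output deterministic.
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `complete("The Fourth Amendment protects")` would download the weights on first use and return the prompt followed by the model's continuation.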
Intended uses & limitations
This model is intended for research purposes. You are liable for all model outputs.
Training and evaluation data
The training data consists of US Supreme Court verdicts, federal regulations, laws, and treaties.
Additional resources from institutions such as CLL were included to fill gaps in the coverage of industry jargon.
Training procedure
Spectrum fine-tune targeting the top 35% of layers. Thanks to the Cognitive Computations team for their work on Spectrum.
The methodology is based on Cohere's paper "To Code, or Not To Code? Exploring Impact of Code in Pre-training".
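As a rough illustration of what a "top 35%" Spectrum selection means, the sketch below keeps the highest-scoring 35% of modules within each module type trainable and treats the rest as frozen. It is a simplified stand-in, not the Spectrum tool itself: the function name `select_trainable`, the module naming scheme, and the signal-to-noise scores are hypothetical, and the real scores come from Spectrum's own analysis of the weights.

```python
from collections import defaultdict


def select_trainable(snr_by_module: dict, top_fraction: float = 0.35) -> set:
    """Return the module names to leave unfrozen.

    Modules are grouped by type (the last component of the name, such as
    "q_proj"), and the top `top_fraction` of each group by score is kept.
    """
    groups = defaultdict(list)
    for name, snr in snr_by_module.items():
        module_type = name.rsplit(".", 1)[-1]
        groups[module_type].append((snr, name))

    selected = set()
    for items in groups.values():
        items.sort(reverse=True)  # highest score first
        keep = max(1, round(len(items) * top_fraction))
        selected.update(name for _, name in items[:keep])
    return selected


# Hypothetical scores for 20 attention query projections.
example_scores = {f"layers.{i}.q_proj": float(i) for i in range(20)}
trainable = select_trainable(example_scores)
```

In a real run, every parameter outside the selected set would have gradients disabled, so only about a third of the targeted modules are updated.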
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- total_eval_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 690
- num_epochs: 3
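The hyperparameters above can be cross-checked with a short sketch. It recomputes the total train batch size and evaluates a cosine schedule with warmup in plain Python. Two things are assumptions: the warmup is taken to be linear, and the total step count of 6942 is an estimate extrapolated from the last row of the training results table (step 6924 at epoch 2.9922), not a value reported by the card.

```python
import math

LEARNING_RATE = 2e-4
WARMUP_STEPS = 690
PER_DEVICE_BATCH = 2
NUM_DEVICES = 2
GRAD_ACCUM_STEPS = 4
TOTAL_STEPS = 6942  # assumption: estimated from the training results table


def effective_batch_size() -> int:
    """Sequences contributing to each optimizer update."""
    return PER_DEVICE_BATCH * NUM_DEVICES * GRAD_ACCUM_STEPS


def lr_at(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(1.0, progress)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate climbs to the 0.0002 peak at step 690 and then decays toward zero by the end of the third epoch, and `effective_batch_size()` reproduces the reported total_train_batch_size of 16.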
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 1.3589 | 0.0004 | 1 | 1.5640 |
| 0.9936 | 0.4984 | 1154 | 0.9440 |
| 0.8384 | 0.9968 | 2308 | 0.8392 |
| 0.8226 | 1.4963 | 3462 | 0.7802 |
| 0.6568 | 1.9949 | 4616 | 0.7059 |
| 0.5163 | 2.4923 | 5770 | 0.6886 |
| 0.4920 | 2.9922 | 6924 | 0.6802 |
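The validation losses in the table can be read as perplexities, which some readers find easier to compare. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated in this card.

```python
import math

# (step, validation loss) pairs copied from the training results table.
VAL_LOSS = [
    (1, 1.5640),
    (1154, 0.9440),
    (2308, 0.8392),
    (3462, 0.7802),
    (4616, 0.7059),
    (5770, 0.6886),
    (6924, 0.6802),
]


def perplexity(loss: float) -> float:
    """Convert a mean cross-entropy loss in nats to perplexity."""
    return math.exp(loss)


PERPLEXITIES = [(step, perplexity(loss)) for step, loss in VAL_LOSS]
```

Under that assumption, the final validation loss of 0.6802 corresponds to a perplexity of roughly 1.97 on the evaluation set.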
Framework versions
- Transformers 4.45.0
- Pytorch 2.3.1+cu121
- Datasets 2.21.0
- Tokenizers 0.20.0
Model tree for macadeliccc/magistrate-3.2-3b-base
Base model: meta-llama/Llama-3.2-3B