URL: https://huggingface.co/tpo-alignment/Mistral-Instruct-7B-TPO-y4

Mistral-Instruct-7B-TPO-y4 Model Card

TPO (Triple Preference Optimization) is a preference optimization algorithm that aims to improve the instruction-following and reasoning capabilities of large language models in a single optimization step. We also introduce TPO-L, a length-controlled variant of TPO that improves performance further by incorporating a reward margin into TPO's structure. For more details, refer to our preprint (arXiv:2405.16681) and GitHub repository.

Model Details

Model Description

We fine-tuned mistralai/Mistral-7B-Instruct-v0.2 on princeton-nlp/mistral-instruct-ultrafeedback with the TPO objective. For each prompt, we ranked the candidate responses by score and selected the highest-scoring response as the gold response, the fourth-best response as the preferred response, and the lowest-scoring response as the rejected response.

  • Developed by: Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral
  • Model type: Causal Language Model
  • License: mistral
  • Finetuned from model: mistralai/Mistral-7B-Instruct-v0.2
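The response selection described above can be sketched in a few lines. This is a minimal illustration, not the authors' preprocessing code: the function name `select_triple`, the `preferred_rank` parameter, and the example scores are hypothetical, and no dataset column names are assumed.

```python
from typing import Dict, Sequence


def select_triple(
    responses: Sequence[str],
    scores: Sequence[float],
    preferred_rank: int = 4,  # hypothetical parameter; 4 mirrors the "y4" setup
) -> Dict[str, str]:
    """Pick gold / preferred / rejected responses from scored candidates."""
    if len(responses) != len(scores):
        raise ValueError("responses and scores must have the same length")
    # The preferred response must sit strictly between the best and the worst.
    if not 1 < preferred_rank < len(responses):
        raise ValueError("preferred_rank must be between 2 and len(responses) - 1")

    # Indices sorted from highest score to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {
        "gold": responses[order[0]],                        # highest-scoring
        "preferred": responses[order[preferred_rank - 1]],  # fourth-best by default
        "rejected": responses[order[-1]],                   # lowest-scoring
    }


# Illustrative scores only; real scores would come from the dataset's reward model.
triple = select_triple(["a", "b", "c", "d", "e"], [0.1, 0.9, 0.5, 0.3, 0.7])
print(triple)
```

With five candidates per prompt, the fourth-best and the lowest-scoring responses are adjacent in the ranking, so the preferred and rejected responses differ by only one rank.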

Model Sources

  • Paper: https://arxiv.org/abs/2405.16681

How to Get Started with the Model

import torch
from transformers import pipeline

model_id = "tpo-alignment/Mistral-Instruct-7B-TPO-y4"

# Load the model in bfloat16 on a single CUDA device.
generator = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

# A single conversation, passed as a list of chat messages.
messages = [{"role": "user", "content": "What's the difference between llamas and alpacas?"}]

# Greedy decoding. "<end_of_turn>" is not a Mistral token, so the
# tokenizer's own EOS token is the only stop token used here.
outputs = generator(
    messages,
    do_sample=False,
    eos_token_id=generator.tokenizer.eos_token_id,
    max_new_tokens=200,
)
print(outputs[0]["generated_text"])

Training Details

Training Data

We use princeton-nlp/mistral-instruct-ultrafeedback as the preference optimization dataset.

Training Hyperparameters

The hyperparameters used can be found in the repository.

Technical Specifications

Model Architecture and Objective

The model architecture is based on mistralai/Mistral-7B-Instruct-v0.2. We use the TPO training objective proposed in our preprint.
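This card does not state the objective itself, so the following is only a rough sketch of what a one-step loss over three responses could look like: a reference-free pairwise preference term on the preferred and rejected responses plus a negative log-likelihood term on the gold response. That structure, the function names, and the `beta` and `alpha` defaults are assumptions made for illustration; the preprint is the authoritative source for the actual TPO formula and hyperparameters.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def combined_loss(
    logp_gold: float,       # sequence log-probability of the gold response
    logp_preferred: float,  # sequence log-probability of the preferred response
    logp_rejected: float,   # sequence log-probability of the rejected response
    beta: float = 0.01,     # placeholder value, not taken from the card
    alpha: float = 1.0,     # placeholder value, not taken from the card
) -> float:
    # Pairwise term: push the preferred response above the rejected one.
    preference_term = -log_sigmoid(beta * (logp_preferred - logp_rejected))
    # Likelihood term: raise the probability of the gold response.
    gold_term = -alpha * logp_gold
    return preference_term + gold_term


# The loss is lower when the policy ranks the preferred response higher.
print(combined_loss(-1.0, -2.0, -5.0) < combined_loss(-1.0, -5.0, -2.0))
```

Because both terms depend on the same policy, a single backward pass updates the model toward the gold response and toward the preference ordering at once, which is the sense in which such a loss is "one-step".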

Hardware

We used 8×A100 GPUs for model training.

Citation

TPO paper:

@misc{saeidi2025triplepreferenceoptimizationachieving,
 title={Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization}, 
 author={Amir Saeidi and Shivanshu Verma and Aswin RRV and Kashif Rasul and Chitta Baral},
 year={2025},
 eprint={2405.16681},
 archivePrefix={arXiv},
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2405.16681}, 
}

Model size: 7B params (Safetensors, tensor type F32)