Model Card for truthful-qwen-dpo
Model Details
Model Description
This model is a Direct Preference Optimization (DPO) fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct, trained on truthfulness-focused preference data to produce more truthful and reliable responses.
The training objective is to improve the model’s ability to generate factually correct answers and avoid common misconceptions.
- Developed by: Supipi Karunathilaka
- Model Type: Causal Language Model
- Language(s): English
- Finetuned from model: Qwen/Qwen2.5-0.5B-Instruct
Model Sources
- Base Model: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
Uses
This model can be used for:
- Question answering
- Educational applications
- Research on truthful language generation
- Benchmark evaluation tasks
Example tasks include answering factual questions while minimizing hallucinations.
Out-of-Scope Use
This model should not be used for:
- Legal or medical advice
- High-risk decision-making
- Real-time safety-critical systems
Bias, Risks, and Limitations
Despite fine-tuning for truthfulness, the model may still:
- Generate incorrect information
- Reflect biases present in training data
- Hallucinate unsupported facts
Users should verify responses when accuracy is critical.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
How to Get Started with the Model
Use the code below to get started with the model.
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model_name = "Karunathilaka123/truthful-qwen-dpo"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize a question and generate up to 100 new tokens.
prompt = "Why do humans need sleep?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)

# Decode the generated token IDs back into text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Training Details
Training Data
The model was trained on the jondurbin/truthy-dpo-v0.1 preference dataset. Its goal matches that of TruthfulQA, whose questions test whether language models produce truthful answers instead of common misconceptions.
Dataset: jondurbin/truthy-dpo-v0.1
Training Procedure
The model was fine-tuned using Direct Preference Optimization (DPO).
DPO trains the model using pairs of responses:
- Preferred answer (chosen)
- Less preferred answer (rejected)
This encourages the model to generate responses closer to the preferred answers.
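The per-pair DPO objective described above can be sketched in a few lines of plain Python. This is an illustrative sketch of the loss from Rafailov et al. (2023), not the training code used for this model. The function names are hypothetical, and the `beta` value of 0.1 is an assumption, since the card does not report it. Inputs are summed log-probabilities of each response under the trained policy and under a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # ASSUMPTION: beta is not reported in this model card
) -> float:
    """DPO loss for a single (chosen, rejected) preference pair."""
    # Implicit rewards: how much more likely each response is under the
    # policy than under the frozen reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the chosen response gains margin over the rejected one.
    margin = beta * (chosen_reward - rejected_reward)
    return -log_sigmoid(margin)


# Policy identical to the reference: zero margin.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen answer: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(baseline, improved)
```

When the policy matches the reference the margin is zero and the loss equals log 2. Raising the chosen answer's log-probability relative to the rejected one lowers the loss.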
Training Hyperparameters
- Training regime:
  - Training method: DPO
  - Epochs: 5
  - Learning rate: 1e-5
  - Gradient accumulation steps: 8
  - Max sequence length: 256
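For reference, the reported values can be collected in a small configuration object. This is a hedged sketch: the class name `DPOHyperparameters` is hypothetical, and the per-device batch size and device count are assumptions, because the card does not report them. Only the four values listed above come from the card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DPOHyperparameters:
    """Hyperparameters reported in this model card (hypothetical container)."""

    epochs: int = 5
    learning_rate: float = 1e-5
    gradient_accumulation_steps: int = 8
    max_sequence_length: int = 256
    # ASSUMPTION: the per-device batch size is not reported in the card.
    per_device_batch_size: int = 1

    def effective_batch_size(self, num_devices: int = 1) -> int:
        """Examples contributing to each optimizer step."""
        return (
            self.per_device_batch_size
            * self.gradient_accumulation_steps
            * num_devices
        )


config = DPOHyperparameters()
print(config.effective_batch_size())
```

With gradient accumulation, each optimizer step averages gradients over several forward and backward passes, so the effective batch size is the product of the per-device batch size, the accumulation steps, and the number of devices.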
Evaluation
Evaluation can be performed using AlpacaEval or TruthfulQA benchmarks.
The goal is to measure:
- Truthfulness
- Response quality
- Alignment with preferred answers
Model performance is measured using the win rate metric, calculated as:
win_rate = (wins + 0.5 × ties) / total × 100
Where:
- wins = number of prompts where the model response is preferred
- ties = number of prompts where both responses are equally good
- total = total number of evaluation prompts
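The formula above translates directly into a small helper. This is a minimal sketch: the function name `win_rate` and the input validation are illustrative additions, and the counts in the usage line are made-up numbers, not results for this model.

```python
def win_rate(wins: int, ties: int, total: int) -> float:
    """Win rate in percent, counting each tie as half a win."""
    if total <= 0:
        raise ValueError("total must be positive")
    if wins < 0 or ties < 0 or wins + ties > total:
        raise ValueError("wins and ties must be non-negative and sum to at most total")
    return 100.0 * (wins + 0.5 * ties) / total


# Hypothetical counts for illustration only.
print(win_rate(wins=60, ties=10, total=100))
```

A value of 50 means the model and the baseline are preferred equally often. Values above 50 favor the evaluated model.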
Testing Data, Factors & Metrics
Testing Data
Dataset: tatsu-lab/alpaca_eval
Results
The DPO fine-tuned model performs better than the base model.
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
Technical Specifications
Model Architecture and Objective
Model Architecture
Base architecture: Transformer Decoder (Qwen2.5)
Training method: Direct Preference Optimization (DPO)
Citation
BibTeX:
@inproceedings{lin2022truthfulqa,
title={{TruthfulQA}: Measuring how models mimic human falsehoods},
author={Lin, Stephanie and Hilton, Jacob and Evans, Owain},
booktitle={Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers)},
pages={3214--3252},
year={2022}
}
@article{rafailov2023direct,
title={Direct preference optimization: Your language model is secretly a reward model},
author={Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D and Ermon, Stefano and Finn, Chelsea},
journal={Advances in neural information processing systems},
volume={36},
pages={53728--53741},
year={2023}
}
@misc{li2023alpacaeval,
title={{AlpacaEval}: An automatic evaluator of instruction-following models},
author={Li, Xuechen and Zhang, Tianyi and Dubois, Yann and Taori, Rohan and Gulrajani, Ishaan and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B},
year={2023}
}
APA:
Lin, S., Hilton, J., & Evans, O. (2022, May). TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3214-3252).
Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36, 53728-53741.
Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., ... & Hashimoto, T. B. (2023, May). AlpacaEval: An automatic evaluator of instruction-following models.
Model Card Authors
Supipi Karunathilaka
