Model Card for Karunathilaka123/truthful-qwen-dpo

Model Details

Model Description

This model is a Direct Preference Optimization (DPO) fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct. It was trained on truthfulness-focused preference data to produce more truthful and reliable responses.

The training objective is to improve the model’s ability to generate factually correct answers and avoid common misconceptions.

Developed by: Supipi Karunathilaka
Model Type: Causal Language Model
Language(s): English
Finetuned from model: Qwen/Qwen2.5-0.5B-Instruct

Model Sources

Base Model: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct

Uses

This model can be used for:

Question answering
Educational applications
Research on truthful language generation
Benchmark evaluation tasks

Example tasks include answering factual questions while minimizing hallucinations.

Out-of-Scope Use

This model should not be used for:

Legal or medical advice
High-risk decision-making
Real-time safety-critical systems

Bias, Risks, and Limitations

Despite fine-tuning for truthfulness, the model may still:

Generate incorrect information
Reflect biases present in training data
Hallucinate unsupported facts

Users should verify responses when accuracy is critical.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

How to Get Started with the Model

Use the code below to get started with the model.


from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_name = "Karunathilaka123/truthful-qwen-dpo"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Why do humans need sleep?"

# Tokenize the prompt and generate up to 100 new tokens
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)

# Decode the generated token IDs back into text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Data

The model was trained on a truthfulness-focused preference dataset. Like the TruthfulQA benchmark, it is concerned with whether language models produce truthful answers instead of common misconceptions.

Dataset: jondurbin/truthy-dpo-v0.1

Training Procedure

The model was fine-tuned using Direct Preference Optimization (DPO).

DPO trains the model using pairs of responses:

Preferred answer (chosen)
Less preferred answer (rejected)

This encourages the model to generate responses closer to the preferred answers.
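The preference objective described above can be sketched for a single chosen/rejected pair, following the loss in Rafailov et al. (2023). This is a plain-Python illustration, not the training code used for this model: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions, since this card does not report the beta value or the training library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio))).

    Each *_logp argument is the summed log-probability of a full response
    under the policy (model being trained) or the frozen reference model.
    beta=0.1 is an illustrative default, not a value taken from this card.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy already favors the chosen answer.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the reference model and lowers the rejected one's, which is what pushes generations toward the preferred answers.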

Training Hyperparameters

Training method: DPO
Epochs: 5
Learning rate: 1e-5
Gradient accumulation steps: 8
Max sequence length: 256
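One way to record these settings in code is as keyword arguments. The key names below follow Hugging Face `TrainingArguments` and TRL `DPOConfig` field names, but this card does not say which library was used, so that mapping is an assumption. The per-device batch size is not reported in this card; the value used here is a hypothetical placeholder, included only to show how gradient accumulation scales the effective batch size.

```python
# Hyperparameters reported in this card, under assumed TRL-style names.
dpo_kwargs = {
    "num_train_epochs": 5,
    "learning_rate": 1e-5,
    "gradient_accumulation_steps": 8,
    "max_length": 256,  # assumed to correspond to "Max sequence length"
}


def effective_batch_size(per_device_batch_size, accumulation_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch_size * accumulation_steps * num_devices


# Hypothetical: the card does not report the per-device batch size.
per_device_batch_size = 1

print(effective_batch_size(per_device_batch_size,
                           dpo_kwargs["gradient_accumulation_steps"]))
```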

Evaluation

Evaluation can be performed using AlpacaEval or TruthfulQA benchmarks.

The goal is to measure:

Truthfulness
Response quality
Alignment with preferred answers

Model performance is measured using the win rate metric, calculated as:

win_rate = (wins + 0.5 × ties) / total × 100

Where:

wins = number of prompts where the model response is preferred
ties = number of prompts where both responses are equally good
total = total number of evaluation prompts
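The formula above translates directly into a small helper. This is a minimal sketch: the function name `win_rate` and the sample counts are hypothetical and are not results for this model.

```python
def win_rate(wins: int, ties: int, total: int) -> float:
    """Win rate in percent, counting each tie as half a win."""
    if total <= 0:
        raise ValueError("total must be positive")
    if wins < 0 or ties < 0 or wins + ties > total:
        raise ValueError("wins and ties must be non-negative and sum to at most total")
    # Same quantity as (wins + 0.5 * ties) / total * 100.
    return 100.0 * (wins + 0.5 * ties) / total


# Hypothetical counts for illustration only.
print(win_rate(wins=60, ties=10, total=100))  # 65.0
```

A value of 50 means the model and the comparison responses are preferred equally often; values above 50 favor the model.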

Testing Data, Factors & Metrics

Testing Data

Dataset: tatsu-lab/alpaca_eval

Results

The DPO fine-tuned model performs better than the base model, Qwen/Qwen2.5-0.5B-Instruct, as measured by win rate.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Technical Specifications

Model Architecture and Objective

Base architecture: Transformer Decoder (Qwen2.5)
Training method: Direct Preference Optimization (DPO)

Citation

BibTeX:


@inproceedings{lin2022truthfulqa,
 title={TruthfulQA: Measuring how models mimic human falsehoods},
 author={Lin, Stephanie and Hilton, Jacob and Evans, Owain},
 booktitle={Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers)},
 pages={3214--3252},
 year={2022}
}

@article{rafailov2023direct,
 title={Direct preference optimization: Your language model is secretly a reward model},
 author={Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D and Ermon, Stefano and Finn, Chelsea},
 journal={Advances in neural information processing systems},
 volume={36},
 pages={53728--53741},
 year={2023}
}

@misc{li2023alpacaeval,
 title={AlpacaEval: An automatic evaluator of instruction-following models},
 author={Li, Xuechen and Zhang, Tianyi and Dubois, Yann and Taori, Rohan and Gulrajani, Ishaan and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B},
 year={2023}
}

APA:

Lin, S., Hilton, J., & Evans, O. (2022, May). TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3214-3252).

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 53728-53741.

Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., ... & Hashimoto, T. B. (2023, May). AlpacaEval: An automatic evaluator of instruction-following models.

Model Card Authors

Supipi Karunathilaka


URL: https://huggingface.co/Karunathilaka123/truthful-qwen-dpo
