URL: https://huggingface.co/Karunathilaka123/truthful-qwen-dpo

Model Card for truthful-qwen-dpo

Model Details

Model Description

This model is a Direct Preference Optimization (DPO) fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct. It was trained to produce more truthful and reliable responses using the TruthfulQA dataset.

The training objective is to improve the model’s ability to generate factually correct answers and avoid common misconceptions.

  • Developed by: Supipi Karunathilaka
  • Model Type: Causal Language Model
  • Language(s): English
  • Finetuned from model: Qwen/Qwen2.5-0.5B-Instruct

Uses

This model can be used for:

  • Question answering

  • Educational applications

  • Research on truthful language generation

  • Benchmark evaluation tasks

Example tasks include answering factual questions while minimizing hallucinations.

Out-of-Scope Use

This model should not be used for:

  • Legal or medical advice

  • High-risk decision-making

  • Real-time safety-critical systems

Bias, Risks, and Limitations

Despite fine-tuning for truthfulness, the model may still:

  • Generate incorrect information

  • Reflect biases present in training data

  • Hallucinate unsupported facts

Users should verify responses when accuracy is critical.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

How to Get Started with the Model

Use the code below to get started with the model.


from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Karunathilaka123/truthful-qwen-dpo"

# Download the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Why do humans need sleep?"

# Tokenize the prompt and generate up to 100 new tokens
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)

# Decode the full sequence (prompt plus continuation), dropping special tokens
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Data

The model was trained using the TruthfulQA dataset, which contains questions designed to test whether language models produce truthful answers instead of common misconceptions.

Dataset: jondurbin/truthy-dpo-v0.1

Training Procedure

The model was fine-tuned using Direct Preference Optimization (DPO).

DPO trains the model using pairs of responses:

  • Preferred answer (chosen)

  • Less preferred answer (rejected)

This encourages the model to generate responses closer to the preferred answers.
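
The pairwise objective described above can be sketched as follows. This is a minimal illustration of the DPO loss from Rafailov et al. (2023), not the training code used for this model. The function name `dpo_loss`, the argument names, and `beta=0.1` are assumptions made for the example; this card does not report a beta value.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss computed from summed sequence log-probabilities.

    `policy_*` values come from the model being trained, `ref_*` values from
    a frozen copy of the starting model. `beta` is an assumed value.
    """
    # Log-ratio of policy to reference for each response in the pair
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin by which the policy favors the chosen response
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree, the margin is zero and the loss equals log 2. The loss falls below that value as the policy assigns relatively more probability to the chosen answer than the reference does, and rises above it when the policy favors the rejected answer.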

Training Hyperparameters

Training regime:

  • Training method: DPO
  • Epochs: 5
  • Learning rate: 1e-5
  • Gradient accumulation steps: 8
  • Max sequence length: 256

Evaluation

Evaluation can be performed using the AlpacaEval or TruthfulQA benchmarks.

The goal is to measure:

  • Truthfulness

  • Response quality

  • Alignment with preferred answers

Model performance is measured using the win rate metric, calculated as:

win_rate = (wins + 0.5 × ties) / total × 100

Where:

  • wins = number of prompts where the model response is preferred
  • ties = number of prompts where both responses are equally good
  • total = total number of evaluation prompts
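
The formula above can be written as a small helper. This is a sketch only: the function name `win_rate` and the input validation are assumptions, and the counts in the usage line are hypothetical, not results reported for this model.

```python
def win_rate(wins: int, ties: int, total: int) -> float:
    """Return the win rate as a percentage, counting each tie as half a win."""
    if total <= 0:
        raise ValueError("total must be positive")
    if wins < 0 or ties < 0 or wins + ties > total:
        raise ValueError("wins and ties must be non-negative and sum to at most total")
    return (wins + 0.5 * ties) / total * 100


# Hypothetical counts for illustration only
print(win_rate(wins=40, ties=20, total=100))  # 50.0
```

Counting a tie as half a win means a model that ties on every prompt scores 50, the same as a model that wins exactly half of the prompts and loses the rest.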

Testing Data, Factors & Metrics

Testing Data

Dataset: tatsu-lab/alpaca_eval

Results

The DPO fine-tuned model performs better than the base model, Qwen/Qwen2.5-0.5B-Instruct.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Technical Specifications

Model Architecture and Objective

Base architecture: Transformer Decoder (Qwen2.5)

Training method: Direct Preference Optimization (DPO)

Citation

BibTeX:


@inproceedings{lin2022truthfulqa,
 title={{TruthfulQA}: Measuring how models mimic human falsehoods},
 author={Lin, Stephanie and Hilton, Jacob and Evans, Owain},
 booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
 pages={3214--3252},
 year={2022}
}

@article{rafailov2023direct,
 title={Direct preference optimization: Your language model is secretly a reward model},
 author={Rafailov, Rafael and Sharma, Archit and Mitchell, Eric and Manning, Christopher D and Ermon, Stefano and Finn, Chelsea},
 journal={Advances in Neural Information Processing Systems},
 volume={36},
 pages={53728--53741},
 year={2023}
}

@misc{li2023alpacaeval,
 title={{AlpacaEval}: An automatic evaluator of instruction-following models},
 author={Li, Xuechen and Zhang, Tianyi and Dubois, Yann and Taori, Rohan and Gulrajani, Ishaan and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B},
 year={2023}
}

APA:

Lin, S., Hilton, J., & Evans, O. (2022, May). TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 3214-3252).

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 53728-53741.

Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., ... & Hashimoto, T. B. (2023, May). AlpacaEval: An automatic evaluator of instruction-following models.

Model Card Authors

Supipi Karunathilaka
