mT0-XL-detox-orpo

Model Information

This is a multilingual 3.7B-parameter text detoxification model for 9 languages, based on mT0-XL and built for the TextDetox 2024 shared task. The model was trained in two steps: first, full fine-tuning on several parallel text detoxification datasets; second, ORPO alignment on a self-annotated preference dataset collected using toxicity and similarity classifiers. See the paper for more details.
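As a rough illustration of how classifier scores could turn model outputs into preference data, the sketch below builds one (prompt, chosen, rejected) record from a list of candidate rewrites. Everything in it is an assumption made for illustration: the toy scorers `toy_toxicity` and `toy_similarity`, the ranking rule (least toxic first, ties broken by similarity to the source), and the record layout. The actual classifiers and selection criteria are described in the paper.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for real toxicity and similarity classifiers.
TOY_TOXIC_WORDS = {"stupid", "idiot"}


def toy_toxicity(text: str) -> float:
    """Fraction of words found in a tiny blocklist (0.0 means clean)."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(w.strip(".,!?") in TOY_TOXIC_WORDS for w in words)
    return hits / len(words)


def toy_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (1.0 means identical sets)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def build_preference_pair(
    source: str,
    candidates: List[str],
    toxicity: Callable[[str], float] = toy_toxicity,
    similarity: Callable[[str, str], float] = toy_similarity,
) -> Dict[str, str]:
    """Rank candidates (assumed rule: low toxicity first, then high similarity)
    and return the best as 'chosen' and the worst as 'rejected'."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(
        candidates,
        key=lambda c: (toxicity(c), -similarity(source, c)),
    )
    return {"prompt": source, "chosen": ranked[0], "rejected": ranked[-1]}


pair = build_preference_pair(
    "this is a stupid idea",
    ["this is a stupid idea", "this is a bad idea", "hello"],
)
```

In practice the toy scorers would be replaced by trained classifiers, and records in this shape could then be fed to a preference-optimization trainer.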

In human evaluation, the model was the second-best approach in the TextDetox 2024 shared task. More precisely, it shows state-of-the-art performance for Ukrainian, top-2 scores for Arabic, and near state-of-the-art performance for the other languages.

Example usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained('s-nlp/mt0-xl-detox-orpo', device_map="auto")
tokenizer = AutoTokenizer.from_pretrained('s-nlp/mt0-xl-detox-orpo')

# Language code -> prompt prefix expected by the model.
LANG_PROMPTS = {
    'zh': '排毒：',
    'es': 'Desintoxicar: ',
    'ru': 'Детоксифицируй: ',
    'ar': 'إزالة السموم: ',
    'hi': 'विषहरण: ',
    'uk': 'Детоксифікуй: ',
    'de': 'Entgiften: ',
    'am': 'መርዝ መርዝ: ',
    'en': 'Detoxify: ',
}

def detoxify(text, lang, model, tokenizer):
    # Prepend the language-specific prompt and move the inputs to the model's device.
    encodings = tokenizer(LANG_PROMPTS[lang] + text, return_tensors='pt').to(model.device)

    # Diverse beam search: 10 beams in 5 groups, returning 5 candidates.
    outputs = model.generate(
        **encodings,
        max_length=128,
        num_beams=10,
        no_repeat_ngram_size=3,
        repetition_penalty=1.2,
        num_beam_groups=5,
        diversity_penalty=2.5,
        num_return_sequences=5,
        early_stopping=True,
    )

    # Decode every candidate, dropping special tokens.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
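
Because `detoxify` returns five candidates and raises a bare `KeyError` for an unknown language code, a little glue code around it may be convenient. The helpers below are a minimal sketch and not part of the model's API: `build_prompt` and `unique_candidates` are hypothetical names, and the assumption that decoded candidates can coincide or come back empty is not something the model card states.

```python
from typing import Dict, List


def build_prompt(text: str, lang: str, prompts: Dict[str, str]) -> str:
    """Prefix `text` with the prompt for `lang`, failing clearly if unsupported."""
    if lang not in prompts:
        supported = ", ".join(sorted(prompts))
        raise ValueError(f"Unsupported language {lang!r}; expected one of: {supported}")
    return prompts[lang] + text


def unique_candidates(candidates: List[str]) -> List[str]:
    """Strip whitespace, drop empty strings, and remove duplicates in order."""
    seen = set()
    result = []
    for candidate in candidates:
        cleaned = candidate.strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


# Intended use with the example above (not executed here, since it needs the model):
# candidates = unique_candidates(detoxify(text, 'en', model, tokenizer))
```

Passing `LANG_PROMPTS` as the `prompts` argument keeps the helper independent of any global state.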

Citation

@inproceedings{smurfcat_at_pan,
 author = {Elisei Rykov and
 Konstantin Zaytsev and
 Ivan Anisimov and
 Alexandr Voronin},
 editor = {Guglielmo Faggioli and
 Nicola Ferro and
 Petra Galusc{\'{a}}kov{\'{a}} and
 Alba Garc{\'{\i}}a Seco de Herrera},
 title = {SmurfCat at {PAN} 2024 TextDetox: Alignment of Multilingual Transformers
 for Text Detoxification},
 booktitle = {Working Notes of the Conference and Labs of the Evaluation Forum {(CLEF}
 2024), Grenoble, France, 9-12 September, 2024},
 series = {{CEUR} Workshop Proceedings},
 volume = {3740},
 pages = {2866--2871},
 publisher = {CEUR-WS.org},
 year = {2024},
 url = {https://ceur-ws.org/Vol-3740/paper-276.pdf},
 timestamp = {Wed, 21 Aug 2024 22:46:00 +0200},
 biburl = {https://dblp.org/rec/conf/clef/RykovZAV24.bib},
 bibsource = {dblp computer science bibliography, https://dblp.org}
}

Model size: 4B params (Safetensors)
Tensor type: F32

Paper for s-nlp/mt0-xl-detox-orpo: 2407.05449 (published Jul 7, 2024)