Catalan BERTa (roberta-base-ca) fine-tuned for Semantic Textual Similarity.

Model description

roberta-base-ca-cased-sts is a Semantic Textual Similarity (STS) model for the Catalan language, fine-tuned from roberta-base-ca, a RoBERTa base model pre-trained on a medium-sized corpus collected from publicly available corpora and crawlers.

Intended uses and limitations

The roberta-base-ca-cased-sts model can be used to assess the similarity between two snippets of text. The model is limited by its training dataset and may not generalize well to all use cases.

How to use

To get the model's correct¹ prediction scores, with values between 0.0 and 5.0, use the following code:

from transformers import pipeline, AutoTokenizer
from scipy.special import logit

model = 'projecte-aina/roberta-base-ca-cased-sts'
tokenizer = AutoTokenizer.from_pretrained(model)
pipe = pipeline('text-classification', model=model, tokenizer=tokenizer)

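def prepare(sentence_pairs):
    # Build each pair by hand as "<cls> s1<sep><sep> s2<sep>", since the pipeline
    # is later called with add_special_tokens=False.
    sentence_pairs_prep = []
    for s1, s2 in sentence_pairs:
        sentence_pairs_prep.append(f"{tokenizer.cls_token} {s1}{tokenizer.sep_token}{tokenizer.sep_token} {s2}{tokenizer.sep_token}")
    return sentence_pairs_prep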

sentence_pairs = [("El llibre va caure per la finestra.", "El llibre va sortir volant."),
 ("M'agrades.", "T'estimo."),
 ("M'agrada el sol i la calor", "A la Garrotxa plou molt.")]

predictions = pipe(prepare(sentence_pairs), add_special_tokens=False)

# convert the normalized scores back to the original 0-5 interval
for prediction in predictions:
    prediction['score'] = logit(prediction['score'])
print(predictions)

Expected output:

[{'label': 'SIMILARITY', 'score': 2.118301674983813}, 
{'label': 'SIMILARITY', 'score': 2.1799755855125853}, 
{'label': 'SIMILARITY', 'score': 0.9511617858568939}]

¹ Avoid using the widget scores, since they are normalized and do not reflect the original annotation values.
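The conversion step in the snippet above can be illustrated in isolation. The following is a minimal sketch, assuming the normalization applied to the raw score is a sigmoid (which is what `logit` inverts); the helper names `sigmoid` and `logit` here are plain-Python stand-ins written for this illustration, not part of the model's code.

```python
import math

def sigmoid(x):
    """Map a real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of the sigmoid: recover the real-valued score from p."""
    return math.log(p / (1.0 - p))

# First score from the expected output above, on the original 0-5 scale.
raw_score = 2.118301674983813

# Assumption: a normalized score is the sigmoid of the raw score.
normalized = sigmoid(raw_score)

# Applying the logit undoes the normalization.
recovered = logit(normalized)

print(normalized)
print(recovered)
```

Under this assumption, a normalized score always lies strictly between 0 and 1, which is why it cannot be compared directly with the 0-5 annotation scale.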

Limitations and bias

At the time of submission, no measures have been taken to estimate the bias embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated.

Training

Training data

We used the STS dataset in Catalan called STS-ca for training and evaluation.

Training procedure

The model was trained with a batch size of 16 and a learning rate of 5e-5 for 5 epochs. We then selected the best checkpoint using the downstream task metric on the corresponding development set and evaluated it on the test set.
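The checkpoint selection described above can be sketched as follows. This is an illustration only: the hyperparameter values come from the text, while the function name `select_best_checkpoint`, the checkpoint names, and the development-set scores are hypothetical placeholders, not results from the actual training run.

```python
# Hyperparameters as stated in the training procedure.
hyperparameters = {"batch_size": 16, "learning_rate": 5e-5, "num_epochs": 5}

def select_best_checkpoint(dev_scores):
    """Return the name of the checkpoint with the highest development-set score."""
    return max(dev_scores, key=dev_scores.get)

# Hypothetical development-set scores, one per epoch (placeholders, not real results).
dev_scores = {
    "epoch-1": 0.71,
    "epoch-2": 0.76,
    "epoch-3": 0.78,
    "epoch-4": 0.80,
    "epoch-5": 0.79,
}

best = select_best_checkpoint(dev_scores)
print(best)  # the checkpoint that would then be evaluated on the test set
```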

Evaluation

Variable and metrics

This model was fine-tuned to maximize the average of the Pearson and Spearman correlations.
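The combined metric can be sketched in plain Python as below. This is a minimal illustration, not the evaluation script from the repository: the function names are chosen for this example, Spearman is computed as the Pearson correlation of average ranks, and the gold and predicted values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

def combined_score(gold, predicted):
    """Average of the Pearson and Spearman correlations."""
    return (pearson(gold, predicted) + spearman(gold, predicted)) / 2

# Hypothetical gold annotations and model predictions on the 0-5 scale.
gold = [0.5, 1.0, 2.5, 4.0, 5.0]
predicted = [0.8, 0.9, 2.0, 4.2, 4.6]
print(combined_score(gold, predicted))
```

Averaging the two rewards predictions that are both linearly related to the annotations (Pearson) and correctly ordered (Spearman).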

Evaluation results

We evaluated roberta-base-ca-cased-sts on the STS-ca test set against standard multilingual and monolingual baselines:

Model	STS-ca (Pearson score)
roberta-base-ca-cased-sts	79.73
mBERT	74.26
XLM-RoBERTa	61.61

For more details, check the fine-tuning and evaluation scripts in the official GitHub repository.

Additional information

Author

Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)

Contact information

For further information, send an email to aina@bsc.es

Copyright

Licensing information

Apache License, Version 2.0

Funding

This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.

Citation Information

If you use any of these resources (datasets or models) in your work, please cite our latest paper:

@inproceedings{armengol-estape-etal-2021-multilingual,
 title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
 author = "Armengol-Estap{\'e}, Jordi and
 Carrino, Casimiro Pio and
 Rodriguez-Penagos, Carlos and
 de Gibert Bonet, Ona and
 Armentano-Oller, Carme and
 Gonzalez-Agirre, Aitor and
 Melero, Maite and
 Villegas, Marta",
 booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
 month = aug,
 year = "2021",
 address = "Online",
 publisher = "Association for Computational Linguistics",
 url = "https://aclanthology.org/2021.findings-acl.437",
 doi = "10.18653/v1/2021.findings-acl.437",
 pages = "4933--4946",
}

Disclaimer


URL: https://huggingface.co/projecte-aina/roberta-base-ca-cased-sts
