RoBERTaCrawlPT-base

RoBERTaCrawlPT-base is a generic Portuguese masked language model pretrained from scratch on the CrawlPT corpus, using the same architecture as RoBERTa-base. This model is part of the RoBERTaLexPT work.

Language(s) (NLP): Portuguese (pt-BR mainly)
License: Creative Commons Attribution 4.0 International Public License
Repository: https://github.com/eduagarcia/roberta-legal-portuguese
Paper: https://aclanthology.org/2024.propor-1.38/
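
As a masked language model, the checkpoint can be queried for fill-in-the-blank predictions. The following is a minimal sketch, assuming the `transformers` library is installed and that the model is published under the Hub identifier `eduagarcia/RoBERTaCrawlPT-base`. The helper names `mask_word` and `top_predictions` are hypothetical and introduced only for illustration.

```python
MODEL_ID = "eduagarcia/RoBERTaCrawlPT-base"  # Hub identifier of this model


def mask_word(sentence, word, mask_token="<mask>"):
    """Replace the first occurrence of `word` in `sentence` with the mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, mask_token, 1)


def top_predictions(sentence, word, k=5):
    """Mask `word` and return the k most likely fillers as (token, score) pairs.

    Downloads the model on first use; requires the `transformers` package.
    """
    from transformers import pipeline  # imported lazily so the helper above works offline

    fill = pipeline("fill-mask", model=MODEL_ID)
    # Use the tokenizer's own mask token rather than assuming its spelling.
    masked = mask_word(sentence, word, fill.tokenizer.mask_token)
    return [(p["token_str"].strip(), p["score"]) for p in fill(masked, top_k=k)]
```

Calling `top_predictions("A capital do Brasil é Brasília.", "Brasília")` would mask the last word and list the model's candidates for it; the actual predictions depend on the checkpoint and are not shown here.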

Generic Evaluation

TO-DO...

Legal Evaluation

The model was evaluated on the PortuLex benchmark, a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.

Macro F1-Score (%) for multiple models evaluated on the PortuLex benchmark test splits:

Model	LeNER	UlyNER-PL (Coarse/Fine)	FGV-STF (Coarse)	RRIP	Average (%)
BERTimbau-base	88.34	86.39/83.83	79.34	82.34	83.78
BERTimbau-large	88.64	87.77/84.74	79.71	83.79	84.60
Albertina-PT-BR-base	89.26	86.35/84.63	79.30	81.16	83.80
Albertina-PT-BR-xlarge	90.09	88.36/86.62	79.94	82.79	85.08
BERTikal-base	83.68	79.21/75.70	77.73	81.11	79.99
JurisBERT-base	81.74	81.67/77.97	76.04	80.85	79.61
BERTimbauLAW-base	84.90	87.11/84.42	79.78	82.35	83.20
Legal-XLM-R-base	87.48	83.49/83.16	79.79	82.35	83.24
Legal-XLM-R-large	88.39	84.65/84.55	79.36	81.66	83.50
Legal-RoBERTa-PT-large	87.96	88.32/84.83	79.57	81.98	84.02
Ours
RoBERTaTimbau-base (Reproduction of BERTimbau)	89.68	87.53/85.74	78.82	82.03	84.29
RoBERTaLegalPT-base (Trained on LegalPT)	90.59	85.45/84.40	79.92	82.84	84.57
RoBERTaCrawlPT-base (this) (Trained on CrawlPT)	89.24	88.22/86.58	79.88	82.80	84.83
RoBERTaLexPT-base (Trained on CrawlPT + LegalPT)	90.73	88.56/86.03	80.40	83.22	85.41
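
The table does not state how the Average column is computed. The reported values are consistent with first averaging the two UlyNER-PL scores (coarse and fine) and then taking the unweighted mean over the four tasks. The sketch below checks that reading against the row for this model; the function name `portulex_average` is an illustrative assumption.

```python
def portulex_average(lener, ulyner_coarse, ulyner_fine, fgv_stf, rrip):
    """Mean over the four tasks, with UlyNER-PL counted once (coarse/fine averaged)."""
    ulyner = (ulyner_coarse + ulyner_fine) / 2
    return (lener + ulyner + fgv_stf + rrip) / 4


# Scores of RoBERTaCrawlPT-base taken from the table above.
average = portulex_average(89.24, 88.22, 86.58, 79.88, 82.80)
print(f"{average:.2f}")  # 84.83, matching the Average column
```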

Training Details

RoBERTaCrawlPT is pretrained on CrawlPT, a composition of three general Portuguese corpora: brWaC, the CC100 PT subset, and the OSCAR-2301 PT subset.

Training Procedure

Pretraining was run with the Fairseq library v0.10.2 on a DGX-A100 cluster, using two NVIDIA A100 80 GB GPUs. Training a single configuration takes approximately three days.

This computational cost is similar to that of BERTimbau-base: the model is exposed to approximately 65 billion tokens during training.
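
The 65 billion figure can be reproduced from the step count, batch size, and maximum sequence length listed under Training Hyperparameters. The short calculation below assumes every sequence is filled to the 512-token maximum, so it is an upper bound rather than an exact count.

```python
steps = 62_500        # training steps
batch_size = 2048     # sequences per step
max_seq_len = 512     # maximum tokens per sequence

# Upper bound: assumes every sequence uses all 512 positions.
tokens_seen = steps * batch_size * max_seq_len
print(f"{tokens_seen:,}")  # 65,536,000,000
```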

Preprocessing

We deduplicated all subsets of the CrawlPT corpus using the MinHash algorithm and Locality Sensitive Hashing implementation from the text-dedup library to find clusters of duplicate documents.
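
To illustrate the idea behind MinHash with Locality Sensitive Hashing, here is a small standard-library sketch. It is not the text-dedup implementation: the shingle size, number of hash functions, and band count are illustrative assumptions, and all function names are hypothetical. Documents whose signatures agree on at least one band are merged into the same cluster.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 64               # signature length (illustrative value)
BANDS = 16                    # number of LSH bands (illustrative value)
ROWS = NUM_HASHES // BANDS    # signature entries per band
NGRAM = 3                     # word n-gram (shingle) size (illustrative value)


def shingles(text, n=NGRAM):
    """Return the set of word n-grams of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def seeded_hash(seed, shingle):
    """64-bit hash of a shingle; each seed acts as a different hash function."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text):
    """MinHash signature: the minimum hash of the shingle set under each seed."""
    items = shingles(text)
    return tuple(min(seeded_hash(seed, s) for s in items) for seed in range(NUM_HASHES))


def duplicate_clusters(docs):
    """Group document indices whose signatures collide in at least one LSH band."""
    buckets = defaultdict(list)
    for idx, doc in enumerate(docs):
        sig = minhash_signature(doc)
        for band in range(BANDS):
            buckets[(band, sig[band * ROWS:(band + 1) * ROWS])].append(idx)

    parent = list(range(len(docs)))  # union-find over document indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members in buckets.values():
        for other in members[1:]:
            root_a, root_b = find(members[0]), find(other)
            if root_a != root_b:
                parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = defaultdict(list)
    for idx in range(len(docs)):
        clusters[find(idx)].append(idx)
    return [group for group in clusters.values() if len(group) > 1]


docs = [
    "o tribunal julgou o recurso improcedente ontem",
    "a receita de bolo leva farinha ovos e leite",
    "o tribunal julgou o recurso improcedente ontem",
]
print(duplicate_clusters(docs))  # the two identical documents share a cluster
```

In a deduplication pass, one document per cluster would typically be kept and the rest dropped. More bands with fewer rows each make the grouping more permissive, while fewer bands make it stricter.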

To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from the HuggingFace Tokenizers library to train a separate vocabulary for each pretraining corpus.
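
To show why a corpus-specific vocabulary matters, the following toy example learns BPE merge rules in plain Python. It is a simplified character-level illustration of the algorithm, not the HuggingFace Tokenizers implementation, and the two tiny corpora are made up for the example. Each corpus yields its own merges, so frequent domain terms end up as single vocabulary entries.

```python
from collections import Counter


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merge rules from an iterable of strings."""
    # Represent each whitespace-separated word as a tuple of symbols (characters at first).
    words = Counter(tuple(word) for line in corpus for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda pair: (pairs[pair], pair))
        merges.append(best)
        # Apply the merge to every word.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges


# Hypothetical miniature corpora, for illustration only.
general_corpus = ["o gato e o cachorro", "o gato dorme"]
legal_corpus = ["lei lei lei leis", "juiz"]

print(learn_bpe(general_corpus, 3))
print(learn_bpe(legal_corpus, 2))
```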

Training Hyperparameters

The model was pretrained for 62,500 steps with a batch size of 2048, a learning rate of 4e-4, and sequences of at most 512 tokens.
Weights were initialized randomly.
We employed the masked language modeling objective, where 15% of the input tokens were randomly masked.
Optimization used the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.
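
The masking step of the objective can be sketched as follows. This is a simplified illustration under stated assumptions: every selected position is replaced by `<mask>`, whereas the standard BERT/RoBERTa recipe also leaves some selected tokens unchanged or swaps them for random tokens. The function name and the example sentence are hypothetical.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="<mask>", seed=None):
    """Mask roughly `mask_prob` of the tokens and return (masked_tokens, labels).

    `labels` holds the original token at masked positions and None elsewhere,
    so the loss would only be computed on masked positions.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


sentence = "o juiz determinou que o processo fosse arquivado por falta de provas".split()
masked, labels = mask_tokens(sentence, seed=0)
print(masked)
print(labels)
```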

For other parameters we adopted the standard RoBERTa-base hyperparameters:

Hyperparameter	RoBERTa-base
Number of layers	12
Hidden size	768
FFN inner hidden size	3072
Attention heads	12
Attention head size	64
Dropout	0.1
Attention dropout	0.1
Warmup steps	6k
Peak learning rate	4e-4
Batch size	2048
Weight decay	0.01
Maximum training steps	62.5k
Learning rate decay	Linear
AdamW $$\epsilon$$	1e-6
AdamW $$\beta_1$$	0.9
AdamW $$\beta_2$$	0.98
Gradient clipping	0.0
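
The learning rate schedule implied by the table (6k warmup steps, peak 4e-4, linear decay over 62.5k steps) can be written as a small function. This sketch assumes the warmup starts from zero and the decay ends at zero on the final step; neither endpoint is stated above, and the function is an illustration rather than the Fairseq scheduler.

```python
PEAK_LR = 4e-4
WARMUP_STEPS = 6_000
TOTAL_STEPS = 62_500


def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to zero at TOTAL_STEPS (assumed endpoints)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 3_000, 6_000, 34_250, 62_500):
    print(step, f"{learning_rate(step):.1e}")
```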

Citation

@inproceedings{garcia-etal-2024-robertalexpt,
 title = "{R}o{BERT}a{L}ex{PT}: A Legal {R}o{BERT}a Model pretrained with deduplication for {P}ortuguese",
 author = "Garcia, Eduardo A. S. and
 Silva, Nadia F. F. and
 Siqueira, Felipe and
 Albuquerque, Hidelberg O. and
 Gomes, Juliana R. S. and
 Souza, Ellen and
 Lima, Eliomar A.",
 editor = "Gamallo, Pablo and
 Claro, Daniela and
 Teixeira, Ant{\'o}nio and
 Real, Livy and
 Garcia, Marcos and
 Oliveira, Hugo Gon{\c{c}}alo and
 Amaro, Raquel",
 booktitle = "Proceedings of the 16th International Conference on Computational Processing of Portuguese",
 month = mar,
 year = "2024",
 address = "Santiago de Compostela, Galicia/Spain",
 publisher = "Association for Computational Linguistics",
 url = "https://aclanthology.org/2024.propor-1.38",
 pages = "374--383",
}

Acknowledgment

This work has been supported by the AI Center of Excellence (Centro de Excelência em Inteligência Artificial – CEIA) of the Institute of Informatics at the Federal University of Goiás (INF-UFG).


Model size: 0.1B params (Safetensors)
Tensor type: F32


Evaluation results

F1 scores on test sets (self-reported):

Dataset	F1
lener_br	0.892
UlyNER-PL Coarse	0.882
UlyNER-PL Fine	0.866
FGV-STF	0.799
RRIP	0.828
PortuLex (average)	0.848

URL: https://huggingface.co/eduagarcia/RoBERTaCrawlPT-base
