ScandiBERT

Note: The model was updated on 2022-09-27.

The model was trained on the data shown in the table below. Training used a batch size of 8.8k and ran for 72 epochs on 24 V100 GPUs, taking about 2 weeks.
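As a rough sense of scale, the hardware figures above can be turned into an approximate compute budget. This is only a back-of-the-envelope sketch: it assumes "about 2 weeks" means exactly 14 days of continuous use on all 24 cards, which the text does not state.

```python
# Figures quoted in the text; the duration is approximate, so the result is an estimate.
NUM_GPUS = 24
TRAINING_DAYS = 14  # assumption: "about 2 weeks" taken as exactly 14 days
HOURS_PER_DAY = 24


def gpu_hours(num_gpus: int, days: float) -> float:
    """Return total GPU-hours, assuming every GPU is busy for the whole period."""
    return num_gpus * days * HOURS_PER_DAY


total = gpu_hours(NUM_GPUS, TRAINING_DAYS)
print(f"Approximate compute: {total:,.0f} V100 GPU-hours")
```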

Language	Data	Size
Icelandic	See IceBERT paper	16 GB
Danish	Danish Gigaword Corpus (incl. Twitter)	4.7 GB
Norwegian	NCC corpus	42 GB
Swedish	Swedish Gigaword Corpus	3.4 GB
Faroese	FC3 + Sosialurinn + Bible	69 MB
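The sizes in the table differ by several orders of magnitude, which is easier to see as shares of the total. The sketch below uses only the figures from the table; converting 69 MB to 0.069 GB assumes 1 GB = 1000 MB, and the variable names are illustrative.

```python
# Corpus sizes in GB, taken from the table above.
corpus_gb = {
    "Icelandic": 16.0,
    "Danish": 4.7,
    "Norwegian": 42.0,
    "Swedish": 3.4,
    "Faroese": 0.069,  # 69 MB, assuming 1 GB = 1000 MB
}

total_gb = sum(corpus_gb.values())
shares = {lang: size / total_gb for lang, size in corpus_gb.items()}

# Print languages from largest to smallest share of the training data.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:<10} {corpus_gb[lang]:>7.3f} GB  {share:6.1%}")
print(f"{'Total':<10} {total_gb:>7.3f} GB")
```

By this count Norwegian makes up roughly 63% of the raw data and Faroese about 0.1%. These are shares of corpus size on disk, not of tokens seen during training.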

Note: A half-trained model was uploaded here at an earlier date. It has since been removed, and the model has been updated.

This is a Scandinavian BERT model trained on a large collection of Danish, Faroese, Icelandic, Norwegian, and Swedish text. It is currently the highest-ranking model on the ScandEval leaderboard: https://scandeval.github.io/pretrained/
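A minimal sketch of querying the model for masked-token predictions with the Hugging Face `transformers` library follows. It assumes the checkpoint published as `vesteinn/ScandiBERT` includes a masked-language-modeling head. The helper names `mask_sentence` and `top_predictions` and the `{mask}` placeholder are illustrative and not part of the model or the library. The mask token is read from the tokenizer rather than hard-coded, since its exact form depends on the tokenizer.

```python
def mask_sentence(template: str, mask_token: str) -> str:
    """Replace the {mask} placeholder with the tokenizer's own mask token."""
    return template.format(mask=mask_token)


def top_predictions(template: str, model_id: str = "vesteinn/ScandiBERT", top_k: int = 5):
    """Return (token, score) pairs for the masked position in `template`.

    Assumes `transformers` and a backend such as PyTorch are installed;
    the model weights are downloaded on first use.
    """
    from transformers import pipeline  # imported lazily so the helper above has no dependencies

    fill = pipeline("fill-mask", model=model_id)
    text = mask_sentence(template, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=top_k)]
```

For example, `top_predictions("København er hovedstaden i {mask}.")` would return the five highest-scoring candidates for the masked word in a Danish sentence.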

If you find this model useful, please cite:

@inproceedings{snaebjarnarson-etal-2023-transfer,
 title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese",
 author = "Snæbjarnarson, Vésteinn and
 Simonsen, Annika and
 Glavaš, Goran and
 Vulić, Ivan",
 booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
 month = "may 22--24",
 year = "2023",
 address = "Tórshavn, Faroe Islands",
 publisher = {Link{\"o}ping University Electronic Press, Sweden},
}

Model size: 0.1B params (Safetensors)
Tensor types: I64, F32

URL: https://huggingface.co/vesteinn/ScandiBERT
