Model Card for LiteLID
This model is a reproduction of GlotLID for 125 languages written in the Latin (Latn) script. It was trained on the original GlotLID-C data for these languages, enriched with 1 million word-level examples per language. The word-level examples were obtained by splitting sentences from the dataset. The model also uses a larger hash table than GlotLID (2e6 buckets instead of 1e6).
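The sentence-splitting step described above could look roughly like the sketch below. It is not the author's preprocessing script: it assumes the training data is in the usual fastText supervised format (`__label__xxx` followed by the text), that words are separated by whitespace, and that a simple random sample is used to cap the number of examples per language. The function names and the `per_language` parameter are hypothetical.

```python
import random

LABEL_PREFIX = "__label__"  # standard fastText label prefix


def sentence_to_word_examples(line):
    """Turn one '__label__xxx some sentence' line into one labelled line per word."""
    label, _, text = line.strip().partition(" ")
    if not label.startswith(LABEL_PREFIX):
        return []  # skip lines that do not follow the assumed format
    # Assumption: whitespace tokenisation is good enough for Latin-script text.
    return [f"{label} {word}" for word in text.split()]


def sample_word_examples(lines, per_language, seed=0):
    """Collect word-level examples and keep at most `per_language` per label."""
    by_label = {}
    for line in lines:
        for example in sentence_to_word_examples(line):
            label = example.split(" ", 1)[0]
            by_label.setdefault(label, []).append(example)

    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        examples = by_label[label]
        rng.shuffle(examples)  # hypothetical choice: uniform random sample
        sampled.extend(examples[:per_language])
    return sampled
```

For the setup described in this card, `per_language` would be 1,000,000, and the sampled lines would be appended to the sentence-level training file.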
Model Details
Model Description
- Developed by: Joanna Radoła
- Model type: fastText architecture
- Language(s): ace, afr, als, ast, ayr, azj, bam, ban, bem, bjn, bug, cat, ceb, ces, cjk, crh, cym, dan, deu, dik, dyu, ekk, eng, epo, eus, ewe, fao, fij, fil, fin, fon, fra, fur, fuv, gaz, gla, gle, glg, gug, hat, hau, hin, hun, ibo, ilo, ind, isl, ita, jav, kab, kac, kam, kbp, kea, kik, kin, kmb, kmr, knc, kng, lij, lim, lin, lit, lmo, ltg, ltz, lua, lug, luo, lus, lvs, min, mlt, mos, mri, nld, nno, nob, npi, nso, nus, nya, oci, pag, pap, plt, pol, por, quy, ron, run, sag, scn, slk, slv, smo, sna, som, sot, spa, srd, ssw, sun, swe, swh, szl, taq, tpi, tsn, tso, tuk, tum, tur, twi, umb, uzn, vec, vie, war, wol, xho, yor, zsm, zul
How to Get Started with the Model
import fasttext
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(repo_id="paruwka/LiteLID", filename="wordlid_v3.ftz", cache_dir=None)
model = fasttext.load_model(model_path)
# For a list of inputs, predict() returns a tuple:
# (top-k language labels for each input, their respective probabilities for each input)
labels, probs = model.predict(['predicting', 'language'], k=3)
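The tuple returned by `model.predict` is a little awkward to read directly, so a small helper like the one below may be useful. This is only a sketch: `tidy_predictions` is a hypothetical name, and it assumes the labels carry the standard fastText `__label__` prefix (for example `__label__eng_Latn`, following the GlotLID convention), which should be checked against the actual model output.

```python
LABEL_PREFIX = "__label__"  # standard fastText label prefix


def tidy_predictions(words, raw_output):
    """Pair each input word with its (language, probability) candidates.

    `raw_output` is the tuple returned by model.predict(words, k=...):
    (labels for each input, probabilities for each input).
    """
    labels_per_word, probs_per_word = raw_output
    results = []
    for word, labels, probs in zip(words, labels_per_word, probs_per_word):
        candidates = []
        for label, prob in zip(labels, probs):
            # Strip the prefix only if it is present (assumed label format).
            if label.startswith(LABEL_PREFIX):
                label = label[len(LABEL_PREFIX):]
            candidates.append((label, float(prob)))
        results.append((word, candidates))
    return results
```

With the snippet above, it could be called as `tidy_predictions(words, model.predict(words, k=3))`, where `words` is the list of input strings.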
Training Hyperparameters
lr=0.8, epochs=1, dim=256, minn=2, maxn=5, bucket=2000000, loss='softmax'
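For readers who want to reproduce a similar run, the hyperparameters above map onto the fastText Python API roughly as follows. This is a sketch rather than the original training script: the `train` function and its file-path arguments are hypothetical, and all settings not listed above are left at fastText's defaults. Note that `fasttext.train_supervised` names the epoch argument `epoch`, not `epochs`.

```python
# Hyperparameters from this card, using fastText's own argument names.
TRAIN_PARAMS = {
    "lr": 0.8,
    "epoch": 1,         # listed as "epochs" above; fastText calls it `epoch`
    "dim": 256,
    "minn": 2,          # shortest character n-gram
    "maxn": 5,          # longest character n-gram
    "bucket": 2000000,  # hash table size (2e6 instead of GlotLID's 1e6)
    "loss": "softmax",
}


def train(train_file, output_file):
    """Train a supervised fastText model on `train_file` and save it.

    Both paths are placeholders chosen for this example.
    """
    # Imported lazily so the parameters can be inspected without fastText installed.
    import fasttext

    model = fasttext.train_supervised(input=train_file, **TRAIN_PARAMS)
    model.save_model(output_file)
    return model
```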
Evaluation
...
Model tree for paruwka/LiteLID
Base model
cis-lmu/glotlid