[IMPORTANT: We are forced to temporarily suspend access to OSCAR. The temporary nature of this suspension has led us to choose to implement it as gated access with manual control, and we will not grant any access until the situation has been clarified. We sincerely apologize for this situation and hope to be able to restore access as soon as possible. In the meantime, we remind you that we have always prohibited access to or use of OSCAR that violates the legislation in force where you are located. In France, for example, any use of OSCAR that does not fall within the framework of the so-called 'TDM' or the so-called 'research' exceptions to copyright has always been prohibited.] By filling the form below, you understand that only the metadata and the annotations of OSCAR 23.01 have a cc0-1.0 license, and that the rest of the content is crawled data derived from the November/December 2022 snapshot of Common Crawl, for which the authors of OSCAR do not hold any copyright whatsoever.

Dataset Card for "OSCAR 23.01"

IMPORTANT NOTE: THIS DATASET CARD IS STILL BEING WRITTEN, PLEASE BE PATIENT WHILE WE COMPLETE ALL THE INFORMATION ABOUT THE CORPUS

Dataset Summary

The OSCAR project (Open Super-large Crawled Aggregated coRpus) is an Open Source project aiming to provide web-based multilingual resources and datasets for Machine Learning (ML) and Artificial Intelligence (AI) applications. The project focuses specifically on providing large quantities of unannotated raw data of the kind commonly used to pre-train large deep learning models. The OSCAR project has developed high-performance data pipelines specifically conceived to classify and filter large amounts of web data. The project has also paid special attention to improving the data quality of web-based corpora and to providing data for low-resource languages, so that these new ML/AI technologies are accessible to as many communities as possible.

OSCAR 23.01 is the January 2023 version of the OSCAR Corpus based on the November/December 2022 dump of Common Crawl. While being quite similar to OSCAR 22.01, it contains several new features, including KenLM-based adult content detection, precomputed Locality-Sensitive Hashes for near deduplication, and blocklist-based categories. OSCAR 23.01 has also moved from gzip to Zstandard compression. You might already have zstd installed on your system, but if not, please check the Zstandard website for installation instructions.

Supported Tasks and Leaderboards

OSCAR is mainly intended to pretrain language models and word representations.

Languages

All the data is distributed by language; both the original and the deduplicated versions of the data are available. 151 different languages are available. The table in the Data Splits subsection lists, for each subcorpus, the language code, the number of documents, the number of words (space-separated tokens) and the content size.

Issues

OSCAR 23.01 may have quality issues in small subcorpora, as has been the case in previous releases.

Note that since language identification is applied to each document as a whole, a given language subcorpus is expected to contain some lines in other languages. For example, it is known and expected that the German subcorpus contains documents with lines identified as Swiss German / Alemannic.
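The following is a minimal sketch of how lines in other languages could be dropped from a single document. It assumes the record layout shown in the Layout section, and it further assumes that the entries of `sentence_identifications` correspond positionally to the newline-separated lines of `content`, with `null` marking a line that received no identification. The function name `lines_in_language` and the `min_prob` threshold are hypothetical and not part of OSCAR.

```python
from typing import Any, Dict, List


def lines_in_language(doc: Dict[str, Any], lang: str, min_prob: float = 0.5) -> List[str]:
    """Return the lines of `doc` whose line-level identification matches `lang`.

    Assumption: sentence_identifications[i] describes the i-th line of content.
    """
    lines = doc["content"].split("\n")
    idents = doc["metadata"].get("sentence_identifications") or []
    kept = []
    for line, ident in zip(lines, idents):
        # A None entry means no language was identified for this line.
        if ident is not None and ident["label"] == lang and ident["prob"] >= min_prob:
            kept.append(line)
    return kept
```

Calling `lines_in_language(doc, "de")` on a document from the German subcorpus would then keep only the lines labelled `de`, leaving out lines labelled as Swiss German or left unidentified.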

If you encounter something that is unexpected, please file an issue here: https://github.com/oscar-corpus/corpus/issues.

Language code	Language	Issues

Dataset Structure

We show detailed information for all the configurations of the dataset.

Data Instances

TODO

Layout

{
  "content":"English sentence\nphrase en français\n????????????", // (1)
  "warc_headers":{ // (2)
    "warc-identified-content-language":"fra,eng",
    "warc-target-uri":"https://fr.wikipedia.org/wiki/...",
    "warc-record-id":"<urn:uuid:29eaa920-d299-4b1d-b687-c72bd8d68116>",
    "warc-type":"conversion",
    "content-length":"35298", // (3)
    "warc-refers-to":"<urn:uuid:39e42055-0d94-4e45-9c6c-9e7056635d64>",
    "warc-block-digest":"sha1:WFH2A5WHCS2H365GIAFYQPI7UOAMFGHB", // (3)
    "warc-date":"2022-11-26T09:45:47Z",
    "content-type":"text/plain"
  },
  "metadata":{
    "identification":{ // (4)
      "label":"fr",
      "prob":0.8938327
    },
    "harmful_pp":4063.1814, // (5)
    "tlsh":"tlsh:T125315FF2B6088901EEA097015DB39B4600B...", // (6)
    "quality_warnings":[ // (7)
      "short_sentences",
      "header",
      "footer"
    ],
    "categories":[ // (8)
      "examen_pix",
      "liste_bu"
    ],
    "sentence_identifications":[ // (9)
      {
        "label":"fr",
        "prob":0.99837273
      },
      {
        "label":"en",
        "prob":0.9992377
      },
      null
    ]
  }
}
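
As an illustration of how the fields above might be used, here is a minimal sketch that filters documents on their document-level identification and quality warnings. It assumes the data has already been decompressed and is stored as JSON Lines (one document per line, without the `// (n)` annotation markers shown above). The function names `keep_document` and `filter_jsonl`, and the default threshold, are hypothetical choices for this example and not part of any OSCAR tooling.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def keep_document(
    doc: Dict[str, Any],
    lang: str,
    min_prob: float = 0.8,
    banned_warnings: Iterable[str] = (),
) -> bool:
    """Decide whether a document passes simple metadata-based checks."""
    meta = doc.get("metadata", {})
    ident = meta.get("identification") or {}
    # Document-level language label and confidence.
    if ident.get("label") != lang:
        return False
    if ident.get("prob", 0.0) < min_prob:
        return False
    # Reject documents carrying any of the unwanted quality warnings.
    warnings = set(meta.get("quality_warnings") or [])
    return warnings.isdisjoint(banned_warnings)


def filter_jsonl(lines: Iterable[str], lang: str, **kwargs: Any) -> Iterator[Dict[str, Any]]:
    """Parse JSON Lines input and yield only the documents that pass keep_document."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        doc = json.loads(line)
        if keep_document(doc, lang, **kwargs):
            yield doc
```

For instance, `filter_jsonl(open(path, encoding="utf-8"), "fr", banned_warnings={"header", "footer"})` would yield French documents that carry neither of those two warnings, where `path` stands for a decompressed file of your choosing.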

Data Splits

Table

	Code	Language	# docs	# words	Content Length
0	af	Afrikaans	23,994	6,217,024	37.2 MB
1	sq	Albanian	1,342,790	462,694,599	3.2 GB
2	am	Amharic	119,434	40,262,809	512.9 MB
3	ar	Arabic	25,012,116	10,081,452,882	110.7 GB
4	an	Aragonese	34	264	11.0 kB
5	hy	Armenian	1,056,974	336,045,041	4.9 GB
6	as	Assamese	89,542	24,395,215	412.1 MB
7	ast	Asturian	440	10,917	74.1 kB
8	av	Avaric	44	1,073	18.6 kB
9	az	Azerbaijani	1,159,994	316,850,330	3.0 GB
10	bn	Bangla	3,474,086	1,092,983,765	19.1 GB
11	ba	Bashkir	128,248	26,036,637	363.7 MB
12	eu	Basque	678,474	136,672,615	1.2 GB
13	be	Belarusian	445,612	164,729,607	2.3 GB
14	bh	Bihari languages	48	507	6.8 kB
15	bpy	Bishnupriya	2,346	346,947	5.4 MB
16	bs	Bosnian	20	395	3.0 kB
17	br	Breton	36,338	4,759,407	31.4 MB
18	bg	Bulgarian	8,933,998	3,635,273,738	44.1 GB
19	my	Burmese	430,276	82,433,836	3.0 GB
20	ca	Catalan	6,953,898	2,240,460,836	15.3 GB
21	ceb	Cebuano	16,174	6,263,404	41.1 MB
22	ckb	Central Kurdish	182,508	61,334,746	772.9 MB
23	ce	Chechen	11,686	1,051,752	13.9 MB
24	zh	Chinese	138,478,270	44,378,380,161	1.4 TB
25	cv	Chuvash	16,652	3,039,925	42.3 MB
26	kw	Cornish	8	80	432 Bytes
27	hr	Croatian	31,808	3,542,961	26.5 MB
28	cs	Czech	34,859,632	9,717,378,559	77.0 GB
29	da	Danish	7,214,338	2,217,634,340	14.8 GB
30	dv	Divehi	77,060	10,655,359	200.1 MB
31	nl	Dutch	72,552,688	19,564,553,306	135.0 GB
32	mhr	Eastern Mari	9,502	1,615,215	22.9 MB
33	arz	Egyptian Arabic	3,958	385,511	3.7 MB
34	en	English	1,235,510,986	523,869,288,690	3.4 TB
35	eo	Esperanto	226,924	67,774,923	474.8 MB
36	et	Estonian	3,601,904	938,296,892	8.0 GB
37	tl	Filipino	250,558	110,560,444	719.2 MB
38	fi	Finnish	14,471,710	4,198,143,883	41.1 GB
39	fr	French	158,334,998	62,127,088,294	430.5 GB
40	gl	Galician	248,762	38,345,625	255.7 MB
41	ka	Georgian	1,343,036	373,935,158	8.4 GB
42	de	German	206,598,430	73,848,586,648	594.7 GB
43	gom	Goan Konkani	398	121,035	2.3 MB
44	el	Greek	20,282,864	7,691,622,692	95.7 GB
45	gn	Guarani	14	260	2.2 kB
46	gu	Gujarati	425,552	417,001,705	5.6 GB
47	ht	Haitian Creole	2	20,671	93.1 kB
48	he	Hebrew	3,997,888	1,697,158,891	18.0 GB
49	hi	Hindi	5,514,454	2,475,605,444	32.6 GB
50	hu	Hungarian	21,349,372	16,013,364,289	150.1 GB
51	is	Icelandic	1,210,232	294,471,539	2.2 GB
52	io	Ido	224	2,598	16.1 kB
53	ilo	Iloko	144	4,411	28.0 kB
54	id	Indonesian	7,109,778	3,228,020,221	23.4 GB
55	ia	Interlingua	34	9,384	33.5 kB
56	ie	Interlingue	2	0	881 Bytes
57	ga	Irish	29,894	9,054,923	63.2 MB
58	it	Italian	89,021,606	36,327,274,203	259.4 GB
59	ja	Japanese	94,236,404	4,401,059,165	181.2 GB
60	jv	Javanese	172	3,286	25.7 kB
61	xal	Kalmyk	2	27	315 Bytes
62	kn	Kannada	448,500	124,924,350	2.6 GB
63	krc	Karachay-Balkar	496	8,385	122.4 kB
64	kk	Kazakh	677,622	214,679,857	3.3 GB
65	km	Khmer	450,660	59,880,231	3.2 GB
66	kv	Komi	460	5,909	70.3 kB
67	ko	Korean	15,147,698	3,435,866,935	38.1 GB
68	ku	Kurdish	80,338	25,921,607	174.1 MB
69	ky	Kyrgyz	144,288	32,062,783	489.3 MB
70	lo	Lao	118,374	10,659,203	472.1 MB
71	la	Latin	14,384	307,865	2.0 MB
72	lv	Latvian	2,435,882	845,459,899	7.4 GB
73	lez	Lezghian	676	60,634	856.6 kB
74	li	Limburgish	6	169	1.4 kB
75	lt	Lithuanian	5,182,028	1,674,362,574	14.5 GB
76	jbo	Lojban	572	312,315	1.5 MB
77	lmo	Lombard	112	3,269	21.0 kB
78	nds	Low German	5,248	1,612,175	10.7 MB
79	dsb	Lower Sorbian	8	84	664 Bytes
80	lb	Luxembourgish	18,090	2,514,838	18.4 MB
81	mk	Macedonian	1,063,298	389,344,425	4.7 GB
82	mai	Maithili	46	467	6.8 kB
83	mg	Malagasy	10,830	1,416,430	11.2 MB
84	ms	Malay	11,500	238,477	2.6 MB
85	ml	Malayalam	800,936	236,597,838	5.8 GB
86	mt	Maltese	5,180	149,886	1.3 MB
87	mr	Marathi	729,578	252,706,331	4.5 GB
88	mzn	Mazanderani	384	16,115	169.2 kB
89	min	Minangkabau	2,436	305,589	3.8 MB
90	xmf	Mingrelian	7,318	283,316	6.1 MB
91	mwl	Mirandese	4	54	423 Bytes
92	mn	Mongolian	1,061,710	454,350,415	5.8 GB
93	multi	Multilingual	2,948,202	1,251,676,406	11.9 GB
94	nah	Nahuatl languages	38	279	2.4 kB
95	ne	Nepali	1,152,156	278,901,036	4.9 GB
96	new	Newari	1,996	229,703	4.0 MB
97	no	Norwegian	2,797,378	373,160,033	2.6 GB
98	nn	Norwegian Nynorsk	19,470	575,518	3.7 MB
99	oc	Occitan	920	34,701	405.0 kB
100	or	Odia	158,426	31,963,340	543.1 MB
101	os	Ossetic	8,628	3,935,964	50.7 MB
102	ps	Pashto	87,408	30,196,179	261.6 MB
103	fa	Persian	23,813,882	9,609,206,698	93.2 GB
104	pms	Piedmontese	2,524	510,087	3.1 MB
105	pl	Polish	57,184,826	18,073,705,588	147.1 GB
106	pt	Portuguese	36,062,800	15,172,557,311	105.0 GB
107	pa	Punjabi	222,058	104,235,418	1.4 GB
108	qu	Quechua	2	13	143 Bytes
109	ro	Romanian	11,985,668	6,302,600,833	45.6 GB
110	bxr	Russia Buriat	72	698	8.2 kB
111	ru	Russian	194,143,422	78,032,029,344	1.1 TB
112	sah	Sakha	17,566	4,288,051	68.8 MB
113	sa	Sanskrit	16,802	2,479,345	56.3 MB
114	gd	Scottish Gaelic	776	18,458	146.1 kB
115	sr	Serbian	1,677,896	632,781,822	7.7 GB
116	sh	Serbian (Latin)	3,214	166,517	816.4 kB
117	sd	Sindhi	48,566	14,667,207	131.6 MB
118	si	Sinhala	301,066	172,755,385	2.6 GB
119	sk	Slovak	8,931,784	2,704,716,280	21.5 GB
120	sl	Slovenian	1,112,560	192,816,743	1.4 GB
121	so	Somali	6	51	503 Bytes
122	azb	South Azerbaijani	26,364	2,029,729	28.4 MB
123	es	Spanish	153,574,556	63,388,237,965	429.9 GB
124	su	Sundanese	18	258	2.0 kB
125	sw	Swahili	1,664	164,459	1.0 MB
126	sv	Swedish	21,891,348	6,993,719,601	50.0 GB
127	gsw	Swiss German	342	34,328	232.7 kB
128	tg	Tajik	144,932	76,987,285	1.0 GB
129	ta	Tamil	1,638,238	738,824,392	15.8 GB
130	tt	Tatar	262,654	59,253,765	833.8 MB
131	te	Telugu	644,712	201,575,815	3.9 GB
132	th	Thai	14,845,900	2,224,483,018	92.0 GB
133	bo	Tibetan	62,352	6,062,558	531.6 MB
134	tr	Turkish	26,654,330	8,290,890,087	73.7 GB
135	tk	Turkmen	4,576	325,786	3.3 MB
136	uk	Ukrainian	10,059,992	3,183,842,018	44.7 GB
137	x-eml	Emiliano-Romagnol	4	329	1.8 kB
138	hsb	Upper Sorbian	402	15,827	123.2 kB
139	ur	Urdu	887,004	434,023,273	3.8 GB
140	ug	Uyghur	51,304	14,659,554	219.8 MB
141	uz	Uzbek	15,806	1,665,960	15.3 MB
142	vi	Vietnamese	33,933,994	22,424,984,210	140.8 GB
143	vo	Volapük	896	49,968	371.9 kB
144	wa	Walloon	390	6,347	34.3 kB
145	war	Waray	1,494	19,665	126.8 kB
146	cy	Welsh	151,512	52,250,043	333.0 MB
147	fy	Western Frisian	45,458	9,885,788	70.4 MB
148	mrj	Western Mari	496	60,180	765.8 kB
149	pnb	Western Panjabi	12,904	11,844,695	105.8 MB
150	wuu	Wu Chinese	136	1,199	26.8 kB
151	yi	Yiddish	47,438	14,287,370	171.7 MB
152	yo	Yoruba	128	2,396	16.6 kB

Dataset Creation

Curation Rationale

OSCAR was constructed using Ungoliant, a new pipeline derived from goclassy, which was itself derived from the fastText pipeline.

The pipeline works on documents rather than lines. Ungoliant is implemented in the Rust programming language and uses rayon for data parallelism. Threading is done at the shard, record and sentence levels, which makes the whole generation process much more efficient.

Filtering will be explained in a future blog post on our website.

Source Data

Initial Data Collection and Normalization

Common Crawl is a non-profit foundation which produces and maintains an open repository of web crawled data that is both accessible and analysable. Common Crawl's complete web archive consists of petabytes of data collected over 8 years of web crawling. The repository contains raw web page HTML data (WARC files), metadata extracts (WAT files) and plain text extracts (WET files). The organisation's crawlers have always respected nofollow and robots.txt policies.

Each monthly Common Crawl snapshot is in itself a massive multilingual corpus, where every single file contains data coming from multiple web pages written in a large variety of languages and covering all possible types of topics.

To construct OSCAR, the WET files of Common Crawl were used. These contain the plain text extracted from the websites, mostly converted to UTF-8, as well as headers containing the metadata of each crawled document. Each WET file comes compressed in gzip format and is stored on Amazon Web Services. In the case of OSCAR 22.01, the November/December 2021 snapshot was used. It is composed of 64,000 compressed text files containing documents and their headers.
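
To make the structure of a WET file more concrete, here is a minimal sketch of a reader that uses only the Python standard library. It is not the Ungoliant implementation. It assumes the usual WARC record layout: a `WARC/` version line, `Name: value` header lines, a blank line, and then a body whose size in bytes is given by the `Content-Length` header. The function name `iter_wet_records` is hypothetical.

```python
import gzip
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_wet_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], str]]:
    """Yield (headers, text) for each record in a decompressed WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The body is read by length, so its content is never mistaken for headers.
        length = int(headers.get("content-length", "0"))
        body = stream.read(length)
        yield headers, body.decode("utf-8", errors="replace")


def read_wet_bytes(data: bytes) -> Iterator[Tuple[Dict[str, str], str]]:
    """Decompress gzip-compressed WET data held in memory and iterate its records."""
    with gzip.open(io.BytesIO(data), "rb") as stream:
        yield from iter_wet_records(stream)
```

For a file on disk, `gzip.open(path, "rb")` can be passed to `iter_wet_records` directly. Python's `gzip` module reads files made of several concatenated gzip members transparently, so such files need no special handling here.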

Who are the source language producers?

The data comes from multiple web pages in a large variety of languages.

Annotations

The dataset does not contain any additional annotations.

Annotation process

N/A

Who are the annotators?

N/A

Personal and Sensitive Information

Because OSCAR is constructed from Common Crawl, personal and sensitive information might be present. This must be considered before training deep learning models with OSCAR, especially in the case of text-generation models.

Considerations for Using the Data

Social Impact of Dataset

OSCAR is intended to bring more data to a wide variety of languages. The aim of the corpus is to make large amounts of data available to lower-resource languages in order to facilitate the pre-training of state-of-the-art language modeling architectures.

Discussion of Biases

OSCAR is not properly filtered yet, and this can be reflected in the models trained with it. Care is advised, especially concerning the biases of the resulting models.

Other Known Limitations

The fastText linear classifier is limited both in performance and in the variety of languages it can recognize, so the quality of some OSCAR sub-corpora might be lower than expected, especially for the lowest-resource languages. Some audits have already been done by third parties.

Additional Information

Dataset Curators

This release of OSCAR was made possible by Julien Abadji, Pedro Ortiz Suarez, Rua Ismail, Sotaro Takeshita, Sebastian Nagel and Benoit Sagot.

Licensing Information

These data are released under the following licensing scheme:
We do not own any of the text from which these data have been extracted.
We license the actual packaging, the metadata and the annotations of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/
To the extent possible under law, the OSCAR project, Inria, the University of Mannheim and DFKI GmbH have waived all copyright and related or neighboring rights to OSCAR.
This work is published from: France and Germany.

Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
* Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
* Clearly identify the copyrighted work claimed to be infringed.
* Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

Citation Information

@ARTICLE{2022arXiv221210440J,
 author = {{Jansen}, Tim and {Tong}, Yangling and {Zevallos}, Victoria and {Ortiz Suarez}, Pedro},
 title = "{Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data}",
 journal = {arXiv e-prints},
 keywords = {Computer Science - Computation and Language},
 year = 2022,
 month = dec,
 eid = {arXiv:2212.10440},
 pages = {arXiv:2212.10440},
 doi = {10.48550/arXiv.2212.10440},
archivePrefix = {arXiv},
 eprint = {2212.10440},
 primaryClass = {cs.CL},
 adsurl = {https://ui.adsabs.harvard.edu/abs/2022arXiv221210440J},
 adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

@inproceedings{abadji-etal-2022-towards,
 title = "Towards a Cleaner Document-Oriented Multilingual Crawled Corpus",
 author = "Abadji, Julien and
 Ortiz Suarez, Pedro and
 Romary, Laurent and
 Sagot, Beno{\^\i}t",
 booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
 month = jun,
 year = "2022",
 address = "Marseille, France",
 publisher = "European Language Resources Association",
 url = "https://aclanthology.org/2022.lrec-1.463",
 pages = "4344--4355",
 abstract = "The need for large corpora raw corpora has dramatically increased in recent years with the introduction of transfer learning and semi-supervised learning methods to Natural Language Processing. And while there have been some recent attempts to manually curate the amount of data necessary to train large language models, the main way to obtain this data is still through automatic web crawling. In this paper we take the existing multilingual web corpus OSCAR and its pipeline Ungoliant that extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and automatic annotations in order to produce a new document-oriented version of OSCAR that could prove more suitable to pre-train large generative language models as well as hopefully other applications in Natural Language Processing and Digital Humanities.",
}

 
@inproceedings{AbadjiOrtizSuarezRomaryetal.2021,
 author = {Julien Abadji and Pedro Javier Ortiz Su{\'a}rez and Laurent Romary and Beno{\^i}t Sagot},
 title = {Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus},
 series = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-9) 2021. Limerick, 12 July 2021 (Online-Event)},
 editor = {Harald L{\"u}ngen and Marc Kupietz and Piotr Bański and Adrien Barbaresi and Simon Clematide and Ines Pisetta},
 publisher = {Leibniz-Institut f{\"u}r Deutsche Sprache},
 address = {Mannheim},
 doi = {10.14618/ids-pub-10468},
 url = {https://nbn-resolving.org/urn:nbn:de:bsz:mh39-104688},
 pages = {1 -- 9},
 year = {2021},
 abstract = {Since the introduction of large language models in Natural Language Processing, large raw corpora have played a crucial role in Computational Linguistics. However, most of these large raw corpora are either available only for English or not available to the general public due to copyright issues. Nevertheless, there are some examples of freely available multilingual corpora for training Deep Learning NLP models, such as the OSCAR and Paracrawl corpora. However, they have quality issues, especially for low-resource languages. Moreover, recreating or updating these corpora is very complex. In this work, we try to reproduce and improve the goclassy pipeline used to create the OSCAR corpus. We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data. Also, unlike OSCAR, the metadata information is at the document level. We release our pipeline under an open source license and publish the corpus under a research-only license.},
 language = {en}
}

@article{kreutzer-etal-2022-quality,
 title = "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets",
 author = {Kreutzer, Julia and
 Caswell, Isaac and
 Wang, Lisa and
 Wahab, Ahsan and
 van Esch, Daan and
 Ulzii-Orshikh, Nasanbayar and
 Tapo, Allahsera and
 Subramani, Nishant and
 Sokolov, Artem and
 Sikasote, Claytone and
 Setyawan, Monang and
 Sarin, Supheakmungkol and
 Samb, Sokhar and
 Sagot, Beno{\^\i}t and
 Rivera, Clara and
 Rios, Annette and
 Papadimitriou, Isabel and
 Osei, Salomey and
 Suarez, Pedro Ortiz and
 Orife, Iroro and
 Ogueji, Kelechi and
 Rubungo, Andre Niyongabo and
 Nguyen, Toan Q. and
 M{\"u}ller, Mathias and
 M{\"u}ller, Andr{\'e} and
 Muhammad, Shamsuddeen Hassan and
 Muhammad, Nanda and
 Mnyakeni, Ayanda and
 Mirzakhalov, Jamshidbek and
 Matangira, Tapiwanashe and
 Leong, Colin and
 Lawson, Nze and
 Kudugunta, Sneha and
 Jernite, Yacine and
 Jenny, Mathias and
 Firat, Orhan and
 Dossou, Bonaventure F. P. and
 Dlamini, Sakhile and
 de Silva, Nisansa and
 {\c{C}}abuk Ball{\i}, Sakine and
 Biderman, Stella and
 Battisti, Alessia and
 Baruwa, Ahmed and
 Bapna, Ankur and
 Baljekar, Pallavi and
 Azime, Israel Abebe and
 Awokoya, Ayodele and
 Ataman, Duygu and
 Ahia, Orevaoghene and
 Ahia, Oghenefego and
 Agrawal, Sweta and
 Adeyemi, Mofetoluwa},
 journal = "Transactions of the Association for Computational Linguistics",
 volume = "10",
 year = "2022",
 address = "Cambridge, MA",
 publisher = "MIT Press",
 url = "https://aclanthology.org/2022.tacl-1.4",
 doi = "10.1162/tacl_a_00447",
 pages = "50--72",
 abstract = "With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50{\%} sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.",
}

@inproceedings{ortiz-suarez-etal-2020-monolingual,
 title = "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages",
 author = "Ortiz Su{\'a}rez, Pedro Javier and
 Romary, Laurent and
 Sagot, Benoit",
 booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
 month = jul,
 year = "2020",
 address = "Online",
 publisher = "Association for Computational Linguistics",
 url = "https://www.aclweb.org/anthology/2020.acl-main.156",
 pages = "1703--1714",
 abstract = "We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.",
}

@inproceedings{OrtizSuarezSagotRomary2019,
 author = {Pedro Javier {Ortiz Su{\'a}rez} and Benoit Sagot and Laurent Romary},
 title = {Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures},
 series = {Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019},
 editor = {Piotr Bański and Adrien Barbaresi and Hanno Biber and Evelyn Breiteneder and Simon Clematide and Marc Kupietz and Harald L{\"u}ngen and Caroline Iliadi},
 publisher = {Leibniz-Institut f{\"u}r Deutsche Sprache},
 address = {Mannheim},
 doi = {10.14618/ids-pub-9021},
 url = {http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215},
 pages = {9 -- 16},
 year = {2019},
 abstract = {Common Crawl is a considerably large, heterogeneous multilingual corpus comprised of crawled documents from the internet, surpassing 20TB of data and distributed as a set of more than 50 thousand plain text files where each contains many documents written in a wide variety of languages. Even though each document has a metadata block associated to it, this data lacks any information about the language in which each document is written, making it extremely difficult to use Common Crawl for monolingual applications. We propose a general, highly parallel, multithreaded pipeline to clean and classify Common Crawl by language; we specifically design it so that it runs efficiently on medium to low resource infrastructures where I/O speeds are the main constraint. We develop the pipeline so that it can be easily reapplied to any kind of heterogeneous corpus and so that it can be parameterised to a wide range of infrastructures. We also distribute a 6.3TB version of Common Crawl, filtered, classified by language, shuffled at line level in order to avoid copyright issues, and ready to be used for NLP applications.},
 language = {en}
}