OSCAR 2019

OSCAR 2019 is the original 2019 release of the OSCAR corpus. It has been generated from the Common Crawl corpus using the goclassy architecture.

Features

OSCAR 2019 is shuffled at line level and ships without metadata; it is therefore mainly intended for training unsupervised language models for NLP.

Data is distributed by language in both original and deduplicated form.

If you need the unshuffled version of OSCAR, please contact us using the contact form. Please include your name, affiliation, contact details, which languages you need, and a brief description of how you intend to use OSCAR. You can also download it using HuggingFace’s datasets library.

Even though OSCAR is not Postcardware, we do appreciate when our users send us a postcard. If you want to send us one, you can find the address in the contact section down below.

Citing OSCAR

If you use OSCAR to train a language model, text generation model, or any other ML model, please consider citing our latest paper:

@inproceedings{ortiz-suarez-etal-2020-monolingual,
    title = "A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages",
    author = "Ortiz Su{\'a}rez, Pedro Javier  and
      Romary, Laurent  and
      Sagot, Beno{\^\i}t",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.156",
    pages = "1703--1714",
    abstract = "We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.",
}

The Unshuffled OSCAR

If you need a copy of any of the unshuffled sub-corpora, please contact us using the contact form down below. Please include your name, affiliation, contact details, which languages you need, and a brief description of how you intend to use OSCAR. We will evaluate your request and answer accordingly.

The unshuffled OSCAR is now available in HuggingFace’s datasets library.

They have obtained our permission to redistribute the unshuffled OSCAR and they allow users to download a corpus all at once as opposed to file by file. You can get more information about how to download OSCAR using their library by visiting OSCAR’s dataset card.
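As a sketch of what loading a sub-corpus through the datasets library looks like: the config-name pattern below ("unshuffled_original_<code>" / "unshuffled_deduplicated_<code>") matches the library's published OSCAR configs, but the small helper function is ours, added for illustration, and the actual download call is left commented out because it requires network access and substantial disk space.

```python
def oscar_config(lang_code: str, deduplicated: bool = True) -> str:
    """Build the `datasets` config name for an OSCAR 2019 sub-corpus."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{lang_code}"


if __name__ == "__main__":
    config = oscar_config("af")  # -> "unshuffled_deduplicated_af"
    print(config)

    # Downloading requires network access and disk space, so it is
    # left commented out here:
    # from datasets import load_dataset
    # dataset = load_dataset("oscar", config, split="train")
    # print(dataset[0]["text"])
```

Refer to OSCAR’s dataset card for the authoritative list of available configs and their sizes.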

Downloading OSCAR

All the data is distributed by language; both the original and the deduplicated versions are available. To download a file, just click the desired link in the table below. Languages are split into standalone shards of around 700MB. A plain text file with checksums is also provided.
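The checksum file can be used to verify that a downloaded shard arrived intact. A minimal sketch, assuming SHA-256 checksums and using placeholder filenames (substitute the real shard and checksum filenames from the download table):

```shell
# Simulated example: create a stand-in "shard", record its checksum, verify it.
# In practice you would download a real shard plus the provided checksum file;
# the filenames here are placeholders, and the algorithm (SHA-256) is assumed.
echo "stand-in shard contents" > af_part_1.txt
sha256sum af_part_1.txt > checksums.sha256

# Verification prints "<filename>: OK" and exits 0 when the file is intact.
sha256sum -c checksums.sha256
```

If the tool reports FAILED for a shard, re-download it before use.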

The OSCAR corpus has not yet been filtered, so please be careful when using it, especially for text generation tasks! To see which sub-corpora have been audited, please refer to the list of publications above.

You’ll be asked to create a HumanID account in order to download a corpus. This is intentional; we do it to limit traffic and reduce abuse of the infrastructure. The OSCAR corpus is hosted by Huma-Num; you can read more about them on their website.

All sizes are for the uncompressed files.

| Language | Words (original) | Size (original) | File (original) | Words (deduplicated) | Size (deduplicated) | File (deduplicated) |
|---|---|---|---|---|---|---|
| Afrikaans | 43,482,801 | 241M | af | 29,533,437 | 163M | af |
| Albanian | 374,196,110 | 2.3G | sq | 186,856,699 | 1.2G | sq |
| Alemannic | 841,750 | 5.0M | als | 459,001 | 2.8M | als |
| Amharic | 28,301,601 | 360M | am | 16,086,628 | 206M | am |
| Arabic | 8,117,162,828 | 82G | ar | 3,171,221,354 | 32G | ar |
| Aragonese | 52,896 | 1.3M | an | 45,669 | 801K | an |
| Armenian | 273,919,388 | 3.7G | hy | 110,196,043 | 1.5G | hy |
| Assamese | 6,956,663 | 113M | as | 4,366,570 | 71M | as |
| Asturian | 381,005 | 2.4M | ast | 325,237 | 2.0M | ast |
| Avaric | 24,720 | 409K | av | 19,478 | 324K | av |
| Azerbaijani | 322,641,710 | 2.8G | az | 167,742,296 | 1.5G | az |
| Bashkir | 9,796,764 | 128M | ba | 6,922,589 | 90M | ba |
| Basque | 120,456,652 | 848M | eu | 45,359,710 | 342M | eu |
| Bavarian | 399 | 503 | bar | 399 | 503 | bar |
| Belarusian | 144,579,630 | 1.8G | be | 83,499,037 | 1.1G | be |
| Bengali | 623,575,733 | 11G | bn | 363,766,143 | 5.8G | bn |
| Bihari | 8,848 | 110K | bh | 2,875 | 34K | bh |
| Bishnupriya | 198,286 | 4.1M | bpy | 96,940 | 1.7M | bpy |
| Bosnian | 106,448 | 447K | bs | 20,485 | 116K | bs |
| Breton | 5,013,241 | 29M | br | 2,890,384 | 16M | br |
| Bulgarian | 2,947,648,106 | 32G | bg | 1,268,114,977 | 14G | bg |
| Burmese | 56,111,184 | 1.9G | my | 30,102,173 | 1.1G | my |
| Catalan | 1,360,212,450 | 8.0G | ca | 729,333,440 | 4.3G | ca |
| Cebuano | 6,603,567 | 39M | ceb | 3,675,024 | 24M | ceb |
| Central Bikol | 312 | 885 | bcl | 312 | 885 | bcl |
| Central Khmer | 20,690,610 | 1.1G | km | 10,082,245 | 581M | km |
| Central Kurdish | 48,478,334 | 487M | ckb | 18,726,721 | 226M | ckb |
| Chavacano | 130 | 520 | cbk | 130 | 520 | cbk |
| Chechen | 711,051 | 8.3M | ce | 568,146 | 6.7M | ce |
| Chinese | 14,986,424,850 | 508G | zh | 6,350,215,113 | 249G | zh |
| Chuvash | 3,041,614 | 39M | cv | 2,054,810 | 26M | cv |
| Cornish | 8,329 | 44K | kw | 2,704 | 14K | kw |
| Croatian | 34,232,765 | 226M | hr | 16,727,640 | 110M | hr |
| Czech | 7,715,977,441 | 53G | cs | 3,540,997,509 | 24G | cs |
| Danish | 2,637,463,889 | 16G | da | 1,620,091,317 | 9.5G | da |
| Dhivehi | 7,559,472 | 126M | dv | 4,726,660 | 79M | dv |
| Dimli | 19 | 146 | diq | 19 | 146 | diq |
| Dutch | 13,020,136,373 | 78G | nl | 6,598,786,137 | 39G | nl |
| Eastern Mari | 565,992 | 7.2M | mhr | 469,297 | 6.0M | mhr |
| Egyptian Arabic | 7,305,151 | 66M | arz | 3,659,419 | 33M | arz |
| Emilian-Romagnol | 6,376 | 25K | eml | 6,121 | 24K | eml |
| English | 418,187,793,408 | 2.3T | en | 215,841,256,971 | 1.2T | en |
| Erzya | 90 | 1.4K | myv | 78 | 1.2K | myv |
| Esperanto | 48,486,161 | 299M | eo | 37,324,446 | 228M | eo |
| Estonian | 643,163,730 | 4.8G | et | 309,931,463 | 2.3G | et |
| Finnish | 3,196,666,419 | 27G | fi | 1,597,855,468 | 13G | fi |
| French | 46,896,036,417 | 282G | fr | 23,206,776,649 | 138G | fr |
| Galician | 102,011,291 | 620M | gl | 63,600,602 | 384M | gl |
| Georgian | 171,950,621 | 3.6G | ka | 91,569,739 | 1.9G | ka |
| German | 44,878,908,446 | 308G | de | 21,529,164,172 | 145G | de |
| Goan Konkani | 124,277 | 2.2M | gom | 102,306 | 1.8M | gom |
| Guarani | 7,382 | 36K | gn | 4,680 | 24K | gn |
| Gujarati | 72,045,701 | 1.1G | gu | 50,023,432 | 722M | gu |
| Haitian | 1,014 | 3.9K | ht | 832 | 3.3K | ht |
| Hebrew | 2,067,753,528 | 20G | he | 1,032,018,056 | 9.8G | he |
| Hindi | 1,372,234,782 | 17G | hi | 745,774,934 | 8.9G | hi |
| Hungarian | 5,163,936,345 | 40G | hu | 2,339,127,555 | 18G | hu |
| Icelandic | 219,900,094 | 1.5G | is | 129,818,331 | 846M | is |
| Ido | 25,702 | 147K | io | 22,773 | 130K | io |
| Iloko | 142,942 | 874K | ilo | 105,564 | 636K | ilo |
| Indonesian | 4,574,692,265 | 30G | id | 2,394,957,629 | 16G | id |
| Interlingua | 180,231 | 662K | ia | 100,019 | 360K | ia |
| Interlingue | 5,352 | 24K | ie | 602 | 1.6K | ie |
| Irish | 14,483,593 | 88M | ga | 10,017,303 | 60M | ga |
| Italian | 22,248,707,341 | 137G | it | 11,250,012,896 | 69G | it |
| Japanese | 4,962,979,182 | 216G | ja | 1,123,067,063 | 106G | ja |
| Javanese | 104,896 | 659K | jv | 86,654 | 583K | jv |
| Kalmyk | 10,277 | 113K | xal | 10,155 | 112K | xal |
| Kannada | 81,186,863 | 1.7G | kn | 49,343,462 | 1.1G | kn |
| Karachay-Balkar | 185,436 | 2.6M | krc | 166,496 | 2.3M | krc |
| Kazakh | 191,126,469 | 2.7G | kk | 108,388,743 | 1.5G | kk |
| Kirghiz | 44,194,823 | 600M | ky | 28,982,620 | 388M | ky |
| Komi | 201,404 | 2.3M | kv | 95,243 | 1.2M | kv |
| Korean | 2,368,765,142 | 24G | ko | 1,120,375,149 | 12G | ko |
| Kurdish | 15,561,003 | 94M | ku | 9,946,440 | 60M | ku |
| Lao | 4,133,311 | 174M | lo | 2,583,342 | 114M | lo |
| Latin | 4,122,201 | 26M | la | 1,328,038 | 8.3M | la |
| Latvian | 520,761,977 | 4.0G | lv | 236,428,905 | 1.8G | lv |
| Lezghian | 247,646 | 3.3M | lez | 224,871 | 3.0M | lez |
| Limburgan | 4,730 | 29K | li | 4,283 | 27K | li |
| Lithuanian | 1,159,661,742 | 8.8G | lt | 516,183,525 | 3.9G | lt |
| Lojban | 154,330 | 736K | jbo | 141,973 | 678K | jbo |
| Lombard | 75,229 | 443K | lmo | 73,665 | 433K | lmo |
| Low German | 2,906,347 | 18M | nds | 2,146,417 | 13M | nds |
| Lower Sorbian | 1,787 | 13K | dsb | 966 | 7.1K | dsb |
| Luxembourgish | 4,403,577 | 29M | lb | 3,087,650 | 21M | lb |
| Macedonian | 189,289,873 | 2.1G | mk | 102,849,595 | 1.2G | mk |
| Maithili | 69,161 | 317K | mai | 874 | 11K | mai |
| Malagasy | 3,068,360 | 21M | mg | 1,872,044 | 13M | mg |
| Malay | 16,696,882 | 111M | ms | 6,045,753 | 42M | ms |
| Malayalam | 189,534,472 | 4.9G | ml | 95,892,551 | 2.5G | ml |
| Maltese | 2,995,654 | 24M | mt | 2,163,358 | 17M | mt |
| Marathi | 162,609,404 | 2.7G | mr | 82,130,803 | 1.4G | mr |
| Mazanderani | 73,870 | 691K | mzn | 64,481 | 602K | mzn |
| Minangkabau | 5,682 | 608K | min | 4,825 | 310K | min |
| Mingrelian | 299,098 | 5.8M | xmf | 228,629 | 4.4M | xmf |
| Mirandese | 171 | 1.2K | mwl | 152 | 1.1K | mwl |
| Modern Greek | 5,479,180,137 | 62G | el | 2,412,419,435 | 27G | el |
| Mongolian | 181,307,167 | 2.2G | mn | 68,362,013 | 838M | mn |
| Nahuatl languages | 1,234 | 12K | nah | 1,193 | 11K | nah |
| Neapolitan | 5,282 | 17K | nap | 4,147 | 13K | nap |
| Nepali | 107,448,208 | 1.8G | ne | 71,628,317 | 1.2G | ne |
| Newari | 564,697 | 5.5M | new | 288,995 | 4.1M | new |
| Northern Frisian | 1,516 | 4.4K | frr | 1,516 | 4.4K | frr |
| Northern Luri | 8,022 | 76K | lrc | 6,740 | 63K | lrc |
| Norwegian | 1,344,326,388 | 8.0G | no | 804,894,377 | 4.7G | no |
| Norwegian Nynorsk | 14,764,980 | 85M | nn | 9,435,139 | 54M | nn |
| Occitan | 750,301 | 5.8M | oc | 512,678 | 3.7M | oc |
| Oriya | 14,938,567 | 248M | or | 11,321,740 | 188M | or |
| Ossetian | 1,031,268 | 13M | os | 878,765 | 11M | os |
| Pampanga | 130 | 760 | pam | 52 | 304 | pam |
| Panjabi | 61,847,806 | 763M | pa | 37,555,835 | 460M | pa |
| Persian | 9,096,554,121 | 79G | fa | 4,363,505,319 | 38G | fa |
| Piemontese | 362,013 | 2.1M | pms | 337,246 | 1.9M | pms |
| Polish | 15,277,255,137 | 109G | pl | 6,708,709,674 | 47G | pl |
| Portuguese | 20,641,903,898 | 124G | pt | 10,751,156,918 | 64G | pt |
| Pushto | 46,559,441 | 361M | ps | 31,347,348 | 242M | ps |
| Quechua | 10,186 | 78K | qu | 8,691 | 67K | qu |
| Romanian | 3,984,317,058 | 25G | ro | 1,741,794,069 | 11G | ro |
| Romansh | 1,093 | 7.4K | rm | 960 | 6.5K | rm |
| Russia Buriat | 963 | 13K | bxr | 809 | 11K | bxr |
| Russian | 92,522,407,837 | 1.2T | ru | 46,692,691,520 | 568G | ru |
| Sanskrit | 4,331,569 | 93M | sa | 1,713,930 | 37M | sa |
| Scottish Gaelic | 310,689 | 1.9M | gd | 207,110 | 1.3M | gd |
| Serbian | 364,395,411 | 3.9G | sr | 207,561,168 | 2.2G | sr |
| Serbo-Croatian | 5,292,184 | 25M | sh | 1,040,573 | 5.8M | sh |
| Sicilian | 554 | 3.3K | scn | 468 | 2.8K | scn |
| Sindhi | 43,530,158 | 347M | sd | 33,028,015 | 263M | sd |
| Sinhala | 93,053,465 | 1.4G | si | 50,864,857 | 802M | si |
| Slovak | 1,322,247,763 | 9.1G | sk | 656,346,179 | 4.5G | sk |
| Slovenian | 387,399,700 | 2.5G | sl | 193,926,684 | 1.3G | sl |
| Somali | 1,202 | 61K | so | 472 | 16K | so |
| South Azerbaijani | 2,175,054 | 27M | azb | 1,528,709 | 19M | azb |
| Spanish | 47,545,122,279 | 278G | es | 25,928,290,729 | 149G | es |
| Sundanese | 30,321 | 211K | su | 20,278 | 141K | su |
| Swahili | 2,211,927 | 13M | sw | 1,376,963 | 8.1M | sw |
| Swedish | 7,155,994,312 | 44G | sv | 4,106,120,608 | 25G | sv |
| Tagalog | 98,949,299 | 573M | tl | 70,121,601 | 407M | tl |
| Tajik | 31,758,142 | 379M | tg | 21,029,893 | 249M | tg |
| Tamil | 420,537,132 | 9.3G | ta | 226,013,330 | 5.1G | ta |
| Tatar | 51,034,893 | 670M | tt | 23,825,695 | 305M | tt |
| Telugu | 123,711,517 | 2.5G | te | 79,094,167 | 1.6G | te |
| Thai | 951,743,087 | 36G | th | 368,965,202 | 16G | th |
| Tibetan | 1,483,589 | 187M | bo | 936,556 | 138M | bo |
| Turkish | 7,577,388,700 | 60G | tr | 3,365,734,289 | 27G | tr |
| Turkmen | 1,113,869 | 11M | tk | 752,326 | 6.8M | tk |
| Tuvinian | 759 | 12K | tyv | 540 | 7.9K | tyv |
| Uighur | 8,657,141 | 122M | ug | 5,852,225 | 83M | ug |
| Ukrainian | 4,204,381,276 | 53G | uk | 2,252,380,351 | 28G | uk |
| Upper Sorbian | 545,351 | 4.2M | hsb | 236,867 | 1.8M | hsb |
| Urdu | 331,817,982 | 2.7G | ur | 218,030,228 | 1.7G | ur |
| Uzbek | 2,450,256 | 21M | uz | 1,381,644 | 12M | uz |
| Venetian | 3,492 | 18K | vec | 3,199 | 17K | vec |
| Vietnamese | 12,036,845,359 | 68G | vi | 5,577,159,843 | 32G | vi |
| Volapük | 321,121 | 2.0M | vo | 318,568 | 2.0M | vo |
| Walloon | 50,720 | 273K | wa | 37,543 | 203K | wa |
| Waray | 397,315 | 2.5M | war | 336,311 | 2.2M | war |
| Welsh | 37,422,441 | 213M | cy | 23,574,673 | 133M | cy |
| Western Frisian | 5,691,077 | 35M | fy | 4,223,816 | 26M | fy |
| Western Mari | 93,338 | 1.2M | mrj | 87,780 | 1.1M | mrj |
| Western Panjabi | 1,426,986 | 12M | pnb | 1,111,112 | 9.0M | pnb |
| Wu Chinese | 11,189 | 109K | wuu | 4,333 | 32K | wuu |
| Yakut | 2,547,623 | 42M | sah | 1,789,174 | 26M | sah |
| Yiddish | 13,834,320 | 141M | yi | 8,212,970 | 84M | yi |
| Yoruba | 8,906 | 55K | yo | 3,518 | 27K | yo |
| Yue Chinese | 186 | 3.7K | yue | 128 | 2.2K | yue |

License

These data are released under this licensing scheme:

  • We do not own any of the text from which these data have been extracted.
  • We license the actual packaging of these data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR.
  • This work is published from: France.

CC0

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • And use the contact form below.

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.

Models

Here is a list of some language models that have been trained using the OSCAR corpus or that are part of the OSCAR project:

| Model | Language | Corpus | Authors | Paper | Files | License |
|---|---|---|---|---|---|---|
| ELMo | Bulgarian | OSCAR | Pedro J. Ortiz, Benoît Sagot and Laurent Romary | ACL 2020 | bg.zip | MIT |
| ELMo | Bulgarian | Wikipedia | Pedro J. Ortiz, Benoît Sagot and Laurent Romary | ACL 2020 | bg.zip | MIT |
| ELMo | Catalan | OSCAR | Pedro J. Ortiz, Benoît Sagot and Laurent Romary | ACL 2020 | ca.zip | MIT |
| ELMo | Catalan | Wikipedia | Pedro J. Ortiz, Benoît Sagot and Laurent Romary | ACL 2020 | ca.zip | MIT |
| ELMo | Danish | OSCAR | Pedro J. Ortiz, Benoît Sagot and Laurent Romary | ACL 2020 | da.zip | MIT |
| ELMo | Danish | Wikipedia | Pedro J. Ortiz, Benoît Sagot and Laurent Romary | ACL 2020 | da.zip | MIT |
| ELMo | French | OSCAR | Pedro J. Ortiz, Yoann Dupont, Benjamin Muller, Laurent Romary and Benoît Sagot | LREC 2020 | fr.zip | MIT |
| ELMo | Finnish | OSCAR | Pedro J. Ortiz, Benoît Sagot and Laurent Romary | ACL 2020 | fi.zip | MIT |
| ELMo | Finnish | Wikipedia | Pedro J. Ortiz, Benoît Sagot and Laurent Romary | ACL 2020 | fi.zip | MIT |
| ELMo | Indonesian | OSCAR | Pedro J. Ortiz, Benoît Sagot and Laurent Romary | ACL 2020 | id.zip | MIT |
| ELMo | Indonesian | Wikipedia | Pedro J. Ortiz, Benoît Sagot and Laurent Romary | ACL 2020 | id.zip | MIT |

Here is a list of language models trained by the community:

| Model | Language | Cased | Corpus | Authors | Paper | Website | Files | License |
|---|---|---|---|---|---|---|---|---|
| AraBERT | Arabic | Cased | OSCAR, Wikipedia, 1.5B words Arabic Corpus, OSIAN, Assafir | Wissam Antoun, Fady Baly and Hazem Hajj | ACL Anthology | GitHub | Hugging Face | N/A |
| Arabic-BERT | Arabic | Cased | OSCAR and Wikipedia | Ali Safaya | ArXiv | GitHub | Hugging Face | MIT |
| AraELECTRA | Arabic | Cased | OSCAR, Wikipedia, 1.5B words Arabic Corpus, OSIAN, Assafir | Wissam Antoun, Fady Baly and Hazem Hajj | ArXiv | GitHub | Hugging Face | N/A |
| AraGPT2 | Arabic | Cased | OSCAR, Wikipedia, 1.5B words Arabic Corpus, OSIAN, Assafir | Wissam Antoun, Fady Baly and Hazem Hajj | ArXiv | GitHub | Hugging Face | N/A |
| CamemBERT | French | Cased | OSCAR | Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot | ACL 2020 | camembert-model.fr | camembert-base.tar.gz | MIT |
| CamemBERT | French | Cased | Subsample of OSCAR (4 GB of text) | Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot | ACL 2020 | camembert-model.fr | camembert-base-oscar-4gb.tar.gz | MIT |
| LePetit | French | Cased | Subsample of OSCAR (2 GB of text) | Vincent Micheli, Martin d’Hoffschmidt, Quentin Heinrich | Medium blog | illuin.tech | Hugging Face | MIT |
| GigaBERT | Arabic | Cased and Uncased | OSCAR, Wikipedia, Gigaword | Wuwei Lan, Yang Chen, Wei Xu, Alan Ritter | EMNLP 2020 | GitHub | Hugging Face | MIT |
| ELECTRA | Norwegian | Cased | OSCAR and OPUS | Viktor Alm | N/A | Hugging Face | Hugging Face | N/A |
| BERT | Romanian | Cased | OSCAR, Wikipedia and OPUS | Dumitrescu Stefan and Andrei Avram | SOON | GitHub | Hugging Face | MIT |
| BERT | Romanian | Uncased | OSCAR, Wikipedia and OPUS | Dumitrescu Stefan and Andrei Avram | SOON | GitHub | Hugging Face | MIT |
| RoBERTa | Sinhala | N/A | OSCAR | Keshan Sodimana | N/A | Hugging Face | Hugging Face | N/A |
| BERT | Turkish | Cased and Uncased | OSCAR, Wikipedia and OPUS | Stefan Schweter | Zenodo | GitHub | Hugging Face | MIT |
| ELECTRA | Turkish | Cased | OSCAR, Wikipedia and OPUS | Stefan Schweter | Zenodo | GitHub | Hugging Face | MIT |
| XLMIndic | Hindi, Bengali, Gujarati, Panjabi, Marathi, Oriya, Assamese, Sinhala, Nepali, Bihari, Bishnupriya, Maithili, Goan Konkani, Sanskrit | Cased | OSCAR | Ibraheem Muhammad Moosa, Mahmud Shimul and Ashfia Binte Habib | ArXiv | GitHub | Hugging Face | MIT |

If you have trained a model using the OSCAR corpus and would like to have it featured here, please open a pull request in our GitHub repo. Help us grow the community!

Pedro Ortiz Suarez
Postdoctoral Researcher

I’m a postdoctoral researcher at the Data and Web Science Group at the University of Mannheim.