OSCAR 22.01

The January 2022 version of the OSCAR Corpus.

To get access to the new corpus, please send us an email at the address listed on our homepage, with “OSCAR Access Request” as the subject. We need the following information to create your account; missing information could delay your access by days:

  • First and last name:
  • Affiliation:
  • Contact details:
  • Corpus version (or all if you need multiple ones):
  • Languages:

✉️📚

Please do not create a new Huma-Num account yourself: this would delay your access to the corpus by weeks! We will create an account for you. 📆

OSCAR, or Open Super-large Crawled Aggregated coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using Ungoliant.

This is the third release of OSCAR. It is versioned using CalVer: 22.01 stands for January 2022.

Version info

These are the versions of the tooling, schema, and data:

  • CommonCrawl version: November/December 2021 (2021.49)
  • OSCAR Schema version: v2 (document-oriented, with text and metadata merged in JSONLines)
  • Ungoliant version: v1.1.0 (document-oriented pipeline, multilingual subcorpus, annotations)

Changes

  • The corpus is not backward compatible, as the metadata and textual content have been merged into JSON objects.
  • Document-oriented: Identifications are done per document, based on the proportion of content in a single language.
  • Annotations: Annotations are labels you can filter on, potentially enabling better quality control (sacrificing corpus size).
  • Per-line identification: Metadata hold line-level identification/confidence.
  • Multilingual subcorpus: A new 12 GB multilingual subcorpus whose documents contain sentences in several languages in significant proportions.
  • Removed languages: Bavarian, Chavacano, Northern Frisian, Manx, Haitian Creole, Interlingue, Northern Luri, Mirandese, Erzya, Neapolitan, Pampanga, Romansh, Rusyn, Scots, Tuvinian, Venetian, West Flemish.
  • No deduplicated corpus: Since the subcorpora are document oriented, doing a subcorpus-wide deduplication would break documents’ integrity. We may consult the community and provide tools for deduplication later on.

Data layout

The corpus is now distributed in JSONLines format, which means that each line represents a single document, encoded in a JSON object.

A text extraction utility is being added to the oscar-tools binary, enabling the creation of plain-text datasets from JSONL subcorpora.


{
  "content":"newline\nseparaaaaaaaaaaated\ncontent", //content itself
  // Headers from the crawler
  // Note that nothing is changed, so the content length may be incorrect.
  "warc_headers":{ 
    "warc-refers-to":"<urn:uuid:83f2e1d4-5ed3-41db-86ff-f7826c4c20f9>",
    "warc-date":"2021-09-16T11:07:14Z",
    "warc-block-digest":"sha1:X3OWP47FG2O5LBNMFSNB44FJF2SSRC26",
    "warc-type":"conversion",
    "warc-identified-content-language":"eng",
    "content-length":"1694",
    "warc-target-uri":"https://foo.bar",
    "warc-record-id":"<urn:uuid:3304bc27-17d0-4ffd-a692-340381478a5f>",
    "content-type":"text/plain"
  },
  "metadata":{
    // Document-wide identification.
    // The "prob" is the weighted average of the identified lines.
    "identification":{
      "label":"en",
      "prob":0.6268374
    },

    // Annotations. Can be null if no annotations have been added.
    "annotation":[
      "short_sentences",
      "footer"
    ],

    // Line-by-line identifications
    // Can have null values for lines that did not get an identification.
    "sentence_identifications":[
      {
        "label":"en",
        "prob":0.93925816
      },
      null,
      {
        "label":"en",
        "prob":0.9606543
      }
    ]
  }
}
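Until that utility is available, the layout above can be read with a few lines of Python. This is a minimal sketch, not the oscar-tools implementation: the helper names `iter_documents` and `to_plain_text` are hypothetical, and the sample values are made up. Note that the `//` comments in the example above are for illustration only; a standard JSON parser such as `json.loads` would reject them.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed document per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_plain_text(doc: Dict) -> str:
    """Return the newline-separated text of a document."""
    return doc["content"]


# A one-document sample shaped like the example above (values are made up).
sample = [
    json.dumps({
        "content": "first line\nsecond line",
        "warc_headers": {"warc-target-uri": "https://foo.bar"},
        "metadata": {
            "identification": {"label": "en", "prob": 0.9},
            "annotation": None,
            "sentence_identifications": [{"label": "en", "prob": 0.9}, None],
        },
    })
]

docs = list(iter_documents(sample))
texts = [to_plain_text(doc) for doc in docs]
```

With a real subcorpus, `sample` would be replaced by an open file handle, since iterating over a text file yields one line (one document) at a time.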

Annotations

As a first step towards a better filtering of the corpus, we introduce annotations as tools to filter the corpus depending on users’ criteria.

These annotations are imperfect and we will work on improving their usefulness.

  • tiny: The document has a low (<5) number of lines.
  • short_sentences: The document has a high proportion (>50%) of short lines (<400 bytes).
  • header: The document has a high number of short lines at its head, suggesting the presence of low-quality content.
  • footer: The document has a high number of short lines at its tail, suggesting the presence of low-quality content.
  • noisy: The document has a high percentage of punctuation (>50%).
  • adult: The document contains adult content. This annotation uses a blocklist and labels a tiny part of the corpus: It does not catch most of the adult content.
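To make the thresholds concrete, here is a rough Python sketch of three of the annotations listed above (tiny, short_sentences, noisy). It is not the Ungoliant implementation: the function name `sketch_annotations` is hypothetical, and measuring punctuation as the share of ASCII punctuation characters is an assumption, since the list does not say how the percentage is computed.

```python
import string
from typing import List

TINY_LINE_COUNT = 5      # fewer lines than this: "tiny"
SHORT_LINE_BYTES = 400   # lines shorter than this count as "short"


def sketch_annotations(content: str) -> List[str]:
    """Approximate the tiny / short_sentences / noisy labels for one document."""
    lines = content.split("\n")  # always at least one element
    labels = []

    if len(lines) < TINY_LINE_COUNT:
        labels.append("tiny")

    short = sum(1 for line in lines if len(line.encode("utf-8")) < SHORT_LINE_BYTES)
    if short / len(lines) > 0.5:
        labels.append("short_sentences")

    # Assumption: punctuation share is counted per character, ASCII punctuation only.
    punctuation = sum(1 for ch in content if ch in string.punctuation)
    if content and punctuation / len(content) > 0.5:
        labels.append("noisy")

    return labels


short_doc = sketch_annotations("a\nb")
noisy_doc = sketch_annotations("!!!?")
long_doc = sketch_annotations("\n".join(["x" * 500] * 6))
```

The header, footer, and adult annotations are left out because the list above does not give enough detail to sketch them.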

More information about the thresholds and annotators is available in our paper.
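Filtering on annotations could look like the following sketch, which drops every document carrying at least one unwanted label. The function name `without_annotations` and the sample documents are hypothetical; the only things taken from the data layout are the `metadata.annotation` field and the fact that it can be null.

```python
from typing import Dict, Iterable, Iterator, Set


def without_annotations(docs: Iterable[Dict], unwanted: Set[str]) -> Iterator[Dict]:
    """Yield only the documents that carry none of the unwanted annotations."""
    for doc in docs:
        # "annotation" is null (None) when no annotation was added.
        annotations = doc["metadata"].get("annotation") or []
        if unwanted.isdisjoint(annotations):
            yield doc


# Made-up documents reduced to the fields used here.
sample_docs = [
    {"content": "a", "metadata": {"annotation": None}},
    {"content": "b", "metadata": {"annotation": ["short_sentences", "footer"]}},
    {"content": "c", "metadata": {"annotation": ["adult"]}},
]

kept = [d["content"] for d in without_annotations(sample_docs, {"adult", "noisy"})]
```

Which labels to reject depends on the use case: a stricter set improves quality at the cost of corpus size.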

About the absence of some low-resource languages

A number of low-resource languages have disappeared or shrunk dramatically. This is due to the new document-level language identification, which may overshadow low-resource languages that are linguistically close to higher-resourced ones.

As an example, Swiss German went from 5MB (21.09) to 360KB (22.01), but we found that the German (500GB) subcorpus contained around 30MB of Swiss German (without filtering out the sentences classified with less than 0.8 confidence).

It should be possible to rebuild or increase the size of low-resource corpora by looking into other corpora and filtering on line-level identifications.
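One way to do this is to walk through a larger subcorpus and keep only the lines identified as the target language. The sketch below assumes that `sentence_identifications` holds exactly one entry per line of `content`, as in the data layout example. The function name `lines_in_language`, the sample document, and the `"als"` label used for Swiss German are assumptions; check the labels actually present in the corpus before relying on them.

```python
from typing import Dict, List


def lines_in_language(doc: Dict, label: str, min_prob: float = 0.8) -> List[str]:
    """Return the lines of a document identified as `label` with enough confidence."""
    lines = doc["content"].split("\n")
    identifications = doc["metadata"]["sentence_identifications"]
    kept = []
    for line, ident in zip(lines, identifications):
        # Lines without an identification are stored as null (None).
        if ident is not None and ident["label"] == label and ident["prob"] >= min_prob:
            kept.append(line)
    return kept


# Made-up document from a hypothetical German subcorpus.
doc = {
    "content": "Grüezi mitenand\nHello\nGuten Tag",
    "metadata": {
        "sentence_identifications": [
            {"label": "als", "prob": 0.91},  # "als" is an assumed label
            None,
            {"label": "de", "prob": 0.95},
        ]
    },
}

swiss_german = lines_in_language(doc, "als")
```

The default `min_prob` of 0.8 mirrors the confidence threshold mentioned above; lowering it yields more text at the cost of more misclassified lines.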

We are working on improving the language classification step, which should help address the issue in future corpora.

Table

Language | Size | Documents | Words
Multilingual | 12.1 GB | 1,210,685 | 936,187,711
Afrikaans | 47.0 MB | 12,393 | 6,227,310
Albanian | 3.0 GB | 437,287 | 326,325,149
Alemannic / Swiss German | 363.6 kB | 139 | 37,381
Amharic | 461.0 MB | 37,513 | 30,481,153
Arabic | 84.2 GB | 8,718,929 | 6,103,711,887
Aragonese | 10.6 kB | 12 | 51
Armenian | 4.7 GB | 379,267 | 268,031,270
Assamese | 221.2 MB | 17,084 | 11,109,557
Asturian | 73.6 kB | 77 | 3,919
Avaric | 18.6 kB | 14 | 582
Azerbaijani | 3.5 GB | 491,847 | 291,927,692
Bangla | 15.1 GB | 1,171,501 | 751,877,226
Bashkir | 95.5 MB | 11,198 | 5,418,474
Basque | 1.1 GB | 233,658 | 97,092,942
Belarusian | 1.8 GB | 180,046 | 107,227,860
Bihari languages | 24.2 kB | 27 | 569
Bishnupriya | 2.0 MB | 271 | 98,419
Bosnian | 10.3 kB | 10 | 422
Breton | 33.7 MB | 16,119 | 3,111,619
Bulgarian | 35.1 GB | 2,887,115 | 2,405,981,285
Burmese | 1.9 GB | 158,733 | 44,835,970
Catalan | 13.9 GB | 2,627,307 | 1,508,919,864
Cebuano | 44.6 MB | 5,742 | 5,253,785
Central Kurdish | 716.4 MB | 84,950 | 43,913,025
Chechen | 14.0 MB | 4,086 | 798,766
Chinese | 900.9 GB | 56,524,518 | 23,149,203,886
Chuvash | 41.8 MB | 4,750 | 2,465,782
Cornish | 1.4 kB | 2 | 55
Croatian | 11.2 MB | 11,462 | 505,369
Czech | 58.6 GB | 10,381,916 | 5,452,724,456
Danish | 12.6 GB | 2,265,479 | 1,454,439,292
Dimli (individual language) | 706 Bytes | 1 | 19
Divehi | 217.2 MB | 24,067 | 10,112,205
Dutch | 114.0 GB | 20,206,532 | 12,329,127,151
Eastern Mari | 11.3 MB | 1,612 | 641,525
Egyptian Arabic | 2.8 MB | 1,256 | 176,096
English | 3.2 TB | 431,992,659 | 377,376,402,775
Esperanto | 558.3 MB | 111,932 | 58,416,628
Estonian | 9.2 GB | 1,362,524 | 820,975,443
Filipino | 646.5 MB | 70,394 | 81,881,278
Finnish | 37.8 GB | 4,948,961 | 2,900,615,928
French | 382.2 GB | 52,037,098 | 41,713,990,658
Galician | 255.2 MB | 88,803 | 27,051,212
Georgian | 7.1 GB | 488,588 | 281,430,479
German | 496.7 GB | 70,075,424 | 46,826,676,844
Goan Konkani | 787.2 kB | 46 | 38,831
Greek | 78.3 GB | 6,738,546 | 5,031,242,803
Guarani | 9.0 kB | 10 | 374
Gujarati | 4.8 GB | 136,467 | 301,170,777
Hebrew | 30.3 GB | 3,132,396 | 2,249,377,984
Hindi | 23.3 GB | 1,529,907 | 1,534,799,198
Hungarian | 53.9 GB | 6,866,062 | 4,598,787,907
Icelandic | 2.0 GB | 396,183 | 210,365,124
Ido | 77.3 kB | 105 | 2,690
Iloko | 97.9 kB | 75 | 8,592
Indonesian | 17.4 GB | 2,244,622 | 1,984,195,207
Interlingua | 40.2 kB | 6 | 10,125
Irish | 45.6 MB | 12,233 | 4,877,850
Italian | 229.3 GB | 28,502,092 | 24,294,684,830
Japanese | 258.7 GB | 36,328,931 | 5,592,948,356
Javanese | 152.7 kB | 70 | 10,441
Kalmyk | 9.3 kB | 9 | 250
Kannada | 2.6 GB | 150,850 | 108,450,571
Karachay-Balkar | 119.6 kB | 91 | 4,089
Kazakh | 2.9 GB | 261,085 | 157,267,307
Khmer | 1.9 GB | 121,910 | 30,564,131
Komi | 119.9 kB | 127 | 3,335
Korean | 51.8 GB | 5,881,481 | 3,854,968,649
Kurdish | 150.3 MB | 29,906 | 17,390,759
Kyrgyz | 518.6 MB | 62,244 | 28,028,986
Lao | 337.1 MB | 28,914 | 6,682,982
Latin | 4.1 MB | 4,397 | 187,446
Latvian | 8.2 GB | 1,032,987 | 707,361,898
Lezghian | 375.5 kB | 124 | 19,250
Limburgish | 1.4 kB | 2 | 41
Lithuanian | 20.0 GB | 2,303,070 | 1,712,802,056
Lojban | 1.9 MB | 570 | 260,542
Lombard | 2.6 kB | 2 | 225
Low German | 9.0 MB | 1,938 | 1,012,561
Lower Sorbian | 707 Bytes | 1 | 17
Luxembourgish | 15.8 MB | 5,108 | 1,545,946
Macedonian | 3.6 GB | 341,775 | 244,058,579
Maithili | 21.6 kB | 23 | 483
Malagasy | 57.3 MB | 3,028 | 7,279,056
Malay | 5.3 MB | 5,228 | 217,818
Malayalam | 4.1 GB | 250,972 | 137,831,247
Maltese | 2.5 MB | 2,208 | 118,190
Marathi | 3.3 GB | 250,376 | 160,179,233
Mazanderani | 128.2 kB | 76 | 7,337
Minangkabau | 6.0 MB | 585 | 614,613
Mingrelian | 7.6 MB | 2,550 | 253,333
Mongolian | 2.8 GB | 237,719 | 176,405,432
Nahuatl languages | 8.7 kB | 12 | 179
Nepali | 3.7 GB | 391,947 | 177,885,116
Newari | 5.7 MB | 1,134 | 273,837
Norwegian | 2.8 GB | 973,188 | 279,182,902
Norwegian Nynorsk | 6.8 MB | 5,835 | 459,183
Occitan | 2.1 MB | 37 | 331,061
Odia | 487.9 MB | 52,942 | 23,755,902
Ossetic | 13.9 MB | 3,560 | 800,430
Pashto | 490.3 MB | 50,312 | 46,293,249
Persian | 77.4 GB | 7,665,871 | 6,430,164,396
Piedmontese | 1.7 MB | 698 | 188,270
Polish | 139.0 GB | 19,301,137 | 12,584,498,906
Portuguese | 170.3 GB | 23,735,707 | 18,441,864,893
Punjabi | 1.1 GB | 68,094 | 70,068,604
Quechua | 744 Bytes | 1 | 14
Romanian | 49.2 GB | 4,624,764 | 5,261,803,995
Russia Buriat | 32.9 kB | 39 | 785
Russian | 1.1 TB | 76,060,844 | 62,811,122,663
Sakha | 65.6 MB | 6,284 | 3,473,813
Sanskrit | 136.0 MB | 4,472 | 5,671,369
Scottish Gaelic | 137.7 kB | 136 | 7,769
Serbian | 6.9 GB | 577,472 | 482,932,670
Serbian (Latin) | 931.8 kB | 738 | 92,875
Sicilian | 1.5 kB | 2 | 50
Sindhi | 117.1 MB | 15,516 | 10,685,611
Sinhala | 2.0 GB | 108,593 | 113,179,741
Slovak | 16.5 GB | 2,409,555 | 1,619,121,944
Slovenian | 1.2 GB | 351,894 | 118,400,246
Somali | 2.1 kB | 3 | 109
South Azerbaijani | 14.1 MB | 5,381 | 693,746
Spanish | 381.9 GB | 51,386,247 | 42,829,835,316
Sundanese | 5.0 MB | 263 | 547,145
Swahili | 1.3 MB | 462 | 123,050
Swedish | 48.0 GB | 7,541,278 | 5,078,331,128
Tajik | 870.9 MB | 46,366 | 56,627,727
Tamil | 11.4 GB | 556,772 | 452,343,748
Tatar | 915.3 MB | 76,398 | 51,875,265
Telugu | 3.4 GB | 249,756 | 137,752,065
Thai | 66.1 GB | 5,030,254 | 1,626,779,846
Tibetan | 234.5 MB | 18,683 | 2,286,269
Turkish | 75.1 GB | 10,826,031 | 6,421,221,358
Turkmen | 4.4 MB | 2,485 | 276,632
Ukrainian | 48.8 GB | 4,558,214 | 2,879,585,992
Emiliano-Romagnolo [eml] | 901 Bytes | 1 | 53
Upper Sorbian | 132.8 kB | 110 | 8,825
Urdu | 3.4 GB | 336,994 | 332,816,354
Uyghur | 201.9 MB | 18,556 | 11,240,889
Uzbek | 19.9 MB | 9,526 | 1,370,842
Vietnamese | 98.9 GB | 9,587,233 | 12,283,185,482
Volapük | 825.9 kB | 661 | 57,039
Walloon | 105.7 kB | 138 | 4,386
Waray | 7.6 MB | 933 | 830,872
Welsh | 409.3 MB | 90,378 | 49,488,495
Western Frisian | 75.3 MB | 21,946 | 6,357,929
Western Mari | 743.5 kB | 155 | 43,916
Western Panjabi | 46.7 MB | 6,790 | 4,060,419
Wu Chinese | 137.2 kB | 88 | 3,056
Yiddish | 232.5 MB | 23,418 | 15,809,780
Yoruba | 24.7 kB | 26 | 1,042
Julien Abadji
Research Engineer

I’m a research engineer in the ALMAnaCH research team at Inria.

Pedro Ortiz Suarez
Postdoctoral Researcher

I’m a postdoctoral researcher at the Data and Web Science Group at the University of Mannheim.