OSCAR 21.09

The September 2021 version of the OSCAR Corpus.

If you want to get the new corpus, please send us an email using the address on our homepage, with “OSCAR Access Request” as the subject. Please include your first name, last name, affiliation, contact details, the languages you need, and a brief description of how you intend to use OSCAR. ✉️📚
Please do not create a new Huma-Num account yourself: this will delay your access to the corpus by weeks! We will create an account for you. 📆
OSCAR 21.09 is now freely available on the Hugging Face datasets hub! 🚀

OSCAR, or Open Super-large Crawled Aggregated coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using Ungoliant.

This release of OSCAR is the second one, and is versioned using CalVer.

Features

These are the versions of the tooling, schemas, and data:

  • Common Crawl version: February/March 2021 (2021.10)
  • OSCAR Schema version: v1.1. It incorporates metadata in a backward-compatible manner.
  • Ungoliant version: v1. A new-generation tool, faster and better documented and tested than its predecessor, goclassy.

Changes

  • As per OSCAR Schema v1.1, each document/record has associated metadata.
  • New languages: Manx, Rusyn, Scots and West Flemish. Their size and quality still have to be assessed.
  • Removed languages: Central Bikol and Cantonese. The Cantonese corpus was of very low quality. The Central Bikol corpus is still available in OSCAR 2019.

Table

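| Code | Language | OSCAR 2019 | OSCAR 2019 deduplicated | OSCAR 21.09 | OSCAR 21.09 deduplicated | Issues |
|---|---|---|---|---|---|---|
| af | Afrikaans | 251MB | 170MB | 258MB | 157MB | |
| sq | Albanian | 2GB | 1GB | 3GB | 1GB | |
| am | Amharic | 377MB | 215MB | 405MB | 241MB | |
| ar | Arabic | 87GB | 33GB | 69GB | 35GB | |
| an | Aragonese | 1MB | 822KB | 1MB | 608KB | |
| hy | Armenian | 3GB | 1GB | 4GB | 1GB | |
| as | Assamese | 117MB | 73MB | 135MB | 95MB | |
| ast | Asturian | 2MB | 2MB | 7MB | 4MB | |
| av | Avaric | 418KB | 331KB | 421KB | 325KB | |
| az | Azerbaijani | 2GB | 1GB | 3GB | 1GB | |
| bn | Bangla | 10GB | 6GB | 14GB | 7GB | |
| ba | Bashkir | 133MB | 93MB | 110MB | 77MB | |
| eu | Basque | 889MB | 358MB | 900MB | 503MB | |
| bar | Bavarian | 507B | 507B | 2KB | 1KB | |
| be | Belarusian | 1GB | 1GB | 2GB | 1GB | |
| bh | Bihari languages | 112KB | 34KB | 579KB | 120KB | |
| bpy | Bishnupriya | 4MB | 1MB | 11MB | 4MB | |
| bs | Bosnian | 459KB | 120KB | 310KB | 175KB | |
| br | Breton | 29MB | 16MB | 49MB | 23MB | |
| bg | Bulgarian | 33GB | 14GB | 34GB | 15GB | |
| my | Burmese | 2GB | 1GB | 2GB | 1GB | |
| yue | Cantonese | 3KB | 2KB | - | - | |
| ca | Catalan | 8GB | 4GB | 13GB | 6GB | |
| ceb | Cebuano | 40MB | 24MB | 81MB | 58MB | |
| bcl | Central Bikol | 886B | 886B | - | - | |
| ckb | Central Kurdish | 509MB | 236MB | 784MB | 367MB | |
| cbk | Chavacano | 521B | 521B | 168B | 168B | |
| ce | Chechen | 8MB | 6MB | 29MB | 20MB | |
| zh | Chinese | 544GB | 267GB | 500GB | 266GB | |
| cv | Chuvash | 40MB | 27MB | 60MB | 41MB | |
| kw | Cornish | 44KB | 14KB | 119KB | 72KB | |
| hr | Croatian | 237MB | 115MB | 361MB | 169MB | |
| cs | Czech | 56GB | 25GB | 72GB | 33GB | |
| da | Danish | 16GB | 10GB | 18GB | 10GB | |
| diq | Dimli (individual language) | 147B | 147B | 294B | 147B | |
| dv | Divehi | 131MB | 81MB | 143MB | 111MB | |
| nl | Dutch | 82GB | 41GB | 97GB | 47GB | |
| mhr | Eastern Mari | 7MB | 6MB | 15MB | 10MB | |
| arz | Egyptian Arabic | 68MB | 34MB | 48MB | 21MB | |
| en | English | 2520GB | 1294GB | 2936GB | 1342GB | |
| myv | Erzya | 1KB | 1KB | 29KB | 2KB | |
| eo | Esperanto | 312MB | 238MB | 560MB | 390MB | |
| et | Estonian | 5GB | 2GB | 7GB | 3GB | |
| tl | Filipino | 601MB | 426MB | 699MB | 383MB | |
| fi | Finnish | 28GB | 13GB | 35GB | 20GB | |
| fr | French | 302GB | 147GB | 340GB | 161GB | |
| gl | Galician | 650MB | 402MB | 989MB | 549MB | |
| ka | Georgian | 3GB | 1GB | 6GB | 2GB | |
| de | German | 330GB | 155GB | 433GB | 184GB | |
| gom | Goan Konkani | 2MB | 1MB | 3MB | 2MB | |
| el | Greek | 66GB | 28GB | 72GB | 30GB | |
| gn | Guarani | 36KB | 23KB | 32KB | 25KB | |
| gu | Gujarati | 1GB | 756MB | 1GB | 950MB | |
| ht | Haitian Creole | 3KB | 3KB | 2KB | 1KB | |
| he | Hebrew | 21GB | 10GB | 29GB | 11GB | |
| hi | Hindi | 17GB | 9GB | 26GB | 13GB | |
| hu | Hungarian | 42GB | 18GB | 60GB | 29GB | |
| is | Icelandic | 1GB | 887MB | 2GB | 1GB | |
| io | Ido | 151KB | 133KB | 276KB | 221KB | |
| ilo | Iloko | 896KB | 653KB | 1MB | 857KB | |
| id | Indonesian | 32GB | 16GB | 40GB | 22GB | |
| ia | Interlingua | 678KB | 368KB | 291KB | 172KB | |
| ie | Interlingue | 24KB | 1KB | 7KB | 2KB | |
| ga | Irish | 91MB | 62MB | 131MB | 69MB | |
| it | Italian | 146GB | 73GB | 192GB | 94GB | |
| ja | Japanese | 231GB | 112GB | 208GB | 96GB | |
| jv | Javanese | 675KB | 598KB | 858KB | 728KB | |
| xal | Kalmyk | 115KB | 114KB | 62KB | 62KB | |
| kn | Kannada | 1GB | 1GB | 2GB | 1GB | |
| krc | Karachay-Balkar | 2MB | 2MB | 2MB | 2MB | |
| kk | Kazakh | 2GB | 1GB | 3GB | 1GB | |
| km | Khmer | 1GB | 608MB | 1GB | 860MB | |
| kv | Komi | 2MB | 1MB | 1MB | 588KB | |
| ko | Korean | 25GB | 11GB | 35GB | 15GB | |
| ku | Kurdish | 98MB | 62MB | 152MB | 108MB | |
| ky | Kyrgyz | 629MB | 406MB | 485MB | 334MB | |
| lo | Lao | 181MB | 118MB | 287MB | 163MB | |
| la | Latin | 26MB | 8MB | 103MB | 9MB | |
| lv | Latvian | 4GB | 1GB | 6GB | 2GB | |
| lez | Lezghian | 3MB | 3MB | 2MB | 2MB | |
| li | Limburgish | 29KB | 27KB | 76KB | 54KB | |
| lt | Lithuanian | 9GB | 4GB | 12GB | 5GB | |
| jbo | Lojban | 753KB | 694KB | 929KB | 731KB | |
| lmo | Lombard | 454KB | 444KB | 1MB | 1MB | |
| nds | Low German | 18MB | 13MB | 25MB | 17MB | |
| dsb | Lower Sorbian | 13KB | 7KB | 31KB | 14KB | |
| lb | Luxembourgish | 30MB | 21MB | 54MB | 37MB | |
| mk | Macedonian | 2GB | 1GB | 3GB | 1GB | |
| mai | Maithili | 324KB | 10KB | 685KB | 24KB | |
| mg | Malagasy | 21MB | 13MB | 59MB | 38MB | |
| ms | Malay | 116MB | 43MB | 146MB | 60MB | |
| ml | Malayalam | 5GB | 2GB | 4GB | 2GB | |
| mt | Maltese | 24MB | 17MB | 51MB | 26MB | |
| gv | Manx | - | - | 1KB | 907B | |
| mr | Marathi | 2GB | 1GB | 3GB | 1GB | |
| mzn | Mazanderani | 708KB | 617KB | 1MB | 1MB | |
| min | Minangkabau | 622KB | 317KB | 8MB | 1MB | |
| xmf | Mingrelian | 6MB | 4MB | 16MB | 10MB | |
| mwl | Mirandese | 1KB | 1KB | 3KB | 2KB | |
| mn | Mongolian | 2GB | 879MB | 1GB | 912MB | |
| nah | Nahuatl languages | 11KB | 10KB | 34KB | 21KB | |
| nap | Neapolitan | 17KB | 13KB | 1KB | 1KB | |
| ne | Nepali | 1GB | 1GB | 3GB | 2GB | |
| new | Newari | 5MB | 4MB | 6MB | 4MB | |
| frr | Northern Frisian | 4KB | 4KB | 7KB | 5KB | |
| lrc | Northern Luri | 77KB | 64KB | 183B | 183B | |
| no | Norwegian Bokmål | 8GB | 5GB | 9GB | 4GB | |
| nn | Norwegian Nynorsk | 88MB | 56MB | 123MB | 66MB | |
| oc | Occitan | 6MB | 3MB | 12MB | 5MB | |
| or | Odia | 259MB | 196MB | 538MB | 357MB | |
| os | Ossetic | 12MB | 10MB | 11MB | 6MB | |
| pam | Pampanga | 763B | 307B | 3KB | 3KB | |
| ps | Pashto | 378MB | 253MB | 404MB | 286MB | |
| fa | Persian | 84GB | 39GB | 79GB | 35GB | |
| pms | Piedmontese | 2MB | 1MB | 4MB | 3MB | |
| pl | Polish | 116GB | 50GB | 122GB | 48GB | |
| pt | Portuguese | 132GB | 67GB | 159GB | 71GB | |
| pa | Punjabi | 799MB | 481MB | 769MB | 430MB | |
| qu | Quechua | 80KB | 68KB | 322KB | 230KB | |
| ro | Romanian | 26GB | 11GB | 37GB | 15GB | |
| rm | Romansh | 7KB | 6KB | 3KB | 3KB | |
| bxr | Russia Buriat | 12KB | 10KB | 22KB | 18KB | |
| ru | Russian | 1239GB | 609GB | 1201GB | 542GB | |
| rue | Rusyn | - | - | 247B | 247B | |
| sah | Sakha | 43MB | 27MB | 57MB | 39MB | |
| sa | Sanskrit | 96MB | 38MB | 72MB | 43MB | |
| sco | Scots | - | - | 1KB | 1KB | |
| gd | Scottish Gaelic | 1MB | 1MB | 2MB | 1MB | |
| sr | Serbian | 4GB | 2GB | 6GB | 3GB | |
| sh | Serbian (Latin) | 25MB | 6MB | 13MB | 9MB | |
| scn | Sicilian | 3KB | 2KB | 4KB | 3KB | |
| sd | Sindhi | 363MB | 274MB | 75MB | 50MB | |
| si | Sinhala | 1GB | 840MB | 1GB | 791MB | |
| sk | Slovak | 9GB | 4GB | 14GB | 6GB | |
| sl | Slovenian | 2GB | 1GB | 4GB | 1GB | |
| so | Somali | 62KB | 15KB | 15KB | 13KB | |
| azb | South Azerbaijani | 28MB | 19MB | 47MB | 29MB | |
| es | Spanish | 297GB | 159GB | 342GB | 160GB | |
| su | Sundanese | 216KB | 145KB | 397KB | 274KB | |
| sw | Swahili | 13MB | 8MB | 11MB | 7MB | |
| sv | Swedish | 46GB | 26GB | 43GB | 19GB | |
| tg | Tajik | 396MB | 260MB | 985MB | 321MB | |
| ta | Tamil | 9GB | 5GB | 10GB | 5GB | |
| tt | Tatar | 701MB | 319MB | 947MB | 424MB | |
| te | Telugu | 2GB | 1GB | 3GB | 1GB | |
| th | Thai | 38GB | 17GB | 62GB | 26GB | |
| bo | Tibetan | 195MB | 144MB | 439MB | 358MB | |
| gsw [1] | Alemannic German | 5MB | 2MB | 7MB | 5MB | |
| tr | Turkish | 63GB | 28GB | 73GB | 33GB | |
| tk | Turkmen | 10MB | 7MB | 25MB | 20MB | |
| tyv | Tuvinian | 11KB | 8KB | 9KB | 7KB | |
| uk | Ukrainian | 56GB | 29GB | 53GB | 28GB | |
| eml [2] | Emiliano-Romagnolo | 225KB | 23KB | 22KB | 20KB | |
| hsb | Upper Sorbian | 4MB | 1MB | 2MB | 1MB | |
| ur | Urdu | 2GB | 1GB | 2GB | 1GB | |
| ug | Uyghur | 127MB | 86MB | 187MB | 123MB | |
| uz | Uzbek | 21MB | 11MB | 56MB | 28MB | |
| vec | Venetian | 18KB | 16KB | 37KB | 28KB | |
| vi | Vietnamese | 72GB | 33GB | 87GB | 42GB | |
| vo | Volapük | 2MB | 2MB | 2MB | 2MB | |
| wa | Walloon | 280KB | 207KB | 511KB | 329KB | |
| war | Waray | 2MB | 2MB | 4MB | 4MB | |
| cy | Welsh | 223MB | 139MB | 307MB | 180MB | |
| vls | West Flemish | - | - | 134B | 134B | |
| fy | Western Frisian | 35MB | 26MB | 82MB | 57MB | |
| mrj | Western Mari | 1MB | 1MB | 645KB | 521KB | |
| pnb | Western Panjabi | 11MB | 9MB | 68MB | 45MB | |
| wuu | Wu Chinese | 111KB | 32KB | 145KB | 69KB | |
| yi | Yiddish | 146MB | 87MB | 199MB | 93MB | |
| yo | Yoruba | 56KB | 26KB | 229KB | 120KB | |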

OSCAR Schema v1.1.0

The new OSCAR schema incorporates backward-compatible changes.

Changes

The old OSCAR Schema v1.0 featured the following file hierarchy, in an uncompressed form:

/
├── af
│   ├── af_sha256.txt
│   └── af.txt.gz
├── de
│   ├── de_sha256.txt    # Checksum file
│   └── de.txt.gz        # Textual content
├── en
│   ├── en_part_1.txt.gz        # Multipart example
│   ├── en_part_2.txt.gz
│   └── en_sha256.txt
├── yi
│   ├── yi_sha256.txt
│   └── yi.txt.gz
└── zh
    ├── zh_sha256.txt
    └── zh.txt.gz

The new OSCAR Schema v1.1 features the following file hierarchy (some languages omitted):

/
├── af
│   ├── af_meta.jsonl.gz
│   ├── af_sha256.txt
│   └── af.txt.gz
├── de
│   ├── de_meta.jsonl.gz # Metadata, in JSONLines format
│   ├── de_sha256.txt    # Checksum file
│   └── de.txt.gz        # Textual content
├── en
│   ├── en_meta_part_1.jsonl.gz # Multipart example
│   ├── en_meta_part_2.jsonl.gz # Each part is independent,
│   ├── en_part_1.txt.gz        # Ex: en_part_2.txt.gz and en_meta_part_2.jsonl.gz
│   ├── en_part_2.txt.gz
│   └── en_sha256.txt
├── yi
│   ├── yi_meta.jsonl.gz
│   ├── yi_sha256.txt
│   └── yi.txt.gz
└── zh
    ├── zh_meta.jsonl.gz
    ├── zh_sha256.txt
    └── zh.txt.gz
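
The naming convention in the hierarchy above can be sketched in Python as a small helper that maps a text file name to the metadata file that goes with it. The function name meta_name_for and the regular expression are illustrative assumptions, not part of the OSCAR tooling; the pattern also assumes language codes are lowercase ASCII letters, as in the listing.

```python
import re

# Matches "<lang>.txt.gz" and "<lang>_part_<n>.txt.gz" (assumed naming,
# derived from the hierarchy listing; lowercase language codes only).
_TEXT_RE = re.compile(r"^(?P<lang>[a-z]+)(?:_part_(?P<part>\d+))?\.txt\.gz$")


def meta_name_for(text_name: str) -> str:
    """Return the metadata file name paired with a text file name."""
    match = _TEXT_RE.match(text_name)
    if match is None:
        raise ValueError(f"not an OSCAR text file name: {text_name!r}")
    lang, part = match.group("lang"), match.group("part")
    if part is None:
        # Single-file language: de.txt.gz -> de_meta.jsonl.gz
        return f"{lang}_meta.jsonl.gz"
    # Multipart language: en_part_2.txt.gz -> en_meta_part_2.jsonl.gz
    return f"{lang}_meta_part_{part}.jsonl.gz"
```

Checksum files such as af_sha256.txt do not match the pattern, so the helper rejects them with a ValueError rather than guessing.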

File formats

.txt files

Lines are newline-separated, and documents are double-newline separated. In other words, there is a blank line between consecutive documents.
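
As a minimal sketch of reading this format, the following Python generator splits a gzip-compressed text file into documents on blank lines. The function name iter_documents is an assumption for illustration, and the sketch assumes the files are UTF-8 encoded.

```python
import gzip
from typing import Iterator, List


def iter_documents(path: str) -> Iterator[List[str]]:
    """Yield each document as a list of its lines (newlines stripped)."""
    current: List[str] = []
    # Assumption: the corpus files are UTF-8 text.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for raw in handle:
            line = raw.rstrip("\n")
            if line:
                current.append(line)
            elif current:
                # A blank line closes the current document.
                yield current
                current = []
    if current:
        # The last document may not be followed by a blank line.
        yield current
```

The generator streams the file line by line, so it does not need to hold a multi-gigabyte language file in memory.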

.jsonl files

These are the metadata, in JSONLines format.

Each line conforms to the following JSON Schema:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Metadata",
  "description": "Holds record headers.\n\nEach metadata is linked to a specific paragraph/text zone",
  "type": "object",
  "required": [
    "headers",
    "nb_sentences",
    "offset"
  ],
  "properties": {
    "headers": {
      "type": "object",
      "additionalProperties": {
        "type": "string"
      }
    },
    "nb_sentences": {
      "type": "integer",
      "format": "uint",
      "minimum": 0.0
    },
    "offset": {
      "type": "integer",
      "format": "uint",
      "minimum": 0.0
    }
  }
}
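
The constraints in this schema can be checked with the Python standard library alone. The sketch below is a hand-written check of the three required properties, not a general JSON Schema validator; the function name parse_metadata_line is an assumption for illustration.

```python
import json
from typing import Any, Dict


def parse_metadata_line(line: str) -> Dict[str, Any]:
    """Parse one JSONLines metadata record and check it against the schema."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("metadata line is not a JSON object")
    for key in ("headers", "nb_sentences", "offset"):
        if key not in record:
            raise ValueError(f"missing required key: {key}")
    headers = record["headers"]
    # "headers" is an object whose values must all be strings.
    if not isinstance(headers, dict) or not all(
        isinstance(value, str) for value in headers.values()
    ):
        raise ValueError("headers must map names to string values")
    for key in ("nb_sentences", "offset"):
        value = record[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key} must be a non-negative integer")
    return record
```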

Example:

{
   "headers":{                  // these headers keys are *almost* always present.
      "content-length":"11062", // the content length is not changed and reflects the 
                                // length before filtering and possible deduplication.
      "warc-target-uri":"...",
      "warc-type":"conversion",
      "content-type":"text/plain",
      "warc-date":"2021-02-24T17:55:29Z", // Following WARC specification, it is the crawl date.
      "warc-identified-content-language":"eng,zho",
      "warc-refers-to":"<urn:uuid:c649de0e-42a3-4e69-b675-98e28e084698>",
      "warc-block-digest":"sha1:V4PYYGYA6ZYA2WACDKSNL6NXGDN6XK6X",
      "warc-record-id":"<urn:uuid:121a822f-5362-4559-8891-d085415cdd90>"
   },
   "offset":0, // Related text is in the text file, from lines offset+1 to lines offset+nb_sentences.
   "nb_sentences":9
}
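
The offset rule in the example can be sketched in Python as follows. This assumes that "lines offset+1 to offset+nb_sentences" counts lines from 1, which corresponds to the zero-based slice [offset, offset + nb_sentences) over the lines of the text file. The function name lines_for_record is hypothetical.

```python
from typing import Any, Dict, List


def lines_for_record(text_lines: List[str], record: Dict[str, Any]) -> List[str]:
    """Return the text lines that a metadata record points to.

    text_lines holds every line of the matching .txt file, in order,
    including the blank lines that separate documents.
    """
    start = record["offset"]               # zero-based index of the first line
    end = start + record["nb_sentences"]   # exclusive end of the slice
    return text_lines[start:end]


# Toy data (not taken from the corpus): two documents separated by a blank
# line. The second offset counts the blank line, which is an assumption
# about how offsets are laid out.
toy_lines = ["first doc, line 1", "first doc, line 2", "", "second doc"]
first = lines_for_record(toy_lines, {"offset": 0, "nb_sentences": 2})
second = lines_for_record(toy_lines, {"offset": 3, "nb_sentences": 1})
```

For a multipart language, each text part should be read together with its own metadata part, since the parts are independent.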

<lang>_sha256.txt files

These are used to check for possible corruption during download. They can be used by running sha256sum -c <lang>_sha256.txt.
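
Where sha256sum is not available, the same check can be approximated in Python. This sketch assumes the checksum files use the standard sha256sum output format (a hexadecimal digest, whitespace, then the file name); the function name verify_checksums is an assumption for illustration.

```python
import hashlib
import os
from typing import Dict


def verify_checksums(checksum_path: str) -> Dict[str, bool]:
    """Map each file listed in a checksum file to True if its digest matches."""
    base_dir = os.path.dirname(os.path.abspath(checksum_path))
    results: Dict[str, bool] = {}
    with open(checksum_path, "r", encoding="utf-8") as handle:
        for entry in handle:
            entry = entry.strip()
            if not entry:
                continue
            expected, name = entry.split(maxsplit=1)
            name = name.lstrip("*")  # sha256sum marks binary mode with '*'
            digest = hashlib.sha256()
            with open(os.path.join(base_dir, name), "rb") as data:
                # Hash in 1 MiB chunks so large corpus files fit in memory.
                for chunk in iter(lambda: data.read(1 << 20), b""):
                    digest.update(chunk)
            results[name] = digest.hexdigest() == expected.lower()
    return results
```

Unlike sha256sum -c, which resolves file names relative to the current working directory, this sketch resolves them relative to the directory that holds the checksum file.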


  1. gsw is the ISO 639-2 code for Alemannic German. Earlier OSCAR versions identified it as als, due to a bug in fastText.

  2. The eml identification tag is deprecated and corresponds to the rgn and egl tags in ISO 639-3.

Julien Abadji
Research Engineer

I’m a research engineer in the ALMAnaCH research team at Inria.

Pedro Ortiz Suarez
Postdoctoral Researcher

I’m a postdoctoral researcher at the Data and Web Science Group at the University of Mannheim.