OSCAR News: September 2021

September 2021 marks an important milestone for OSCAR: a new pipeline tool, a new schema, and a new corpus release.

New pipeline tool

The previous OSCAR corpus was generated by goclassy.

The latest release (at the time of writing) and all future ones will be generated by Ungoliant, a brand new corpus generation tool written in Rust.

The tool provides a more accessible Command Line Interface, with the following features:

  • Downloading of CommonCrawl shards from a wet.paths file,
  • Generation of a corpus from CommonCrawl shards,
  • Deduplication,
  • Splitting into files of fixed maximum size,
  • Packaging (GZipping + checksum file creation).
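As an illustration of what the packaging step involves, here is a minimal Python sketch that gzips a file and computes a checksum of the archive. It is not Ungoliant's implementation (Ungoliant is written in Rust); the function name `package`, the `.gz` naming, and the choice of SHA-256 are assumptions made for this example only.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def package(src: Path) -> tuple[Path, str]:
    """Gzip `src` next to itself and return (archive path, checksum).

    Hypothetical helper: the hash algorithm (SHA-256) is an assumption,
    not necessarily the one used for OSCAR releases.
    """
    dst = src.with_name(src.name + ".gz")
    # Stream the file into the archive so large files are not loaded at once.
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    # Hash the compressed archive in chunks.
    hasher = hashlib.sha256()
    with dst.open("rb") as archive:
        for chunk in iter(lambda: archive.read(1 << 20), b""):
            hasher.update(chunk)
    return dst, hasher.hexdigest()
```

A checksum file could then be written as one `<digest>  <filename>` line per archive, which is the layout that common checksum tools expect.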

It is also possible to use Ungoliant as a library, although this usage has not yet been tested in terms of ergonomics.

New schema

OSCAR is changing and will change again over the course of the coming months and years, so it is important to provide a way to announce and specify schema changes.

The new OSCAR Corpus release follows the OSCAR Schema v1.1, which lets users optionally get metadata extracted from CommonCrawl. The schema retains backward compatibility, making the update from OSCAR 2018 to OSCAR 21.09 as effortless as possible.

In short, OSCAR Schema v1.1 adds <lang>_meta.jsonl JSON Lines files that hold metadata. Each line has an offset field and an nb_sentences field that can be used to get the related lines in the text files.
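The lookup can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference reader: it assumes that offset is a zero-based line index into the text file and that nb_sentences is the number of consecutive lines belonging to the record. The function name `lines_for_record` and the sample data are hypothetical; check the schema specification for the exact semantics.

```python
import json


def lines_for_record(meta_line: str, text_lines: list[str]) -> list[str]:
    """Return the text lines that one metadata record points to.

    Assumption: `offset` is a zero-based line index and `nb_sentences`
    counts the consecutive lines that belong to the record.
    """
    record = json.loads(meta_line)
    start = record["offset"]
    return text_lines[start:start + record["nb_sentences"]]


# Hypothetical sample data standing in for a text file and its metadata file.
text_lines = ["First sentence.", "Second sentence.", "Another document."]
meta_lines = [
    '{"offset": 0, "nb_sentences": 2}',
    '{"offset": 2, "nb_sentences": 1}',
]

for meta_line in meta_lines:
    print(lines_for_record(meta_line, text_lines))
```

Users who only need the raw text can ignore the metadata files entirely, which is what keeps the schema backward compatible.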

New corpus

OSCAR 21.09 is the latest release of the OSCAR Corpus. It is the first to be generated by Ungoliant, and also the first to contain metadata.

Note that the data used to generate the corpus is from February 2021. Future releases of the OSCAR Corpus will try to narrow the gap between source data and corpus release.

Another important change is that there is no longer a shuffled version.

OSCAR 21.09 is expected to be available during September 2021 via manual application through the Contact form of the website, and later on other platforms.

Julien Abadji
Research Engineer

I’m a research engineer in the ALMAnaCH research team at Inria.

Pedro Ortiz Suarez

I’m a researcher at the Speech and Language Technology Team at DFKI GmbH Berlin.