OSCAR News: September 2021
September 2021 marks an important milestone for OSCAR, and we have several pieces of news to share.
New pipeline tool
The previous OSCAR corpus was generated by goclassy.
The latest release (at the time of writing) and all future releases will be generated by Ungoliant, a brand new corpus generation tool written in Rust.
The tool provides a more accessible Command Line Interface, with the following features:
- Downloading of CommonCrawl shards from a wet.paths file,
- Generation of a corpus from CommonCrawl shards,
- Deduplication,
- Splitting into files of a fixed maximum size,
- Packaging (GZipping + checksum file creation)
It is also possible to use Ungoliant as a library, although this usage has not yet been exercised or tested for ergonomics.
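To make the splitting and packaging steps listed above more concrete, here is a minimal Python sketch. It is not Ungoliant's code (Ungoliant is written in Rust) and does not use its CLI. The function names `split_lines` and `package`, the output file naming, and the choice of SHA-256 for the checksum file are all assumptions made for illustration.

```python
import gzip
import hashlib
from pathlib import Path


def split_lines(lines, max_bytes):
    """Group lines into chunks whose UTF-8 size stays at or below max_bytes.

    A single line larger than max_bytes still gets a chunk of its own.
    """
    chunk, size = [], 0
    for line in lines:
        line_size = len(line.encode("utf-8"))
        # Close the current chunk before it would exceed the size limit.
        if chunk and size + line_size > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += line_size
    if chunk:
        yield chunk


def package(lines, out_dir, stem, max_bytes):
    """Write gzipped parts plus one checksum file; return the part paths.

    File names and the SHA-256 digest are hypothetical choices.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts, checksums = [], []
    for i, chunk in enumerate(split_lines(lines, max_bytes), start=1):
        part = out_dir / f"{stem}_part_{i}.txt.gz"
        with gzip.open(part, "wt", encoding="utf-8") as f:
            f.writelines(chunk)
        # The checksum covers the compressed file as stored on disk.
        digest = hashlib.sha256(part.read_bytes()).hexdigest()
        checksums.append(f"{digest}  {part.name}\n")
        parts.append(part)
    (out_dir / f"{stem}_sha256.txt").write_text("".join(checksums))
    return parts
```

Note that the size limit in this sketch applies to the uncompressed text, so each gzipped part will usually be smaller than `max_bytes`.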
New schema
OSCAR is changing and will change again over the course of the coming months and years, so it is important to provide a way to announce and specify schema changes.
The new OSCAR Corpus release follows the OSCAR Schema v1.1. It lets users optionally get metadata extracted from CommonCrawl while retaining backward compatibility, which makes the update from OSCAR 2018 to OSCAR 21.09 as effortless as possible.
In a nutshell, OSCAR Schema v1.1 adds <lang>_meta.jsonl JSONLines files that hold metadata. Each line has an offset and an nb_sentences field that can be used to get the related lines in the text files.
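As an illustration of how the two fields can be used together, here is a minimal Python sketch that pairs each metadata entry with its lines of text. It assumes that offset is a zero-based line index into the matching text file and that nb_sentences counts the lines belonging to the entry; check the schema specification before relying on either. The helper names `read_document` and `iter_documents` are hypothetical.

```python
import json


def read_document(text_lines, meta):
    """Return the lines of one document given its metadata entry.

    Assumes meta["offset"] is a zero-based line index (an assumption,
    not confirmed by the announcement).
    """
    start = meta["offset"]
    return text_lines[start:start + meta["nb_sentences"]]


def iter_documents(text_path, meta_path):
    """Yield (metadata, lines) pairs for every entry of a metadata file."""
    # Loads the whole text file in memory: fine for a sketch, but a
    # streaming reader would be preferable for real corpus files.
    with open(text_path, encoding="utf-8") as f:
        text_lines = f.read().split("\n")
    with open(meta_path, encoding="utf-8") as f:
        for raw in f:
            if raw.strip():  # skip blank lines in the JSONLines file
                meta = json.loads(raw)
                yield meta, read_document(text_lines, meta)
```

Since the metadata files are optional, code that only needs the text can ignore them and keep reading the text files directly.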
New corpus
OSCAR 21.09 is the latest release of the OSCAR Corpus. It is the first to be generated by Ungoliant, and also the first to contain metadata.
Note that the data used to generate the corpus is from February 2021, but future releases of the OSCAR Corpus will aim to narrow the gap between the source data and the corpus release.
Another important change is that there is no longer a shuffled version.
The corpus is expected to be available during September 2021 via manual application through the Contact form of the website, and later on other platforms.