News: OSCAR 23.01 Release

The new OSCAR 23.01 is finally available. Check it out here! 🚀

After one year of work, we’re happy to announce the release of OSCAR 23.01! 🚀 OSCAR 23.01 is the January 2023 version of the OSCAR Corpus, based on the November/December 2022 dump of Common Crawl. Although it is quite similar to OSCAR 22.01, it contains several new features, including:

  1. KenLM-based adult content detection 👀
  2. Precomputed locality-sensitive hashes for near-deduplication 📍
  3. Blocklist-based categories 📚
  4. Zstandard compression 📦
  5. A new blocklist specifically made for Japanese 🇯🇵
  6. Language naming changes to better respect the BCP 47 standard ✏️🤓

To get access to this version and to learn more about it, please visit our documentation here.

This release was made possible by Julien Abadji, Pedro Ortiz Suarez, Rua Ismail, Sotaro Takeshita, Sebastian Nagel and Benoit Sagot.

Pedro Ortiz Suarez
Researcher

I’m a researcher in the Speech and Language Technology Team at DFKI GmbH Berlin.

Julien Abadji
Research Engineer

I’m a research engineer in the ALMAnaCH research team at Inria.