News: OSCAR 23.01 Release

The new OSCAR 23.01 is finally available. Check it out here! 🚀

After one year of work, we’re happy to announce the release of OSCAR 23.01! 🚀 OSCAR 23.01 is the January 2023 version of the OSCAR Corpus, based on the November/December 2022 dump of Common Crawl. Although it is quite similar to OSCAR 22.01, it contains several new features, including:

  1. KenLM-based adult content detection 👀
  2. Precomputed Locality-Sensitive Hashes for near deduplication
  3. Blocklist-based categories 📚
  4. Zstandard compression 📦
  5. A new blocklist specifically made for Japanese 🇯🇵
  6. Language naming changes to better respect the BCP47 standard ✍️🤓

To get access to this version and more information about it, please visit our documentation here.

This release was made possible by Julien Abadji, Pedro Ortiz Suarez, Rua Ismail, Sotaro Takeshita, Sebastian Nagel and Benoit Sagot.

Pedro Ortiz Suarez
Researcher

I’m a researcher at the Speech and Language Technology Team at DFKI GmbH Berlin.

Julien Abadji
Research Engineer

I’m a research engineer at the ALMAnaCH research team at Inria.