News: OSCAR 23.01 Release
The new OSCAR 23.01 is finally available. Check it out here! 🚀
After one year of work, we’re happy to announce the release of OSCAR 23.01! 🚀 OSCAR 23.01 is the January 2023 version of the OSCAR Corpus, based on the November/December 2022 dump of Common Crawl. Although it is quite similar to OSCAR 22.01, it contains several new features, including:
- KenLM-based adult content detection 👀
- Precomputed Locality-Sensitive Hashes for near deduplication 📍
- Blocklist-based categories 📚
- Zstandard compression 📦
- A new blocklist specifically made for Japanese 🇯🇵
- Language naming changes to better respect the BCP 47 standard ✏️🤓
To get access to this version and more information about it, please visit our documentation here.
This release was made possible by Julien Abadji, Pedro Ortiz Suarez, Rua Ismail, Sotaro Takeshita, Sebastian Nagel and Benoit Sagot.