OSCAR
Recent & Upcoming Talks
Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus
We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data.
Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
Last updated on Sep 3, 2021
PDF
Code
Workshop
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
We explore the impact of the training corpus on contextualized word embeddings in five mid-resource languages.
Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
Last updated on Sep 3, 2021
Slides
Video
Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
We propose a new pipeline to filter, clean, and classify Common Crawl by language. We publish the final corpus under the name OSCAR.
Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary
Last updated on Sep 3, 2021
PDF
Code
Workshop