OSCAR
Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus
We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data.
Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
PDF
Cite
Code
Dataset
DOI
CMLC-9
Website
HAL
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
We audit 5 multilingual corpora, finding that lower-resource corpora have systematic issues.
Isaac Caswell, Julia Kreutzer, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suarez, Iroro Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi
PDF
Cite
HAL
arXiv
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
We explore the impact of the training corpus on contextualized word embeddings in five mid-resource languages.
Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
PDF
Cite
Dataset
ACL Anthology
ACL 2020
HAL
arXiv
Establishing a New State-of-the-Art for French Named Entity Recognition
We convert the NER annotations of the French TreeBank to a more user-friendly format and establish a new state of the art for French NER.
Pedro Ortiz Suarez, Yoann Dupont, Benjamin Muller, Laurent Romary, Benoît Sagot
PDF
Cite
LREC 2020
HAL
arXiv
ACL Anthology
Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
We propose a new pipeline to filter, clean, and classify Common Crawl by language. We publish the final corpus under the name OSCAR.
Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary
PDF
Cite
Code
Dataset
Slides
DOI
CMLC-7
HAL