Frequently Asked Questions
How can I get the original files using the HuggingFace datasets platform?
Hugging Face's datasets library internally uses Apache Arrow files, which differ from the jsonl.gz files that we distribute; the .arrow files can usually be found in the library's cache directory (by default ~/.cache/huggingface on Unix-like systems). However, it is possible to get the original (jsonl.gz) files using Git LFS.
```bash
# Clone the repository, skipping the download of LFS files
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/oscar-corpus/OSCAR-2109
# Go inside the directory
cd OSCAR-2109
# Pull the required file(s) (here the Basque corpus); check the man page for pull options
git lfs pull --include packaged/eu/eu.txt.gz
```
You’re all set! Decompress the files and you are ready to use the corpus.
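If you prefer to stay in Python, a pulled corpus file can also be streamed line by line straight from the compressed archive using only the standard library. This is a minimal sketch; `iter_lines` is a helper name chosen here, and the file path in the usage comment is the Basque file from the example above:

```python
import gzip

def iter_lines(path):
    """Yield decoded text lines from a gzip-compressed file
    without loading the whole corpus into memory."""
    with gzip.open(path, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            yield line.rstrip("\n")

# Example usage with the file pulled above:
# for line in iter_lines("packaged/eu/eu.txt.gz"):
#     process(line)
```

Streaming this way avoids decompressing the whole file to disk first, which matters for the larger corpora.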
How can I decompress OSCAR Corpus files?
OSCAR is distributed in split files that are compressed in order to save space. However, some decompression programs have trouble dealing with these multipart gzip files.
Here are some programs that work well on the mainstream operating systems:
- macOS/Linux: use gzip -d FILE.gz.
- Windows: use 7-Zip.
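As a cross-platform alternative, Python's standard-library gzip module can decompress the files as well, and it reads through concatenated gzip members, which is exactly the property some GUI tools lack. A minimal sketch (the function name and example paths are illustrative):

```python
import gzip
import shutil

def gunzip(src, dst):
    """Decompress src (a .gz file, possibly containing multiple
    concatenated gzip members) into dst, streaming in chunks so
    large corpus files never sit fully in memory."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)

# Example usage:
# gunzip("eu.txt.gz", "eu.txt")
```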
Have a question or an issue with OSCAR?
Send us your question by email using the address on our homepage.