Frequently Asked Questions

How can I get the original files using the HuggingFace datasets platform?

HuggingFace’s datasets library internally uses Apache Arrow files, which differ from the txt.gz/jsonl.gz pairs that we distribute. These .arrow files can usually be found in subfolders of ~/.cache/huggingface/datasets/.
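
If you do use the datasets library, you can locate those Arrow cache files programmatically. Here is a minimal sketch in Python, assuming the datasets library is installed; the configuration name deduplicated_eu is an assumption, so check the dataset card for the actual list of configurations:

from datasets import load_dataset

# Load the Basque corpus. The config name "deduplicated_eu" is an assumption;
# check the OSCAR-2109 dataset card for the real configuration names.
# use_auth_token=True is needed because OSCAR-2109 is a gated dataset.
dataset = load_dataset(
    "oscar-corpus/OSCAR-2109",
    "deduplicated_eu",
    split="train",
    use_auth_token=True,
)

# Each entry lists an .arrow file backing this split in the local cache
# (typically under ~/.cache/huggingface/datasets/).
for cache_file in dataset.cache_files:
    print(cache_file["filename"])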

However, it is possible to get the original (txt.gz/jsonl.gz) files using Git LFS.

The following steps assume you have git and git-lfs installed and are on a UNIX-like system. The procedure should be roughly the same on Windows, but has not been tested there.

$> GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/oscar-corpus/OSCAR-2109 # clone the repository, ignoring LFS files
$> cd OSCAR-2109 # go inside the directory
$> git lfs pull --include packaged/eu/eu.txt.gz # pull the required file(s) (here, the Basque corpus). Check the manpage for pull options

You’re all set! Decompress the files and you are ready to use the corpus.
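
You can also explore the pair without fully decompressing it, since Python can stream gzip files directly. A minimal sketch, assuming you pulled both eu.txt.gz and its companion eu.jsonl.gz into packaged/eu/ (the metadata field names are not asserted here; inspect a record to see what is available):

import gzip
import itertools
import json

# Stream the first few text lines of the Basque corpus without decompressing.
with gzip.open("packaged/eu/eu.txt.gz", "rt", encoding="utf-8") as text_file:
    for line in itertools.islice(text_file, 3):
        print(line.rstrip())

# The companion .jsonl.gz holds one JSON metadata record per line.
with gzip.open("packaged/eu/eu.jsonl.gz", "rt", encoding="utf-8") as meta_file:
    first_record = json.loads(next(meta_file))
    print(first_record.keys())  # inspect which metadata fields are available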

How can I decompress OSCAR Corpus files?

OSCAR is distributed as split files that are compressed in order to save space. However, some decompression programs have trouble dealing with .gz files. Here are some programs that work well on the three mainstream operating systems (a cross-platform Python alternative is sketched after the list):

  • macOS/Linux: Use gzip -d FILE.gz.
  • Windows: Use 7-Zip.
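
If you prefer not to install an external tool, Python’s standard library works on any platform. A minimal sketch, using the Basque file from the example above:

import gzip
import shutil

# Decompress eu.txt.gz to eu.txt using only the standard library.
with gzip.open("eu.txt.gz", "rb") as compressed:
    with open("eu.txt", "wb") as decompressed:
        shutil.copyfileobj(compressed, decompressed)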

Have a question or an issue with OSCAR?

Send us your question by email, using the address listed on our homepage.