Recommendations on large datasets
Introduction
This document sets out the best practices for managing large datasets that are deposited in the Recherche Data Gouv repository.
The values and guidelines given below are recommendations based on experience; they do not stem from technical limitations of the tool.
If you have specific requirements that are not covered by this document, please feel free to reach out to the platform's resource centre at: support-recherchedatagouv@inrae.fr.
General information on file uploads
There are three methods available for uploading files to the repository:
1. **Deposit Interface**: This method is recommended for datasets that are smaller than 50 GB and contain fewer than 200 files. Please note that the maximum file size is 50 GB.
2. **DVUploader Application**: This is the best option for datasets made up of more than 200 files or over 50 GB in size. DVUploader also allows you to maintain the file tree structure when the deposit includes multiple directories or files.
3. **S3-Direct-Upload API**: This method is intended for users who are comfortable working with APIs. We do not recommend using any other API for file deposits.
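The thresholds above can be checked before starting a deposit. The following is a minimal sketch, not an official tool: the function names `summarize_directory` and `suggest_upload_method` are hypothetical, gigabytes are assumed to be decimal (10^9 bytes), and a dataset sitting exactly on a threshold is assumed to go to DVUploader. The S3-Direct-Upload API is left out because choosing it depends on the user rather than on the dataset.

```python
import os

GB = 10**9  # assumption: decimal gigabytes
MAX_INTERFACE_BYTES = 50 * GB  # size threshold quoted in the list above
MAX_INTERFACE_FILES = 200      # file-count threshold quoted in the list above


def summarize_directory(root):
    """Return (total_bytes, file_count) for all regular files under root."""
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                total_bytes += os.path.getsize(path)
                file_count += 1
    return total_bytes, file_count


def suggest_upload_method(total_bytes, file_count):
    """Suggest an upload method from the dataset size and file count."""
    # The deposit interface is suggested only when both values are
    # strictly below their thresholds (assumption for boundary cases).
    if total_bytes < MAX_INTERFACE_BYTES and file_count < MAX_INTERFACE_FILES:
        return "Deposit Interface"
    return "DVUploader"
```

For example, `suggest_upload_method(*summarize_directory("my_dataset"))` returns the suggestion for a local directory named `my_dataset` (a placeholder path).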
Important considerations
DOI attribution
Each file you deposit will be assigned a Digital Object Identifier (DOI). It is the depositor's responsibility to organize their files coherently within their dataset. While it may not be necessary to cite every individual file, it is important to consider which elements of the dataset need to be citable on their own.
Datasets with multiple files
To preserve the file structure without using DVUploader, you can group files into compressed archives (.zip, .xz, .7z, .bzip, .gz). Please note that only the ZIP format lets you preview the tree structure and download files individually.
We recommend keeping the following files outside of any compressed folder:
- Files that require a DOI (citation).
- Files that enhance the dataset's accessibility (such as Readme, metadata files, illustrative images, etc.).
- Files for which previewing and/or ingestion is desirable.
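As an illustration of the packaging advice above, here is a minimal sketch that builds a ZIP archive preserving the directory tree while leaving selected files outside it. It uses only Python's standard `zipfile` module; the function name `zip_tree` and the default `keep_outside` value are hypothetical, and which files to keep outside the archive remains the depositor's decision.

```python
import os
import zipfile


def zip_tree(root, archive_path, keep_outside=("README.md",)):
    """Zip every file under root except those listed in keep_outside.

    keep_outside holds paths relative to root, written with forward slashes.
    Returns the relative paths that were left out of the archive.
    """
    kept = []
    archive_abs = os.path.abspath(archive_path)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                # Skip the archive itself if it is being written inside root.
                if os.path.abspath(path) == archive_abs:
                    continue
                rel = os.path.relpath(path, root).replace(os.sep, "/")
                if rel in keep_outside:
                    kept.append(rel)
                    continue
                # arcname keeps the path relative to root, so the tree is preserved.
                archive.write(path, arcname=rel)
    return kept
```

The files returned by `zip_tree` would then be deposited alongside the archive as separate files, so that they can be previewed and cited individually.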
Datasets with large files (over 100 GB per dataset)
As noted earlier, we recommend using DVUploader to submit datasets larger than 50 GB. Additionally, please consider the following:
- Do not split large files to bypass the 50 GB limit. Instead, use the DVUploader application for the upload.
- Use an open or discipline-specific compression format.
For datasets in the terabyte (TB) range, please contact the platform's resource centre at support-recherchedatagouv@inrae.fr in advance.
For institutional datasets, please ensure compliance with the limit set by your agreement (currently 5 TB). If you are approaching or exceeding this limit, please contact the platform's resource centre at support-recherchedatagouv@inrae.fr.
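The size thresholds in this section can be turned into a simple self-check. The sketch below is illustrative only: the function name `storage_advice` is hypothetical, terabytes are assumed to be decimal (10^12 bytes), and the 80 % ratio used to decide when a dataset is "approaching" the limit is an arbitrary assumption, not a value set by the platform.

```python
TB = 10**12  # assumption: decimal terabytes
INSTITUTIONAL_LIMIT_BYTES = 5 * TB  # current agreement limit quoted above
APPROACHING_RATIO = 0.8  # hypothetical definition of "approaching the limit"


def storage_advice(total_bytes, limit_bytes=INSTITUTIONAL_LIMIT_BYTES):
    """Return a short advisory message for a dataset of total_bytes."""
    if total_bytes > limit_bytes:
        return "over limit: contact the resource centre"
    if total_bytes >= APPROACHING_RATIO * limit_bytes:
        return "approaching limit: contact the resource centre"
    if total_bytes >= TB:
        return "terabyte range: contact the resource centre in advance"
    return "no action needed"
```

If your agreement sets a different limit, it can be passed as `limit_bytes`.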