Working with large datasets in Domino

Introduction

When you start a run, Domino copies your project files to the executor that hosts the run. After every run, by default, Domino writes all files in the working directory (which defaults to /mnt or /mnt/<username>/<projectname>) back to the project as a new revision. When working with large datasets, this presents two potential problems:

  1. The number of files written back to the project may exceed the configurable file limit, which defaults to 10,000 files (see the quick check after this list).
  2. The time required for the write process is proportional to the size of the data, so it can be very long for large datasets.
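As a rough sanity check, you can count the files in your working directory from a workspace or run before relying on the default sync. This is a minimal sketch; it assumes the default working directory, so adjust the path if your project uses a different one.

# Count the files that would be written back to the project on sync.
# Assumes the default working directory; adjust the path if yours differs.
find /mnt -type f | wc -l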

There are three solutions to consider for these problems, discussed in detail below. The following table shows the recommended solution for various cases.

Case | Data size | # of files | Static / Dynamic | Solution
Large static dataset with small number of files | Up to 300GB | Less than 10k | Static | Domino Data Projects
Large dynamic dataset with large number of files | Up to 300GB | Unlimited | Dynamic | Project Data Compression
Small static dataset with large number of files | Up to 1GB total; up to 100MB per file | Unlimited | Static | External Git Repository

Domino Data Projects

When working on image recognition or image classification deep learning projects, it is common to require a training dataset of thousands of images, and the total dataset size can easily reach tens of GB. For these types of projects, the initial training typically uses a static dataset: one that is not constantly being changed or appended to. Furthermore, the dataset is normally processed into a single large tensor before training.

You should store your processed dataset in a Domino Data Project. Data Projects are meant to be imported into other Domino projects, where they are treated as read-only datasets. The first run that uses an imported Data Project will still need to copy the data to the executor, which takes the usual amount of time. For all subsequent runs, the data in the Data Project is loaded into the run instantly, so you can use your large static dataset without waiting through long “preparing” times.
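Once imported, the Data Project's files can be referenced by path from your analysis project's Runs. The mount path and file names below are assumptions for illustration, not Domino defaults; check where the imported project actually appears in your Run's file system.

# Hypothetical example: the imported Data Project is assumed to be mounted
# under a directory named after its owner and project name.
DATA_DIR=/mnt/<data-project-owner>/<data-project-name>
ls -lh "$DATA_DIR"
# Pass the single processed tensor file to a training script.
python train.py --data "$DATA_DIR/training_tensor.npy"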

If you need to store the image files separately instead of as a single file, you can still use Data Projects. However, the number of files will be limited, by default, to 10,000.

Syncing a large number of files in Domino will take a long time. If you are using Data Projects, expect the first sync to the Data Project and the first Run in your analysis Project to take several hours. All subsequent runs in your analysis projects that import the Data Project should load the files from your Data Project instantly.

For more information on Data Projects, see our documentation on Data Sets.

Project Data Compression

While we recommend that log data be pushed to a database for easy querying and retrieval, there may be times when you want to work with logs as raw text files. In that case, new log files are typically being added to the dataset constantly, so your dataset is dynamic. Here, we can encounter both of the potential problems detailed in the introduction simultaneously:

  1. The number of files may exceed the 10,000-file limit.
  2. The “preparing” and “syncing” phases may take a long time.

We currently recommend storing your large number of files in a compressed format. If you need the files to be uncompressed during your Run, you can leverage Domino Compute Environments to prepare them. In the pre-run script, you can uncompress your files with something like

tar -xvzf many_files_compressed.tar.gz

In the post-run script, you could re-compress the directory with

tar -cvzf many_files_compressed.tar.gz /path/to/directory-or-file
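In practice, you may want these scripts to be slightly more defensive: skip extraction when the archive is not present, and remove the uncompressed copies after re-packing so that only the single archive is synced back. A minimal sketch, reusing the archive name and placeholder path from the commands above:

# Pre-run script (sketch): unpack the archive only if it exists.
if [ -f many_files_compressed.tar.gz ]; then
    tar -xzf many_files_compressed.tar.gz
fi

# Post-run script (sketch): re-pack the files, then remove the uncompressed
# copies so only the single archive is written back to the project.
tar -czf many_files_compressed.tar.gz /path/to/directory-or-file
rm -rf /path/to/directory-or-file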

If your compressed file is still very large, the “preparing” and “syncing” times may still be long. If any part of your dataset is static, you can store that part in a Data Project to minimize copying times.

If this is a special project and you are appending to your dataset in a predictable, regular way, you can create a data pump and associate it with a reserved hardware tier. A data pump is a scheduled Run that pulls new data into your project. When Domino executes a Run, all of the project files are cached on that executor, so by using a reserved hardware tier that no one else uses, you can ensure that all Runs from the project occur on a much smaller set of machines. When the data pump Run finishes, Domino saves the compressed file back to the project; this may take a long time, but it happens automatically during off-peak hours. The executor that the data pump ran on will also have the project files cached, so your next Run should start quickly: the number of files Domino has to check is small, and the files are already cached on an executor in your reserved hardware tier.
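The data pump Run itself can be a short scheduled script. The sketch below assumes the archive from the previous section unpacks into a logs/ directory and that new log files can be fetched from a URL; both of these details are hypothetical placeholders, not Domino defaults.

# Data pump script (sketch), scheduled to run during off-peak hours.
# Unpack the existing archive (assumed to contain a logs/ directory).
tar -xzf many_files_compressed.tar.gz
# Fetch the newest log files from a placeholder source and add them to logs/.
curl -fsSL "https://logs.example.com/latest.tar.gz" | tar -xzf - -C logs/
# Re-pack everything and drop the uncompressed copies before syncing.
tar -czf many_files_compressed.tar.gz logs/
rm -rf logs/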

External Git repository

Sometimes your dataset may be fairly small but still consist of thousands of files. While transferring your files quickly will not be an issue in Domino, the limit on the number of files can be. In these cases, we recommend either compressing your files into one file and using Domino Compute Environments as described in the section above, or using Domino's GitHub integration and storing the data in GitHub.

Git alone can handle thousands of files fairly quickly. However, there is a total size limit of 1GB for a repository and 100MB for any one file on GitHub. If your dataset fits within those limits, you might consider using GitHub as a place to store your files.
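As a quick local check against those limits, you can measure the repository before relying on it as a data store. The repository URL below is a placeholder.

# Clone the dataset repository (placeholder URL) and check it against
# GitHub's limits: about 1GB per repository and 100MB per file.
git clone https://github.com/<your-org>/<your-dataset-repo>.git
cd <your-dataset-repo>
# Total repository size on disk.
du -sh .
# Any individual files over the 100MB per-file limit.
find . -type f -size +100M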

For more information on Domino's GitHub integration, see Git repositories in Domino.

 
