Working with big data in Domino

Introduction


When you start a Run, Domino copies your project files to the executor hosting the Run. When the Run finishes, by default Domino writes all files in the working directory back to the project as a new revision. When working with large volumes of data, this presents two potential problems:

  1. The number of files written back to the project may exceed the configurable limit. By default, Domino projects are limited to 10,000 files.

  2. The time required for the write process is proportional to the size of the data, so it can take a long time for very large datasets.
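To see whether a project is at risk of hitting the file limit before a Run completes, you can count the files in the working directory ahead of time. This is a minimal sketch; the 10,000 figure is the default limit, and your deployment may configure a different value:

```python
import os

# Default Domino project file limit; your deployment may configure another value.
FILE_LIMIT = 10_000

def count_files(root):
    """Recursively count regular files under a directory."""
    return sum(len(filenames) for _, _, filenames in os.walk(root))

def over_limit(root, limit=FILE_LIMIT):
    """Return True if the directory holds more files than the write-back limit."""
    return count_files(root) > limit
```

Running `over_limit(".")` from the project root before a Run ends tells you whether the write-back is likely to fail on file count alone.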

There are two solutions to consider for these problems, discussed in detail below. The following table shows the recommended solution for each case.

Case                           Data size     # of files   Static / Dynamic   Solution
Large volume of static data    Unlimited     Unlimited    Static             Domino Datasets
Large volume of dynamic data   Up to 300GB   Unlimited    Dynamic            Project Data Compression

Domino Datasets


When working on image recognition or image classification deep learning projects, it is common to require a training dataset of thousands of images. The total dataset size can easily reach tens of GB. For these types of projects, the initial training typically uses a static dataset: the data is not constantly being changed or appended to. Furthermore, the data that is actually used is normally processed into a single large tensor.

You should store your processed data in a Domino Dataset. Datasets can be mounted by other Domino projects, where they are attached as a read-only network filesystem to that project's Runs and Workspaces.
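As an illustration of processing a static image dataset into a single large tensor, the sketch below stacks individual image arrays into one array and saves it as one file. The file name and the synthetic images are placeholders; substitute your real image data and a path inside your own Dataset:

```python
import numpy as np

def pack_images(images, out_path):
    """Stack a list of equally-shaped image arrays into one tensor and save it.

    Writing one large .npy file instead of thousands of individual image files
    keeps the project file count low and the data easy to load at training time.
    """
    tensor = np.stack(images)  # shape: (num_images, height, width, channels)
    np.save(out_path, tensor)
    return tensor.shape

# Placeholder data: ten synthetic 64x64 RGB images.
images = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(10)]
print(pack_images(images, "packed_images.npy"))  # (10, 64, 64, 3)
```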

For more information on Datasets, see the Datasets overview.


Project Data Compression


There may be times when you want to work with logs as raw text files. When that is the case, typically new log files are constantly being added to the dataset, so your dataset is dynamic. Here, we can encounter both potential problems detailed in the introduction simultaneously:

  1. the number of files exceeding the 10,000-file limit
  2. long preparation and syncing times.

We currently recommend storing large numbers of files in a compressed archive. If you need the files uncompressed during your Run, you can use Domino Compute Environments to prepare them. In the pre-run script, uncompress your files:

tar -xvzf many_files_compressed.tar.gz

Then in the post-run script, you can re-compress the directory:

tar -cvzf many_files_compressed.tar.gz /path/to/directory-or-file
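If you would rather keep this logic in Python, for example inside a preprocessing script instead of the environment's pre- and post-run hooks, the standard library's tarfile module performs the same steps as the tar commands above. The file and directory names here are placeholders:

```python
import tarfile

def uncompress(archive_path, dest="."):
    """Equivalent of `tar -xvzf archive_path`: extract a gzipped tarball."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest)

def compress(archive_path, source_dir):
    """Equivalent of `tar -cvzf archive_path source_dir`: rebuild the tarball."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir)
```

Only extract archives you created yourself; tarfile.extractall trusts the member paths inside the archive.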

If your compressed file is still very large, the preparation and syncing times may still be long. Consider storing the archive in a Domino Dataset to minimize copying times.


Keywords: large data, big data, large dataset, many files, too many files
