- Writing to a local Dataset
- Reading from a shared Dataset
- Managing Datasets
Domino Datasets provide high-performance, versioned and structured filesystem storage in Domino. With Domino Datasets, you can build multiple curated pipelines of data in one project, and share them with your fellow contributors across their projects.
A Domino Dataset is a series of Snapshots. Each Snapshot is a completely independent state of the Dataset, and represents the contents of a filesystem directory at the time the Snapshot was written. There are two key ways to interact with a Domino Dataset:
- you can write a new Snapshot to one of your project's local Datasets
- you can read from an available Snapshot of a shared Dataset you have mounted
- Domino 3.3.1+ on AWS in an EFS-enabled region
- If you're running Domino on-premises and want to use Datasets, contact your Customer Success Manager for more information.
Domino Datasets belong to Domino projects. Permission to read from and write to a Dataset is granted to project contributors, just as it is for project files. A Dataset that belongs to a project is considered to be local to that project. To create a new Dataset in your project, click Datasets from the project menu, then click Create New Dataset.
Supply a name and optional description, then click Upload Contents. The upload page provides four ways to write to your local dataset.
- Browser Upload
In Domino 3.5+ you can use the Upload Files section to queue up to 50 GB or 50,000 individual files for upload through your browser. You can pause an upload and resume it within 24 hours. You can upload directories and subdirectories to preserve your filesystem structure.
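Before queuing a large directory, it can help to check it against the limits described above. The following is a minimal sketch, not a Domino tool: the function name check_browser_upload_limits is hypothetical, and it assumes that "50 GB" means 50 x 1024^3 bytes and that disk usage reported by du is a reasonable approximation of upload size.

```shell
#!/usr/bin/env bash
# Sketch: compare a local directory against the browser upload limits.
# Assumptions: 1 GB = 1024^3 bytes; du's disk usage approximates upload size.

MAX_FILES=50000
MAX_KB=$((50 * 1024 * 1024))   # 50 GB expressed in KiB

check_browser_upload_limits() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 2
    fi
    local file_count size_kb
    # Count regular files in the directory and all of its subdirectories.
    file_count=$(find "$dir" -type f | wc -l | tr -d ' ')
    # Total size of the directory tree in KiB.
    size_kb=$(du -sk "$dir" | awk '{print $1}')
    echo "files=$file_count size_kb=$size_kb"
    if [ "$file_count" -gt "$MAX_FILES" ] || [ "$size_kb" -gt "$MAX_KB" ]; then
        return 1   # over at least one limit
    fi
    return 0
}
```

Calling check_browser_upload_limits /Users/myUser/data prints the file count and size, and returns a non-zero status if the directory exceeds either limit. In that case the CLI upload described next may be a better fit.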
- CLI Upload
After installing and configuring the Domino CLI, you can copy and paste the displayed command to upload a directory of files from your local machine to the Dataset. Note that all contents of the directory you specify are written to the Dataset.
For the example shown above, if the files you want to write to the Dataset are in /Users/myUser/data, you would run the following command:
domino upload-dataset njablonski/simple-model/main /Users/myUser/data
When finished, click Complete. You will then be taken to the Dataset overview where you should see a new Snapshot has been written. The new Snapshot will contain exactly those files that were in the folder you uploaded from your local machine.
- Upload by Running Script
Before using this method, you need a script in your project files that is configured to write to the target Dataset. Supply the name of a Bash, Python, or R script and click Start to launch a Job. During the Job, an empty folder will be available at the path shown in Output Directory. At the conclusion of the Job, any files that your script has written to the output directory will be written to your Dataset as a new Snapshot.
In the example above, the Output Directory is /domino/datasets/main/output. As the simplest possible example, if your project contains a file named data.csv, you could run the following script, substituting the path shown in Output Directory for the destination:
cp $DOMINO_WORKING_DIR/data.csv /domino/datasets/output/<Dataset_Name>
When the script runs, it copies the data file to the Dataset output directory. Then, when the Job is finished, Domino writes a new Snapshot to the Dataset. The new Snapshot contains the exact contents of the output directory, which in this case is just the data.csv file.
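A slightly more defensive version of such a script might look like the sketch below. The helper name write_to_dataset is hypothetical and not part of Domino. It takes the output directory as its first argument, so you can pass whatever path the Output Directory field shows, and it stops with an error if that directory or a source file is missing rather than finishing the Job with an empty Snapshot.

```shell
#!/usr/bin/env bash
# Sketch: copy files into a Dataset output directory, failing loudly when
# something is missing. write_to_dataset is a hypothetical helper name.

write_to_dataset() {
    local output_dir="$1"
    shift
    if [ ! -d "$output_dir" ]; then
        echo "output directory not found: $output_dir" >&2
        return 1
    fi
    local src
    for src in "$@"; do
        if [ ! -e "$src" ]; then
            echo "source not found: $src" >&2
            return 1
        fi
        # -R lets the same helper copy directories as well as single files.
        cp -R "$src" "$output_dir"/ || return 1
    done
    # List the files now present in the output directory.
    find "$output_dir" -type f | sort
}

# Example call inside a Job, using the Output Directory from the example above:
# write_to_dataset /domino/datasets/main/output "$DOMINO_WORKING_DIR/data.csv"
```

The listing printed at the end appears in the Job output, which gives a quick record of what the script placed in the output directory before the Snapshot was written.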
- Upload by Launching a Workspace
This method works similarly to uploading by running a script. You will have all of the usual options available from your Domino environment for launching a Workspace. When the Workspace is launched, an empty folder will be available at the path shown in Output Directory. When you stop and sync the Workspace, any files that you have written to the output directory will be written to your Dataset as a new Snapshot.
There is a configurable limit to the number of snapshots a Dataset may contain.
This limit defaults to 20 Snapshots.
If your Dataset is at this limit, attempting to start an upload with any of the above methods will result in an error message. Before you can write additional Snapshots, you will need to delete old Snapshots or increase the limit. Talk to your local administrator for more information.
You can create a maximum of 5 local Datasets per project.
If you are setting up a pipeline that requires more than 5 Datasets, use a separate project for each logical task, and have each project import Datasets from the project that precedes it in the pipeline.
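One stage of such a pipeline could be sketched as follows: the script reads from the directory where the upstream project's Dataset is mounted and writes its results to the output directory of a local Dataset. The function name run_stage and the transform (dropping empty lines from CSV files) are placeholders chosen for illustration. Both paths are passed as arguments because the actual mount path and Output Directory are shown in the Domino UI.

```shell
#!/usr/bin/env bash
# Sketch of one pipeline stage: read an upstream (mounted) Dataset and write
# results to this project's Dataset output directory.
# run_stage and the transform are illustrative placeholders.

run_stage() {
    local input_dir="$1" output_dir="$2"
    if [ ! -d "$input_dir" ]; then
        echo "input Dataset not mounted at: $input_dir" >&2
        return 1
    fi
    if [ ! -d "$output_dir" ]; then
        echo "output directory not found: $output_dir" >&2
        return 1
    fi
    local f
    for f in "$input_dir"/*.csv; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        # Placeholder transform: remove empty lines from each CSV file.
        sed '/^$/d' "$f" > "$output_dir/$(basename "$f")" || return 1
    done
}
```

Because the stage only reads from the mounted input and only writes to its own output directory, each project in the chain produces its own Snapshots without touching the upstream Dataset.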
To access the contents of an existing Dataset snapshot, you must mount the target Dataset in your project. To mount a Dataset, click Datasets from the project menu, then click Mount Shared Dataset.
Click the Dataset to Mount field to see an autocomplete dropdown of Datasets you have access to. To access a Dataset, you must be an Owner, Contributor, Project Importer, or Results Consumer on the project that contains the Dataset.
There are three different settings under Update Behavior that control which Snapshot of the target Dataset your project will mount. You can mount the latest Snapshot, a tagged Snapshot, or a fixed Snapshot number. When finished, click Mount.
Now, on the Datasets page for your project you will see the Dataset you mounted listed under Shared Datasets.
The Path shown for the Dataset points to a directory where you will find the file contents of the mounted Snapshot in your project's Runs and Workspaces. When mounted this way, the Dataset is read-only.
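Because a mounted Dataset is read-only, a script that needs to modify a file must first copy it to a writable location. The sketch below shows one way to do that. The helper name copy_from_dataset is hypothetical, and the mount path is passed as an argument because the real value is the Path shown on the Datasets page.

```shell
#!/usr/bin/env bash
# Sketch: copy one file out of a read-only mounted Dataset into a writable
# working directory. copy_from_dataset is a hypothetical helper name.

copy_from_dataset() {
    local mount_path="$1" rel_path="$2" work_dir="$3"
    local src="$mount_path/$rel_path"
    if [ ! -f "$src" ]; then
        echo "not found in Dataset: $rel_path" >&2
        return 1
    fi
    mkdir -p "$work_dir" || return 1
    cp "$src" "$work_dir"/ || return 1
    # The copy can inherit read-only permissions from the source,
    # so make it writable for the current user.
    chmod u+w "$work_dir/$(basename "$rel_path")"
}
```

A Run could then call copy_from_dataset with the Dataset's Path, a file name such as data.csv, and a directory under the project working directory, and edit the copy freely while the mounted Snapshot stays unchanged.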
From the Datasets page of your project, click the name of a local or shared Dataset to open its overview page. At the top of the overview page you will see the Dataset name and description, plus buttons to upload to, rename, or archive the Dataset.
Below the description is a panel with Dataset details. Use the dropdown menu at the top of the panel to select a Snapshot; the panel then displays a list of the files it contains, plus some metadata about the Snapshot.
There are two important actions you can take on a Dataset snapshot:
- Add Tag
Above the Snapshot selection dropdown you will find a list of tags applied to the Snapshot, followed by a + Add Tag button. Tags can be used to identify a Snapshot when mounting a shared Dataset for input. This allows the Dataset owner to tag a Snapshot for production use, and move the tag to whichever Snapshot is in the desired state as the Dataset changes over time.
- Mark for Deletion
Clicking the Mark for Deletion button in the lower right of the panel will mark the currently selected Snapshot for deletion, changing its status. Such Snapshots can no longer be mounted in Runs by consuming projects. The Snapshot will be flagged to a Domino administrator as ready for deletion, but will not be fully deleted until the administrator takes an additional action to delete it, at which point the data is erased and no longer recoverable.