Getting started with Domino datasets

Beta Note

This article is for beta users of datasets V2.

If you're not in the datasets V2 testing program, you should visit the Data Sets V1 documentation. If you're interested in joining the testing program, contact your Customer Success Manager.

 

Overview


This hands-on tutorial shows you how to use datasets in Domino. Domino datasets provide high-performance, versioned, structured filesystem storage. With Domino datasets, you can build multiple curated pipelines of data in one project and share them with fellow contributors across their projects.

You'll be working with data from the Climate Analysis Indicators Tool (CAIT) from the World Resources Institute. If you follow along, you'll learn how to write, read, append, revert, and manage Domino datasets.

This tutorial will take you approximately 30 minutes to complete.

 

Prerequisites

  • Domino 3.1+
  • Domino executors must have access to the Internet

 

 

 

 

 

Creating datasets


Domino datasets belong to Domino projects. Permission to read from and write to a dataset is granted to project contributors, just as it is for project files. For your first step in this tutorial, you should create a new project. This project will be used to ingest, process, and store data.

Navigate to your Domino application, then click + New Project from the landing page. Give the project an informative name, set its visibility to Private, then click + Create Project.

 

Screen_Shot_2018-11-28_at_1.06.33_PM.png

 

You'll be automatically redirected to the Files page for the new project. On that page, click to open the Datasets tab. This tab shows basic information about the datasets in this project. You'll notice that a default dataset has been created automatically. You can ignore it for now.

 

Screen_Shot_2018-11-28_at_11.17.55_AM.png

 

For the purpose of this tutorial, you'll want to create two new datasets: one to store raw imported data, and one to store derived, processed data. Click Add New Dataset to get started.

 

Screen_Shot_2018-11-28_at_12.01.03_PM.png

 

Call this first dataset cait-raw, give it an informative description like the one shown above, then click Create. Afterwards, click Add New Dataset again, and follow the same process to create an emissions-trend dataset.

 

Screen_Shot_2018-11-28_at_12.03.14_PM.png

 

Now that your datasets are created, the next step is to configure your project to use them.

 

 

 

 

 

 

Setting up an initial dataset configuration


A Domino dataset is a series of snapshots. Each snapshot is a completely independent state, and represents the contents of a filesystem directory from the time when the snapshot was written.

When you want to interact with a dataset from your Domino project, you must mount the dataset.

There are two ways to do so:

  1. Mounting a dataset as an input dataset makes the contents of a specific snapshot (the most recent, by default) available in a directory at the specified mount point in your Workspace or Run. A dataset mounted only for input cannot be modified.
  2. Mounting a dataset as an output dataset creates an empty directory at the specified mount point. When your Run or Workspace is stopped, a new snapshot is written to the dataset with the contents of that directory. The new snapshot contains exactly the files that are in the mounted output directory at that time; snapshots do not append by default.

It's important to note that the same dataset can be mounted for input and output simultaneously at different mount points, which will be important when we perform append and revert operations later. For now, we'll set up our project to mount the cait-raw dataset for output, so we can populate it with some data.
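
The snapshot behavior described above can be pictured with a small conceptual model. This is only a sketch for building intuition: the ToyDataset class and its methods are hypothetical names invented for illustration, and are not part of any Domino API.

```python
class ToyDataset:
    """Conceptual model of a dataset as a series of independent snapshots.

    Hypothetical illustration only; this is not a Domino API.
    """

    def __init__(self):
        self.snapshots = []  # each snapshot maps filename -> contents

    def mount_input(self, index=-1):
        # An input mount exposes one snapshot (the latest by default).
        # A copy is returned so callers cannot alter the stored snapshot.
        return dict(self.snapshots[index])

    def write_snapshot(self, output_dir):
        # The new snapshot holds exactly what is in the output directory;
        # nothing carries over from earlier snapshots.
        self.snapshots.append(dict(output_dir))
        return len(self.snapshots) - 1


ds = ToyDataset()
ds.write_snapshot({"2010.csv": "a", "2011.csv": "b"})  # snapshot 0
ds.write_snapshot({"2012.csv": "c"})                   # snapshot 1
print(sorted(ds.mount_input()))   # ['2012.csv']
print(sorted(ds.mount_input(0)))  # ['2010.csv', '2011.csv']
```

In this model the second snapshot contains only the 2012 file, which is why an append requires copying the previous snapshot's files forward.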

Dataset configurations for Domino projects are controlled by a file named .domino-ds-config.yaml. If Domino sees this file in the root of your project, it will attempt to read it and make available the dataset configurations specified within.

The .domino-ds-config.yaml file doesn't exist by default, so from the Files page of your project, click the Add File button.

 

Screen_Shot_2018-11-28_at_12.48.25_PM.png

 

Name the file exactly .domino-ds-config.yaml. The full YAML schema for this configuration file is documented separately, but for now you can copy and paste the following YAML to create a configuration that mounts cait-raw for output.

 

datasetConfigs:

- configurationName: "PopulateRaw"
  outputDatasets:
  - path: "raw-output"
    dataSetName: "cait-raw"

 

The three important values in this configuration are:

  1. "PopulateRaw" is the name you give this configuration so you can identify and select it when starting a Workspace or Run.
  2. "raw-output" is the name of the directory, created under /domino/dataset, where the dataset is mounted to receive new output.
  3. "cait-raw" is the name of the dataset you want to mount.

Once you've filled in the filename and contents, click Save.
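
To see how the configuration values map to a location on disk, here is a minimal sketch that rewrites the PopulateRaw configuration as a plain Python dictionary and joins each path to the /domino/dataset base path used in this tutorial. The resolve_mounts helper is a hypothetical name for illustration; Domino performs this resolution itself.

```python
import posixpath

BASE_PATH = "/domino/dataset"  # base path stated in this tutorial

# The PopulateRaw configuration, written as a plain Python dict.
config = {
    "datasetConfigs": [
        {
            "configurationName": "PopulateRaw",
            "outputDatasets": [
                {"path": "raw-output", "dataSetName": "cait-raw"},
            ],
        }
    ]
}


def resolve_mounts(config, configuration_name):
    """Map each dataset name to its full mount path (hypothetical helper)."""
    for entry in config["datasetConfigs"]:
        if entry["configurationName"] == configuration_name:
            return {
                kind: {
                    d["dataSetName"]: posixpath.join(BASE_PATH, d["path"])
                    for d in entry.get(kind, [])
                }
                for kind in ("inputDatasets", "outputDatasets")
            }
    raise KeyError(configuration_name)


mounts = resolve_mounts(config, "PopulateRaw")
print(mounts["outputDatasets"])  # {'cait-raw': '/domino/dataset/raw-output'}
```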

 

Screen_Shot_2018-11-28_at_12.58.54_PM.png

 

You're now set up to write some data to a dataset.

 

 

 

 

 

Writing to an output dataset


If you've created a valid .domino-ds-config.yaml file in the root of a project, you'll be presented with an option to Add Dataset Configuration when starting a Run or Workspace. To populate some initial data into cait-raw, you should start up a Jupyter Workspace with the PopulateRaw configuration selected.

From the project menu, click Workspaces.

 

Screen_Shot_2018-11-28_at_1.49.57_PM.png

 

Click Jupyter, give the workspace session an informative name, and click Add Dataset Configuration. Select the PopulateRaw configuration in the dropdown menu, then click Launch Jupyter Workspace.

If you get an error saying that a valid dataset configuration file was not found, double-check that your file is valid YAML, uses spaces instead of tabs for indentation, and is named exactly .domino-ds-config.yaml, with no spaces before or after the filename.
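
Two of these checks, the exact filename and tab indentation, are easy to automate. The following is a rough sketch, assuming Python 3 and a local copy of the file. The check_config_file function is a hypothetical helper, it does not validate the YAML itself, and it is not part of Domino.

```python
import os
import tempfile


def check_config_file(path):
    """Return a list of likely problems with a dataset configuration file.

    Hypothetical pre-flight check; it does not parse YAML.
    """
    problems = []
    if os.path.basename(path) != ".domino-ds-config.yaml":
        problems.append("file must be named exactly .domino-ds-config.yaml")
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            text = line.rstrip("\r\n")
            indent = text[: len(text) - len(text.lstrip())]
            if "\t" in indent:
                problems.append("line %d is indented with a tab" % number)
    return problems


# Demo: a correctly named file whose second line is indented with a tab.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, ".domino-ds-config.yaml")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write('datasetConfigs:\n\t- configurationName: "PopulateRaw"\n')

problems = check_config_file(demo_path)
print(problems)  # ['line 2 is indented with a tab']
```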

In your new Jupyter Workspace, click New > Terminal to access the executor shell.

 

Screen_Shot_2018-11-28_at_2.01.12_PM.png

 

Domino has made some of the CAIT data available in a public bucket on Amazon S3. Run the following commands to fetch two files containing data on CO2 emissions by country for the years 2010 and 2011.

 

wget https://s3.amazonaws.com/dominodatalab-cait/country-emissions-2010.csv
wget https://s3.amazonaws.com/dominodatalab-cait/country-emissions-2011.csv

 

When finished, you should have the two downloaded files in your /mnt working directory.

 

Screen_Shot_2018-11-28_at_2.20.31_PM.png

 

To write these files to your output dataset, you need to copy them to the mount path set in your dataset configuration. In the PopulateRaw configuration, the output dataset was mounted at raw-output. That directory path gets appended to a base path of /domino/dataset.

To queue the files you downloaded for writing to the dataset, use these commands to move them to the output mount.

 

mv country-emissions-2010.csv /domino/dataset/raw-output/
mv country-emissions-2011.csv /domino/dataset/raw-output/

 

Now, the output mount directory contains the two files you want to write to the next snapshot for the output dataset.

 

Screen_Shot_2018-11-28_at_2.35.13_PM.png

 

To write the snapshot, all you need to do is stop your Workspace session. Click Stop from the top menu, then Stop and Commit in the prompt. Domino will detect that there is data in the output mount, and will write a new snapshot.

Back in Domino, open the Datasets tab from your project Files page. Click the name of the output dataset (cait-raw) to view details on it, and you'll see the snapshot you just wrote to the dataset.

You now have a populated dataset that you and other contributors to your project can use to access data for analysis and transformation.

 

 

 

 

 

Reading from an input dataset


In this step, you'll read in the data from the raw dataset, transform it, and write it to a new dataset. The first step is to create a new dataset configuration.

From the Files page of your project, click the filename of .domino-ds-config.yaml to open the file, then click Edit. Paste the following new dataset configuration at the end of the file, then click Save.

 

- configurationName: "WriteTrend"
  inputDatasets:
  - path: "raw-input"
    dataSetName: "cait-raw"
  outputDatasets:
  - path: "trend-output"
    dataSetName: "emissions-trend"

 

When finished, your .domino-ds-config.yaml file will describe two configurations:

  1. the PopulateRaw configuration you used in the previous step
  2. the new WriteTrend configuration that mounts the cait-raw dataset for input, and the emissions-trend dataset for output

 

Screen_Shot_2018-11-29_at_9.26.12_AM.png

 

Using the new WriteTrend configuration, you can write code that reads from the input dataset, performs some processing or analysis operations on the data within, then writes to the different output dataset.

For this step you will write and execute a Python script as a Domino Run, rather than using a Domino Workspace. This will create a repeatable step in a data pipeline, which you could run every time the input dataset changes to get an updated output dataset.

From the Files page of your project, click Add File. Name the file calculate-trend.py, paste in the Python script below, then click Save.

 

from __future__ import division
import glob

import pandas as pd

# collect all files from the input dataset, sorted so the years are in order
data_dir = "/domino/dataset/raw-input/"
data_files_list = sorted(glob.glob(data_dir + '*'))

# function takes two yearly dataframes and returns the percentage change
# in MtCO2 as a column of formatted strings
def percentage_change(year1_df, year2_df):
    year1_data = year1_df["CO2 emissions (Mt)"]
    year2_data = year2_df["CO2 emissions (Mt)"]
    change = (year2_data - year1_data) / year1_data * 100
    change_formatted = change.round(2).astype(str) + "%"
    return change_formatted

# create output dataframe (assumes every file lists countries in the same order)
emissions = pd.DataFrame()
emissions['Country'] = pd.read_csv(data_files_list[0])['Country']

# write a new column for each year-on-year pair
for i in range(1, len(data_files_list)):
    df_year_1 = pd.read_csv(data_files_list[i-1])
    df_year_2 = pd.read_csv(data_files_list[i])
    year1 = str(df_year_1.loc[0, 'Year'])
    year2 = str(df_year_2.loc[0, 'Year'])
    column_name = "Emissions change " + year1 + " to " + year2
    emissions[column_name] = percentage_change(df_year_1, df_year_2)

# send data to output dataset
output_file = "/domino/dataset/trend-output/emissions-trend.csv"
emissions.to_csv(output_file)
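
The percentage-change formula used in the script can be checked by hand with plain numbers. The sketch below applies the same arithmetic to single values rather than pandas columns. The input values are made up for illustration and are not CAIT data.

```python
def percentage_change(year1_value, year2_value):
    """Percentage change from year 1 to year 2, formatted like '25.0%'."""
    change = (year2_value - year1_value) / year1_value * 100
    return str(round(change, 2)) + "%"


# Made-up values, not CAIT data.
print(percentage_change(200.0, 150.0))  # -25.0%
print(percentage_change(80.0, 100.0))   # 25.0%
```

Note that a year 1 value of zero raises ZeroDivisionError in this scalar version, so real data may need a guard for that case.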

 

To run this script with the WriteTrend dataset configuration, click Runs from the project menu, then click Run at the top of the Runs list. Enter calculate-trend.py as the file you want to run, then below that click the Advanced datasets configuration tab and choose WriteTrend from the dropdown menu.

 

Screen_Shot_2018-11-29_at_9.55.29_AM.png

 

Click Start Run to execute your code. When finished, you'll see a new snapshot written to the emissions-trend dataset, containing the transformed data from cait-raw. Every time you run calculate-trend.py with this dataset configuration, it will read and process the latest snapshot of cait-raw and write a new snapshot of emissions-trend.

In the next step, you'll learn how to append to a dataset by adding a new file to cait-raw.

 

 

 

 

 

 

Appending to a dataset


Suppose you receive a fresh batch of raw data, in this case a new file with data from 2012 that you want to store alongside the data from 2010 and 2011. The logical operation you want to do is append that file to the existing content of the last snapshot. The procedure for appending to a Domino dataset involves mounting it for both input and output simultaneously.

Remember that by default, mounting a dataset for input makes available the files in the most recent snapshot, and mounting a dataset for output provides an empty directory, the contents of which will become the next snapshot at the end of the Run.

The high-level steps to an append are:

  1. Start a run with the dataset mounted for both input and output
  2. Copy the contents of the input mount to the output mount
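  3. Add the data you want to append to the dataset to the output mount
  4. Stop the Run or Workspace so that the contents of the output mount are written as a new snapshot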
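
If you would rather script the copy steps than type them in a terminal, steps 2 and 3 could be sketched in Python as follows. This is a minimal illustration, assuming Python 3 and a flat directory of files. The append_to_dataset function is a hypothetical helper, and temporary directories stand in for the real mount points so the sketch can run anywhere.

```python
import os
import shutil
import tempfile


def append_to_dataset(input_mount, output_mount, new_files):
    """Copy the previous snapshot forward, then add the new files."""
    # Step 2: carry over everything from the mounted input snapshot.
    for name in os.listdir(input_mount):
        source = os.path.join(input_mount, name)
        if os.path.isfile(source):  # this sketch handles flat directories only
            shutil.copy2(source, os.path.join(output_mount, name))
    # Step 3: add the data being appended.
    for path in new_files:
        shutil.copy2(path, os.path.join(output_mount, os.path.basename(path)))
    return sorted(os.listdir(output_mount))


# Demo with temporary directories standing in for the real mount points.
input_mount = tempfile.mkdtemp()
output_mount = tempfile.mkdtemp()
incoming = tempfile.mkdtemp()
for directory, name in [
    (input_mount, "country-emissions-2010.csv"),
    (input_mount, "country-emissions-2011.csv"),
    (incoming, "country-emissions-2012.csv"),
]:
    with open(os.path.join(directory, name), "w") as handle:
        handle.write("placeholder\n")

result = append_to_dataset(
    input_mount,
    output_mount,
    [os.path.join(incoming, "country-emissions-2012.csv")],
)
print(result)
```

In a real Run, input_mount and output_mount would be the two mount paths defined in your dataset configuration.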

To continue this tutorial example, you first need to write a new dataset configuration. From the Files page of your project, click the filename of .domino-ds-config.yaml to open the file, then click Edit. Paste the following new dataset configuration at the end of the file, then click Save.

 

- configurationName: "AppendRaw"
  inputDatasets:
  - path: "raw-input"
    dataSetName: "cait-raw"
  outputDatasets:
  - path: "raw-output"
    dataSetName: "cait-raw"

 

This new AppendRaw configuration mounts the cait-raw dataset for both input and output. The input mount will be at /domino/dataset/raw-input and the output mount will be at /domino/dataset/raw-output.

 

Screen_Shot_2018-11-29_at_10.10.53_AM.png

 

Now you can perform your append operation by starting up a Domino Workspace with the AppendRaw configuration.

  1. From the project menu click Workspaces.
  2. Select Jupyter.
  3. Give the Workspace an informative name.
  4. Click Add Dataset Configuration.
  5. Choose AppendRaw from the dropdown menu.
  6. Click Launch Jupyter Workspace.

 

Screen_Shot_2018-11-29_at_10.16.23_AM.png

 

In your Jupyter Workspace, click New > Terminal to access the executor shell. Follow these steps to complete your append operation:

  1. Fetch the new file with 2012 data.
    wget https://s3.amazonaws.com/dominodatalab-cait/country-emissions-2012.csv
  2. Copy the previous snapshot of the dataset from the input mount to the output mount.
    cp /domino/dataset/raw-input/* /domino/dataset/raw-output/
  3. Move the new file to the output mount.
    mv country-emissions-2012.csv /domino/dataset/raw-output/
  4. Click Stop then Stop and Commit to end this session and write the new snapshot of the dataset.

 

If you examine the cait-raw dataset from the Datasets tab on the Files page of your project, you'll see a new snapshot with the 2012 file appended to the contents of the previous snapshot. Now, if you want to, you can start a fresh Run of calculate-trend.py with the WriteTrend configuration, to update the emissions-trend dataset with the 2012 data.

 

 

 

 

 

Reverting a dataset to a previous snapshot


Reverting a dataset is similar to appending, in that you'll mount the dataset for input and output simultaneously. However, instead of mounting the latest snapshot for input, you'll mount a specified previous snapshot that you want to revert to. Suppose there was an issue with the 2012 file you added in the previous section, and you want to go back to the initial snapshot of cait-raw.

To identify a specific snapshot in your dataset configuration, you need to tag the snapshot. From the Files page of your project, click the Datasets tab. Next, click the name of the cait-raw dataset.

From the dropdown menu, choose the snapshot you want to tag, in this case Snapshot 0.

 

Screen_Shot_2018-11-29_at_4.35.27_PM.png

 

Click the + Add Tag button below the dataset name in the upper left, then fill in the string you want to tag the snapshot with. In the example below, the snapshot is tagged with the string good.

 

Screen_Shot_2018-11-29_at_4.32.35_PM.png

 

When finished, you'll see a blue tag with the string you entered appear next to the + Add Tag button. You'll now need to write a dataset configuration that mounts the tagged snapshot of the dataset. From the Files page of your project, click the filename of .domino-ds-config.yaml to open the file, then click Edit. Paste the following new dataset configuration at the end of the file, then click Save.

 

- configurationName: "RevertRaw"
  inputDatasets:
  - path: "raw-input"
    dataSetName: "cait-raw:good"
  outputDatasets:
  - path: "raw-output"
    dataSetName: "cait-raw"

 

This is similar to the append configuration, but note that the input dataset name is entered as cait-raw:good. This is how to set up a dataset configuration to mount a tagged input snapshot: a colon is appended to the dataset name, followed by the tag string.
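
The name-and-tag convention can be illustrated with a short parsing sketch. The parse_dataset_name function below is a hypothetical helper that only mirrors the convention described in this tutorial; it is not Domino's own parser.

```python
def parse_dataset_name(value):
    """Split 'name:tag' into (name, tag); tag is None when absent.

    Hypothetical helper mirroring this tutorial's naming convention.
    """
    name, separator, tag = value.partition(":")
    return name, (tag if separator else None)


print(parse_dataset_name("cait-raw:good"))  # ('cait-raw', 'good')
print(parse_dataset_name("cait-raw"))       # ('cait-raw', None)
```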

Now you can perform your revert operation by starting up a Domino Workspace with the RevertRaw configuration.

  1. From the project menu click Workspaces.
  2. Select Jupyter.
  3. Give the Workspace an informative name.
  4. Click Add Dataset Configuration.
  5. Choose RevertRaw from the dropdown menu.
  6. Click Launch Jupyter Workspace.

 

Screen_Shot_2018-11-29_at_5.19.03_PM.png

 

In your Jupyter Workspace, click New > Terminal to access the executor shell. Follow these steps to complete your revert operation:

  1. Copy the tagged snapshot contents from the input mount to the output mount.
    cp /domino/dataset/raw-input/* /domino/dataset/raw-output/
  2. Click Stop then Stop and Commit to end this session and write the new snapshot of the dataset.

 

If you examine the cait-raw dataset from the Datasets tab on the Files page of your project, you'll see a new Snapshot 2 with only the original two files from Snapshot 0 in it.

 

 

 

 

 

 

 

Sharing a dataset


To give another user access to the datasets in your project, you need to add them to the project as a Contributor or a Project Importer. Once your colleague has been granted one of those permissions on the project, they can refer to your datasets in a .domino-ds-config.yaml file using the format:


<your-username>/<your-project-name>/<dataset-name>

 

For example, to mount the emissions-trend dataset from the above examples as an input dataset, your colleague would use a configuration like the following. In this path, documentation is the username of the project owner and cait-data is the project name:

 

datasetConfigs:
- configurationName: "import"
  inputDatasets:
  - path: "cait-input"
    dataSetName: "documentation/cait-data/emissions-trend"