Data access, cleansing, manipulation, and aggregation are all critical early steps in a data science experimentation workflow. "Data" as a first-class entity is a new concept we have introduced to simplify the data workflow for our data scientists and data science teams.
How does it work?
When you click into the Data area of the product from the top-level navigation bar, you land on an overview page where you can create a New Data Set or a Quick Start Data Set.
The Quick Start Data Set provides examples for connecting to an external data warehouse and to a Hadoop cluster. It also shows how to check and cleanse raw data that lives in a file so that a clean data set is available for others to import into their projects. If you are a new user, this Quick Start Data Set is automatically populated in your Data Overview page.
Inside the Data area of the product, you can easily search for and find all of the different data sets you have created or others have made visible to you.
Inside of a selected data set, you will notice that the interface looks very similar to the interface inside of a project. The data set will be configured so that it is ready to be imported by a project.
When does it make sense to create a data set?
You may already be doing a lot of your data work inside an existing project. In many cases, it makes sense to extract that data work from the project into its own data set. For example:
- Your data files are nested inside of a large project. Create a data set so that other projects only have to import the data set instead of the entire project.
- During your exploration and experimentation, you created a new and interesting data set that you would like to save for the future or share with others.
- You have a canonical data set used by multiple projects. Manage it in a single place.
- You have an external data source requiring login credentials (such as a database). Represent this data source securely via environment variables in a single data set that can be used by many projects.
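For the last case, here is a minimal Python sketch of reading connection credentials from environment variables rather than hard-coding them. The variable names DB_HOST, DB_USER and DB_PASSWORD and the helper load_db_credentials are hypothetical, chosen only for illustration; use whatever names your data set actually defines.

```python
import os

# Hypothetical variable names (an assumption, not names defined by the product).
REQUIRED_VARS = ("DB_HOST", "DB_USER", "DB_PASSWORD")


def load_db_credentials(env=None):
    """Read connection settings from environment variables.

    Raises KeyError listing every missing or empty variable, so a
    misconfigured data set fails loudly instead of connecting with blanks.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    # Return lower-cased keys, e.g. {"db_host": ..., "db_user": ..., "db_password": ...}
    return {name.lower(): env[name] for name in REQUIRED_VARS}
```

Because the credentials never appear in the code itself, any project importing the data set can share the same script without copying the secrets.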
How do I create a data set from an existing project?
From the files page in a project, select the files you would like to extract from the project and then select Create Data Set.
You will be presented with the options to automatically import the newly created data set into the current project and to delete the selected files from the project.
WARNING: Deleting the selected data set files from the main project will impact downstream projects that import those files from the main project. The working folder and its environment variable name will change, which causes all existing relative references to these files to stop working. You will need to change the environment variable name to DOMINO_<USERNAME>_<DATA SET NAME>_WORKING_DIR, the alias for the path where the imported data set files can be accessed by other projects. See the support article on Domino Environment Variables for more information.
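As a rough illustration, a downstream project could build file paths from the environment variable instead of relying on relative references. This is a sketch only: the helper dataset_path, the example variable name DOMINO_ALICE_SALES_WORKING_DIR and the file raw/data.csv are hypothetical. Check the actual variable name for your data set in the product.

```python
import os


def dataset_path(variable_name, relative_path, env=None):
    """Resolve a file inside an imported data set via its working-directory variable.

    variable_name follows the DOMINO_<USERNAME>_<DATA SET NAME>_WORKING_DIR
    pattern; the caller supplies the exact name.
    """
    env = os.environ if env is None else env
    base = env.get(variable_name)
    if base is None:
        raise KeyError(variable_name + " is not set; is the data set imported?")
    return os.path.join(base, relative_path)


# Hypothetical usage inside a project that imports the data set:
# csv_file = dataset_path("DOMINO_ALICE_SALES_WORKING_DIR", "raw/data.csv")
```

Resolving paths this way keeps scripts working if the data set's mount location changes, since only the environment variable needs to point to the new folder.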