Using PySpark in Jupyter Workspaces

Overview


You can configure a Domino Workspace to launch a Jupyter notebook with a connection to your Spark cluster. This allows you to operate the cluster interactively from Jupyter with PySpark.

The instructions for configuring a PySpark Workspace are below. To use them, you must have a Domino environment that meets the following prerequisites:

  - It uses a Domino standard base image.
  - It already has the binaries and configuration files required to connect to your Spark cluster installed.

Adding a PySpark Workspace option to your environment


  1. From the Domino main menu, click Environments.
  2. Click the name of an environment that meets the prerequisites listed above: it must use a Domino standard base image and already have the necessary binaries and configuration files installed for connecting to your Spark cluster.
  3. On the environment overview page, click Edit Definition.


  4. In the Pluggable Workspace Tools field, paste the following YAML configuration.

    pyspark:
      title: "PySpark"
      start: [ "/var/opt/workspaces/pyspark/start" ]
      iconUrl: "https://raw.githubusercontent.com/dominodatalab/workspace-configs/develop/workspace-logos/PySpark.png"
      httpProxy:
        port: 8888
        rewrite: false
        internalPath: "/{{#if pathToOpen}}tree/{{pathToOpen}}{{/if}}"
      supportedFileExtensions: [ ".ipynb" ]
    



  5. Click Build to apply the changes and build a new version of the environment. Upon a successful build, the environment is ready for use.
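If the build fails at this step, a common culprit is malformed YAML in the Pluggable Workspace Tools field. As a sanity check, you can parse the configuration locally before pasting it; a minimal sketch, assuming a Python interpreter with the `pyyaml` package installed:

```python
# Parse the workspace configuration locally to catch indentation or quoting
# mistakes before building the environment. The text below is the same block
# pasted into the Pluggable Workspace Tools field.
import yaml

config_text = """
pyspark:
  title: "PySpark"
  start: [ "/var/opt/workspaces/pyspark/start" ]
  iconUrl: "https://raw.githubusercontent.com/dominodatalab/workspace-configs/develop/workspace-logos/PySpark.png"
  httpProxy:
    port: 8888
    rewrite: false
    internalPath: "/{{#if pathToOpen}}tree/{{pathToOpen}}{{/if}}"
  supportedFileExtensions: [ ".ipynb" ]
"""

config = yaml.safe_load(config_text)
workspace = config["pyspark"]

# Confirm the fields Domino expects are present and well-typed.
assert workspace["httpProxy"]["port"] == 8888
assert workspace["supportedFileExtensions"] == [".ipynb"]
print("YAML parses cleanly; keys:", sorted(workspace))
```

If `yaml.safe_load` raises a `yaml.YAMLError`, fix the reported line before pasting the configuration into the environment definition.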

Launching PySpark Workspaces


  1. Open the project in which you want to use a PySpark Workspace.
  2. In the project settings, follow the provider-specific instructions from the Hadoop and Spark overview for setting up a project to work with an existing Spark connection environment; this includes enabling YARN integration.
  3. On the Hardware & Environment tab of the project settings, choose the environment you added a PySpark configuration to in the previous section.
  4. Once the above settings are applied, you can launch a PySpark Workspace from the Workspaces dashboard.
