
Using Spark and Hadoop with Domino

This article describes ways to access Spark or Hadoop from analyses running on Domino.

Domino provides built-in support for connecting to Spark clusters. You can run Spark against a Hadoop (YARN) cluster, against a Standalone Spark cluster, or in Local mode. Spark is supported in PySpark (Jupyter), Zeppelin, and RStudio notebooks.

If you already have a cluster

If you already have a Hadoop or Spark cluster, code running on Domino can access it just as you normally would: Domino's executor machines can act as edge nodes of your existing cluster.

Configuring Spark connection

Spark Standalone cluster

To configure Spark integration with a Standalone cluster, open your project and go to “Project settings.” Under “Integrations”, choose the “Standalone Spark Cluster” option for Apache Spark. Enter the address of the cluster’s master instance. Optionally, you can provide additional configuration values for Spark. Click “Save” to save your changes.

Now you can start a notebook, and your Spark session will be connected to the Standalone Spark cluster.
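
As a quick check, the snippet below (a minimal sketch; `sc` is the SparkContext that PySpark notebooks create for you) confirms that the session points at your cluster and that the executors respond:

# The master URL should point at your Standalone cluster master.
print(sc.master)   # e.g. spark://<master-address>:7077

# A trivial job confirms the executors are reachable.
print(sc.parallelize(range(100)).sum())   # 0 + 1 + ... + 99 = 4950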

Running Spark on a Hadoop (YARN) cluster

To configure Spark integration with a Hadoop cluster, open your project and go to “Project settings.” Under “Integrations”, choose the “YARN” option for Apache Spark. Optionally, you can provide additional configuration values for Spark, custom hosts entries for your cluster’s instances, and the Hadoop username to use. Click “Save” to save your changes.

Your cluster’s configuration is stored in a set of XML files, which are required to connect to the cluster. Upload them to your project under `hadoop/conf`.
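
As an illustration, these are typically the standard Hadoop client configuration files; the exact set depends on your distribution, but it often looks like this:

hadoop/
  conf/
    core-site.xml    # fs.defaultFS and other core settings
    hdfs-site.xml    # HDFS client settings
    yarn-site.xml    # ResourceManager address and other YARN settings
    mapred-site.xml  # MapReduce settings, if present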

Now you can start a notebook, and your Spark session will be connected to the Hadoop cluster.
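
As with the Standalone case, a short sanity check from a PySpark notebook might look like this (the HDFS path is a hypothetical placeholder):

# On a YARN integration the master should report "yarn"
# ("yarn-client" on older Spark versions).
print(sc.master)

# Reading a file from HDFS confirms the configuration works end to end.
lines = sc.textFile("hdfs:///user/<your-user>/some-file.txt")
print(lines.count())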

If you need Kerberos authentication, contact us.

Spark examples

After you’ve configured your project to connect to the cluster, you can launch a PySpark (Jupyter), Zeppelin, or RStudio notebook, and you’ll be connected to the cluster automatically. We have a sample project with examples in Python (using PySpark), Scala (using Zeppelin), and R (using RStudio).

Example in Python:

from __future__ import print_function
import time
from operator import add

# `sc` (the SparkContext) is created automatically in a PySpark notebook.
data_file = "mobydick.txt"
start_time = time.time()

# Classic word count: split each line into words, pair each word with 1,
# and sum the counts by key.
lines = sc.textFile(data_file, 1)
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)
output = counts.collect()
sc.stop()

print(output[:20])
print("--- %s seconds ---" % (time.time() - start_time))

Example in Scala:

The Scala example runs in a Zeppelin notebook.

// `sc` is provided by Zeppelin's Spark interpreter.
val textFile = sc.textFile("mobydick.txt")
// To read from HDFS instead: val textFile = sc.textFile("hdfs://...")

// Word count: split lines into words, pair each with 1, sum by key.
val wordCounts = textFile.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey((a, b) => a + b)
wordCounts.collect()

Example in R:

(coming soon) 

Hadoop examples

Because Domino runs your code in fully customizable compute environments, you have complete flexibility to install your own tools, drivers, and packages there.

We have a public project with examples of accessing a Hadoop cluster hosted on IBM's Big Data Sandbox. In the examples, we use Domino's environment variables feature to securely store the credentials for accessing the cluster. You can then connect to the cluster using a variety of libraries in the programming language of your choice.

Example in R:

install.packages("RJDBC")
library(RJDBC)

# Credentials are stored as Domino environment variables.
username = Sys.getenv("USERNAME")
password = Sys.getenv("PASSWORD")

database = "bigsql"
hostname = "iop-bi-master.imdemocloud.com"
port = "32051"

# Connect to Big SQL through the DB2 JDBC driver
# (db2jcc4.jar is uploaded to the project).
jcc = JDBC("com.ibm.db2.jcc.DB2Driver", "./db2jcc4.jar")
jdbc_path = paste("jdbc:db2://", hostname, ":", port, "/", database, sep="")
conn = dbConnect(jcc, jdbc_path, user=username, password=password)

# Run a query and fetch all rows into a data frame.
query = "select * from gosalesdw.emp_employee_dim"
rs = dbSendQuery(conn, query)
df = fetch(rs, -1)
df

write.csv(df, file = "df.csv")

Example in Python:

import os
import ibm_db

# Credentials are stored as Domino environment variables.
username = os.environ['USERNAME']
password = os.environ['PASSWORD']

database = "bigsql"
hostname = "iop-bi-master.imdemocloud.com"
port = "32051"

# Build the DB2 connection string ({{ }} produces literal braces
# when the string is passed through str.format).
conn_string = (
    "DRIVER={{IBM DB2 ODBC DRIVER}};"
    "DATABASE={0};"
    "HOSTNAME={1};"
    "PORT={2};"
    "PROTOCOL=TCPIP;"
    "UID={3};"
    "PWD={4};").format(database, hostname, port, username, password)
conn = ibm_db.connect(conn_string, "", "")

# Run a query and fetch the first row as a dictionary.
query = "select * from gosalesdw.emp_employee_dim"
stmt = ibm_db.exec_immediate(conn, query)
ibm_db.fetch_both(stmt)

Using a local Spark cluster

We assume that if you are interested in Hadoop, it is because your data volumes are so large that it wouldn't be practical to transfer them over the network onto a new cluster.

However, many people use Spark for its expressive API, even if their data volumes are small or medium. Because Domino lets you run code on large machines (e.g., in AWS, 32 cores and 240GB of memory), you can create a "local" Spark cluster and easily parallelize your tasks across all 32 cores.

Configuring Spark in Local mode

To configure Spark integration in Local mode, open your project and go to “Project settings.” Under “Integrations”, choose the “Local mode” option for Apache Spark. Click “Save” to save your changes.
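
In Local mode, a short PySpark sketch (again assuming the `sc` that the notebook creates for you) shows the work spreading across the machine's cores:

# A local[*] master uses every core on the machine.
print(sc.master)               # e.g. local[*]
print(sc.defaultParallelism)   # number of available cores, e.g. 32

# This map runs in parallel across all local cores.
import math
total = sc.parallelize(range(10 ** 6)).map(math.sqrt).sum()
print(total)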
