Whether you are using Python, R or another language, you may have datasets residing in Amazon S3. This article will walk you through some simple examples which show how to access and use S3 data from within Domino.
From within a Jupyter notebook or a batch script, you can connect to S3 from Python using one of several available APIs. One of the easiest we have seen is Boto (boto3). You will need to add boto3 to your requirements.txt file before proceeding. Alternatively, if you are using a Docker image that does not include this Python library, you can add the following RUN statement to a new Dockerfile inside of Domino:
RUN pip install awscli boto3
Let's set up four constants for getting a CSV file from S3. In the example below, replace the values with your own:
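These four constants mirror the ones used in the complete example at the end of this article; the values shown are placeholders:

```python
# Placeholder values -- substitute your own bucket, keys, and file name
YOUR_BUCKET = 'YOURBUCKETNAME'
YOUR_ACCESS_KEY = 'YOURKEY'
YOUR_SECRET_KEY = 'YOURSECRETKEY'
YOUR_DATA_FILE = 'yourdatafile.csv'
```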
Instead of hard-coding S3 values into this script, it is even better to use Domino environment variables. That way, your code is truly isolated from the S3 access credentials.
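For example, rather than assigning the keys as literals, you might read them from environment variables set in your Domino project settings. A minimal sketch (the variable names here are illustrative assumptions, and the 'unset' fallbacks simply keep the snippet runnable before the variables exist):

```python
import os

# Read credentials from the environment instead of hard-coding them.
# These environment variable names are assumptions; use whatever names
# you configured in your Domino project settings.
YOUR_ACCESS_KEY = os.environ.get('AWS_ACCESS_KEY_ID', 'unset')
YOUR_SECRET_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY', 'unset')
```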
From here you can copy down the file and then start working against it:
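A minimal sketch of this copy-then-read step, written as a function so the boto3 client is passed in (the bucket, key, and credential values in the usage comment are placeholders):

```python
import pandas as pd

def copy_and_read(client, bucket, key, dest):
    """Download an S3 object to the local file system, then load it with Pandas."""
    client.download_file(bucket, key, dest)  # boto3 S3 client method
    return pd.read_csv(dest)

# Usage with a real client (credential values are placeholders):
# import boto3
# client = boto3.client('s3', aws_access_key_id='YOURKEY',
#                       aws_secret_access_key='YOURSECRETKEY')
# df = copy_and_read(client, 'YOURBUCKETNAME', 'yourdatafile.csv', 'yourdatafile.csv')
```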
From within a Jupyter session, you could then verify that the file was copied down:
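From the notebook you can check with a shell command such as !ls, or from Python itself (the file name here is the placeholder used throughout this article):

```python
import os

# The downloaded object lands in the current working directory by default
print(os.path.exists('yourdatafile.csv'))
print([f for f in os.listdir('.') if f.endswith('.csv')])
```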
And then using Pandas start to read and use the file:
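To keep this snippet self-contained, it parses an inline CSV string; reading the downloaded file is identical except you pass the file name instead. The sample rows are made up purely for illustration:

```python
import io
import pandas as pd

# Illustrative CSV content; with the downloaded file you would call
# pd.read_csv('yourdatafile.csv') instead.
csv_text = "col_a,col_b\n1,x\n2,y\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```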
If your file is small or transient in nature, you may not even want to copy it down - you can read it directly from S3 in situ. In the following example, the object body returned by the client is read into a Pandas DataFrame named "test2":
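A sketch of the in-situ read as a function; the boto3 client is created elsewhere with your credentials, and no local copy is written:

```python
import io
import pandas as pd

def read_in_situ(client, bucket, key):
    """Stream an S3 object's bytes straight into a DataFrame, skipping the disk."""
    response = client.get_object(Bucket=bucket, Key=key)
    body = response["Body"].read()
    return pd.read_csv(io.BytesIO(body))
```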
Shown below is a complete code example. Remember, you will need two things:
- To add boto3 to your requirements.txt file
- To correctly enter the four constant values for your S3 access information.
# in this example we added boto3 to our requirements.txt file
import io

import boto3
import pandas as pd

YOUR_BUCKET = 'YOURBUCKETNAME'
YOUR_ACCESS_KEY = 'YOURKEY'
YOUR_SECRET_KEY = 'YOURSECRETKEY'
YOUR_DATA_FILE = 'yourdatafile.csv'

client = boto3.client(
    's3',
    aws_access_key_id=YOUR_ACCESS_KEY,
    aws_secret_access_key=YOUR_SECRET_KEY)

# First example: copy then read
client.download_file(YOUR_BUCKET, YOUR_DATA_FILE, YOUR_DATA_FILE)
pd.read_csv(YOUR_DATA_FILE)

# Second example: read in situ
response = client.get_object(Bucket=YOUR_BUCKET, Key=YOUR_DATA_FILE)
file = response["Body"].read()
# header=1 treats the file's second row as the header row
test2 = pd.read_csv(io.BytesIO(file), header=1, delimiter=",", low_memory=False)
test2.head(5)
R or RStudio
To access S3 data from within R or RStudio, you will need two packages installed. You can add custom Dockerfile commands like so:
RUN R -e 'install.packages(c("httr","xml2"), repos="https://cran.r-project.org")'
RUN R -e 'install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat"))'
Or from within RStudio, you can install them manually:
install.packages(c("httr","xml2"), repos = "https://cran.r-project.org")
install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat"))
From here, the same basic principle applies: supply your access key, secret key, and so forth. Here is an example that retrieves a CSV file and saves it to the file system for later use by R:
Sys.setenv("AWS_ACCESS_KEY_ID" = "YOURACCESSKEY",
           "AWS_SECRET_ACCESS_KEY" = "YOURSECRETKEY",
           "AWS_DEFAULT_REGION" = "us-west-1")
fname <- "ANS3-FILE.csv"
sname <- paste("/", fname, sep = "")
save_object(sname, bucket = "YOURBUCKETNAME", file = fname)