Analyzing UFO Dataset stored on Ceph Object Storage using PySpark & JupyterHub :: All running on Mothership (K8s)

Karan Singh
7 min read · Apr 15, 2019


In the last blog, I explained the step-by-step procedure for setting up an OpenShift 4 cluster, followed by a Rook-operator-based Ceph cluster running on OpenShift. If you recall, we created a Ceph Object Storage endpoint and user; I will repurpose that as an external Ceph S3 endpoint.

In this blog, I will use minikube for compute and reuse the existing Ceph Object Storage S3 endpoint created in the last blog (based on Rook Ceph on OpenShift 4).

Let’s start with some million-dollar questions:

Which are the TOP countries and cities reporting the most UFO sightings? And what is the shape of a UFO?

By the end of this blog, you will have your answers, based on empirical evidence.

Step 1: Launch a Minikube cluster

  • Set up a Minikube cluster with the dashboard (optional)
minikube start --kubernetes-version v1.14.1 --cpus 4 --memory 3000 -p minikube-v1.14.1
minikube addons enable dashboard -p minikube-v1.14.1
minikube addons enable ingress -p minikube-v1.14.1
minikube addons list -p minikube-v1.14.1
minikube dashboard -p minikube-v1.14.1
kubectl get po --all-namespaces
  • (Optional) verify the minikube cluster (notes to myself)
kubectl run hello-minikube --image=k8s.gcr.io/echoserver:1.4 --port=8080
kubectl expose deployment hello-minikube --type=NodePort
minikube service hello-minikube -p minikube-v1.14.1
minikube ip -p minikube-v1.14.1
curl $(minikube ip -p minikube-v1.14.1):30246

Step 3: Launch JupyterHub with PySpark support, so that we can use Spark to crunch the UFO dataset and visualize the results in a Jupyter Notebook.

  • Set up Helm
# Get cmd line
curl https://raw.githubusercontent.com/kubernetes/helm/master/scripts/get | bash

# Create service account
kubectl --namespace kube-system create serviceaccount tiller

# Enable RBAC
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller

# Set up Helm on the cluster
helm init --service-account tiller

# Verify
helm version

# Secure Helm
kubectl --namespace=kube-system patch deployment tiller-deploy --type=json --patch='[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["/tiller", "--listen=localhost:44134"]}]'
  • Generate a random hex string representing 32 bytes to use as a security token
openssl rand -hex 32
  • Create a file called config.yaml and add the random key as the security token:
proxy:
  secretToken: "7e9f0d3b2c7d91dd20f97cd4343b0069ca70b7d3dc40a32053fa3e128c7a5b28"
singleuser:
  image:
    name: jupyter/all-spark-notebook
    tag: latest
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
  • Now install the chart configured by your config.yaml by running these commands from the directory that contains config.yaml:
kubectl create ns jhub
helm search jupyterhub
RELEASE=jhub
NAMESPACE=jhub
helm upgrade --install $RELEASE jupyterhub/jupyterhub \
--version=0.9-fbabecf \
--namespace=$NAMESPACE \
--timeout=3600 \
-f config.yaml
  • The output should look something like this
$ helm upgrade --install $RELEASE jupyterhub/jupyterhub \
> --version=0.9-fbabecf \
> --namespace=$NAMESPACE \
> --timeout=3600 \
> -f config.yaml
Release "jhub" does not exist. Installing it now.
NAME: jhub
LAST DEPLOYED: Sun Apr 14 11:22:34 2019
NAMESPACE: jhub
STATUS: DEPLOYED
RESOURCES:
==> v1/ConfigMap
NAME DATA AGE
hub-config 1 1s
==> v1/Deployment
NAME READY UP-TO-DATE AVAILABLE AGE
hub 0/1 1 0 1s
proxy 0/1 1 0 1s
==> v1/PersistentVolumeClaim
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
hub-db-dir Bound pvc-d9e36b52-5e96-11e9-8ab6-08002749c742 1Gi RWO standard 1s
==> v1/Pod(related)
NAME READY STATUS RESTARTS AGE
hub-694884d94-47kq8 0/1 ContainerCreating 0 1s
proxy-56cd8d8695-rc8hd 0/1 ContainerCreating 0 1s
==> v1/Role
NAME AGE
hub 1s
==> v1/RoleBinding
NAME AGE
hub 1s
==> v1/Secret
NAME TYPE DATA AGE
hub-secret Opaque 2 1s
==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
hub ClusterIP 10.108.254.178 <none> 8081/TCP 1s
proxy-api ClusterIP 10.105.122.30 <none> 8001/TCP 1s
proxy-public LoadBalancer 10.104.33.189 <pending> 80:32226/TCP,443:30690/TCP 1s
==> v1/ServiceAccount
NAME SECRETS AGE
hub 1 1s
==> v1/StatefulSet
NAME READY AGE
user-placeholder 0/0 1s
==> v1beta1/PodDisruptionBudget
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
hub 1 N/A 0 1s
proxy 1 N/A 0 1s
user-placeholder 0 N/A 0 1s
user-scheduler 1 N/A 0 1s
NOTES:
Thank you for installing JupyterHub!
Your release is named jhub and installed into the namespace jhub.

You can find if the hub and proxy is ready by doing:

  kubectl --namespace=jhub get pod

and watching for both those pods to be in status 'Ready'.

You can find the public IP of the JupyterHub by doing:

  kubectl --namespace=jhub get svc proxy-public

It might take a few minutes for it to appear!

Note that this is still an alpha release! If you have questions, feel free to
1. Read the guide at https://z2jh.jupyter.org
2. Chat with us at https://gitter.im/jupyterhub/jupyterhub
3. File issues at https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues
  • Open JupyterHub by grabbing the proxy-public service URL:
minikube service list -p minikube-v1.14.1
  • Log into JupyterHub with any username and password
  • Once you are logged into JupyterHub, import the pyspark-test.ipynb notebook:
wget https://raw.githubusercontent.com/ksingh7/analyzing_ufo_dataset_with_spark/master/pyspark-test.ipynb
  • Run the notebook; if it completes successfully, PySpark is running properly.

At this point, JupyterHub is set up correctly on K8s and is ready to be used with Spark (PySpark).

Step 4: Import and run the UFO analysis notebook

In the next steps, we will import a new Jupyter notebook that is pre-configured with a PySpark context, pull the UFO dataset from my GitHub account, ingest it into the Ceph Object Storage cluster, and perform the analysis using PySpark.

  • Download the UFO analysis notebook and upload it to JupyterHub. You can run each cell of the notebook to perform the analysis.
wget https://raw.githubusercontent.com/ksingh7/analyzing_ufo_dataset_with_spark/master/UFO_Data_analysis_Spark_Ceph.ipnb
  • Download the dataset and install the boto and plotly Python packages.
  • Using the boto S3 library, set up the Ceph RGW user's access_key, secret_key, and the Ceph Object Storage endpoint details.
  • After running this notebook cell, Python connects to the Ceph S3 endpoint and lists all the buckets owned by this user; this is a quick functional test to verify connectivity to the Ceph cluster.
  • Two steps back we downloaded the UFO dataset; now it's time to upload it to Ceph object storage, so that Spark can read the dataset into data frames and make it available in-memory for analysis.
  • Create a Spark context by creating a Spark session (a rough sketch of these notebook cells follows this list).
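Here is a minimal sketch of what these cells look like. The endpoint hostname, credentials, bucket name, and file name below are placeholders, not values from the original notebook; substitute your own Ceph RGW details.

import boto
import boto.s3.connection
from pyspark.sql import SparkSession

# Placeholder Ceph RGW credentials and endpoint (replace with your own values)
access_key = "RGW_ACCESS_KEY"
secret_key = "RGW_SECRET_KEY"
rgw_host = "ceph-rgw.example.com"  # hypothetical Ceph Object Gateway hostname

# Connect to the Ceph S3 endpoint and list all buckets owned by this user (functional test)
conn = boto.connect_s3(
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    host=rgw_host,
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)
for bucket in conn.get_all_buckets():
    print(bucket.name)

# Upload the downloaded UFO dataset to a bucket
bucket = conn.create_bucket("ufo-dataset")           # hypothetical bucket name
key = bucket.new_key("ufo_scrubbed.csv")             # hypothetical object name
key.set_contents_from_filename("ufo_scrubbed.csv")   # local file downloaded earlier

# Create a Spark session (which gives us a Spark context)
spark = SparkSession.builder.appName("ufo-analysis").getOrCreate()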

The often-asked question is: "How can analytics tools like Spark connect to and consume datasets directly from remote object storage systems?"

The answer lies in three magical letters: "S3A".

Hadoop's "S3A" client offers high-performance IO against the Amazon S3 object store and compatible implementations, and Red Hat Ceph Storage is one of them. Hadoop Common supports the S3A filesystem out of the box; it's just a matter of configuring keys and endpoints. (Read more about S3A)
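For example, the notebook can point Spark's S3A connector at the Ceph RGW endpoint through the Hadoop configuration. A sketch, reusing the placeholder endpoint and keys from above:

# Point Hadoop's S3A filesystem at the Ceph RGW endpoint (placeholder values)
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", "http://ceph-rgw.example.com:8080")
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)
hadoop_conf.set("fs.s3a.path.style.access", "true")        # Ceph RGW typically uses path-style URLs
hadoop_conf.set("fs.s3a.connection.ssl.enabled", "false")  # plain HTTP in this test setup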

  • Load the dataset directly from Ceph Object Storage into a Spark DataFrame (see the sketch after this list)
  • Show me the schema, baby!
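A sketch of those two cells, assuming the placeholder bucket and object names used in the upload sketch above:

# Read the UFO CSV straight from Ceph Object Storage via the s3a:// scheme
df = spark.read.csv("s3a://ufo-dataset/ufo_scrubbed.csv", header=True, inferSchema=True)

# Show me the schema
df.printSchema()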

Summary 1: As per the results, Americans reported the most UFO sightings.

Well … I don't know; they must be sitting next to the window forever counting UFOs … or it's the Hollywood mindset of sci-fi movies ;)

Key takeaway: Next time you visit the US … don't do anything, just stare at the sky; you might get lucky ;) (a recommendation based on the data)

  • I drilled down and found that Seattle reported the most UFO sightings (see the aggregation sketch below).
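The drill-down behind these findings is a simple aggregation. A sketch, assuming the dataset exposes country and city columns (the column names are an assumption about the Kaggle dataset, not shown in this post):

# Top countries by number of reported sightings
df.groupBy("country").count().orderBy("count", ascending=False).show(10)

# Drill down: top cities by number of reported sightings
df.groupBy("city").count().orderBy("count", ascending=False).show(10)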

Is it because Seattle has offices in skyscrapers with large glass walls?

So, a UFO looks like a light; well, we know this from the movies, and now the data confirms the belief.

Fun of the day: make sure to read all the x-axis labels; my favourite one is "cigar" ;)
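The shape chart can be produced along these lines; a sketch using plotly's offline mode, assuming a shape column in the dataset:

import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

# Count sightings per reported UFO shape and bring the small result back to the driver
shape_counts = df.groupBy("shape").count().orderBy("count", ascending=False).toPandas()

# Render a bar chart inside the notebook (the x-axis labels are the reported shapes)
init_notebook_mode(connected=True)
iplot(go.Figure(data=[go.Bar(x=shape_counts["shape"], y=shape_counts["count"])]))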

  • The last one is kind of expected.

Summary: The idea was to play around with multiple technologies and configure each of them the hard way (learning by doing).

We used K8s (minikube), Ceph Storage (rook-ceph), OpenShift, Jupyter Notebooks (JupyterHub), Spark (PySpark), Kaggle (the UFO dataset), and time.

I hope this was a fun exercise!

Written by Karan Singh

Co-Founder & CTO @ Scogo AI ♦ I Love to solve problems using Tech