Analyzing UFO Dataset stored on Ceph Object Storage using PySpark & JupyterHub :: All running on Mothership (K8s)

Karan Singh
7 min read · Apr 15, 2019


In the last blog, I explained the step-by-step procedure for setting up an OpenShift 4 cluster, followed by a Rook-operator-based Ceph cluster running on OpenShift. If you recall, we created a Ceph Object Storage endpoint and user; I will repurpose that as an external Ceph S3 endpoint.

In this blog, I will use minikube for compute and reuse the existing Ceph Object Storage S3 endpoint created in the last blog (based on Rook Ceph on OpenShift 4).

Let’s start with some million-dollar questions:

Which are the TOP countries and cities reporting the most UFO sightings? And what is the shape of a UFO?

By the end of this blog, you will have your answers, based on empirical evidence.

Step 1: Launch a Minikube cluster

  • Set up a Minikube cluster with the dashboard (optional)
minikube start --kubernetes-version v1.14.1 --cpus 4 --memory 3000 -p minikube-v1.14.1
minikube addons enable dashboard -p minikube-v1.14.1
minikube addons enable ingress -p minikube-v1.14.1
minikube addons list -p minikube-v1.14.1
minikube dashboard -p minikube-v1.14.1
kubectl get po --all-namespaces
  • (Optional) verify the minikube cluster (notes to myself)
kubectl run hello-minikube --image=k8s.gcr.io/echoserver:1.4 --port=8080
kubectl expose deployment hello-minikube --type=NodePort
minikube service hello-minikube -p minikube-v1.14.1
minikube ip -p minikube-v1.14.1
curl $(minikube ip -p minikube-v1.14.1):30246

Step 3: Launch JupyterHub with PySpark support, so that we can use Spark to crunch the UFO dataset and visualize the results in a Jupyter Notebook.

  • Set up Helm
# Get cmd line
curl https://raw.githubusercontent.com/kubernetes/helm/master/scripts/get | bash

# Create service account
kubectl --namespace kube-system create serviceaccount tiller

# Enable RBAC
kubectl create clusterrolebinding tiller --clusterrole cluster-admin --serviceaccount=kube-system:tiller

# Set up Helm on the cluster
helm init --service-account tiller

# Verify
helm version

# Secure Helm
kubectl --namespace=kube-system patch deployment tiller-deploy --type=json --patch='[{"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["/tiller", "--listen=localhost:44134"]}]'
  • Generate a random hex string representing 32 bytes to use as a security token
openssl rand -hex 32
  • Create a file called config.yaml and add the random key as the security token:
proxy:
  secretToken: "7e9f0d3b2c7d91dd20f97cd4343b0069ca70b7d3dc40a32053fa3e128c7a5b28"
singleuser:
  image:
    name: jupyter/all-spark-notebook
    tag: latest
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo update
  • Now install the chart configured by your config.yaml by running these commands from the directory that contains config.yaml:
kubectl create ns jhub
helm search jupyterhub
RELEASE=jhub
NAMESPACE=jhub
helm upgrade --install $RELEASE jupyterhub/jupyterhub \
--version=0.9-fbabecf \
--namespace=$NAMESPACE \
--timeout=3600 \
-f config.yaml
  • The output should look something like this
$ helm upgrade --install $RELEASE jupyterhub/jupyterhub \
> --version=0.9-fbabecf \
> --namespace=$NAMESPACE \
> --timeout=3600 \
> -f config.yaml
Release "jhub" does not exist. Installing it now.
NAME: jhub
LAST DEPLOYED: Sun Apr 14 11:22:34 2019
NAMESPACE: jhub
STATUS: DEPLOYED
RESOURCES:
==> v1/ConfigMap
NAME DATA AGE
hub-config 1 1s
==> v1/Deployment
NAME READY UP-TO-DATE AVAILABLE AGE
hub 0/1 1 0 1s
proxy 0/1 1 0 1s
==> v1/PersistentVolumeClaim
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
hub-db-dir Bound pvc-d9e36b52-5e96-11e9-8ab6-08002749c742 1Gi RWO standard 1s
==> v1/Pod(related)
NAME READY STATUS RESTARTS AGE
hub-694884d94-47kq8 0/1 ContainerCreating 0 1s
proxy-56cd8d8695-rc8hd 0/1 ContainerCreating 0 1s
==> v1/Role
NAME AGE
hub 1s
==> v1/RoleBinding
NAME AGE
hub 1s
==> v1/Secret
NAME TYPE DATA AGE
hub-secret Opaque 2 1s
==> v1/Service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
hub ClusterIP 10.108.254.178 <none> 8081/TCP 1s
proxy-api ClusterIP 10.105.122.30 <none> 8001/TCP 1s
proxy-public LoadBalancer 10.104.33.189 <pending> 80:32226/TCP,443:30690/TCP 1s
==> v1/ServiceAccount
NAME SECRETS AGE
hub 1 1s
==> v1/StatefulSet
NAME READY AGE
user-placeholder 0/0 1s
==> v1beta1/PodDisruptionBudget
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
hub 1 N/A 0 1s
proxy 1 N/A 0 1s
user-placeholder 0 N/A 0 1s
user-scheduler 1 N/A 0 1s
NOTES:
Thank you for installing JupyterHub!
Your release is named jhub and installed into the namespace jhub.

You can find if the hub and proxy is ready by doing:

  kubectl --namespace=jhub get pod

and watching for both those pods to be in status 'Ready'.

You can find the public IP of the JupyterHub by doing:

  kubectl --namespace=jhub get svc proxy-public

It might take a few minutes for it to appear!

Note that this is still an alpha release! If you have questions, feel free to
1. Read the guide at https://z2jh.jupyter.org
2. Chat with us at https://gitter.im/jupyterhub/jupyterhub
3. File issues at https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues
  • Open JupyterHub by grabbing the proxy-public service URL:
minikube service list -p minikube-v1.14.1
  • Log into JupyterHub with any username and password
  • Once you are logged into JupyterHub, import the pyspark-test.ipynb notebook:
wget https://raw.githubusercontent.com/ksingh7/analyzing_ufo_dataset_with_spark/master/pyspark-test.ipynb
  • Run the notebook; if it completes successfully, PySpark is running properly.

At this point, JupyterHub is set up correctly on K8s and is ready to be used with Spark (PySpark).

Step 4: Import and run the UFO analysis notebook

In the next steps, we will import a new Jupyter notebook that is pre-configured with a PySpark context, pull the UFO dataset from my GitHub account, ingest it into the Ceph Object Storage cluster, and perform the analysis using PySpark.

  • Download the UFO analysis notebook and upload it to JupyterHub. You can run each cell of the notebook to perform the analysis.
wget https://raw.githubusercontent.com/ksingh7/analyzing_ufo_dataset_with_spark/master/UFO_Data_analysis_Spark_Ceph.ipnb
  • Download the dataset and install the boto and plotly Python packages.
  • Using the boto S3 library, set up the Ceph RGW user's access_key, secret_key, and the Ceph Object Storage endpoint details.
  • After running this notebook cell, Python connects to the Ceph S3 endpoint and lists all the buckets owned by this user; this is a quick functional test to verify connectivity to the Ceph cluster.
  • Two steps back we downloaded the UFO dataset; now it's time to upload it to Ceph object storage, so that Spark can read the dataset into data frames and make it available in-memory for analysis.
  • Create a Spark context by creating a Spark session (a rough sketch of these notebook cells follows this list).
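Here is a minimal sketch of what these cells look like. The endpoint hostname, credentials, bucket name, and file name below are placeholders, not values from the original notebook; substitute your own Ceph RGW details.

import boto
import boto.s3.connection
from pyspark.sql import SparkSession

# Placeholder Ceph RGW credentials and endpoint (replace with your own values)
access_key = "RGW_ACCESS_KEY"
secret_key = "RGW_SECRET_KEY"
rgw_host = "ceph-rgw.example.com"  # hypothetical Ceph Object Gateway hostname

# Connect to the Ceph S3 endpoint and list all buckets owned by this user (functional test)
conn = boto.connect_s3(
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    host=rgw_host,
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)
for bucket in conn.get_all_buckets():
    print(bucket.name)

# Upload the downloaded UFO dataset to a bucket
bucket = conn.create_bucket("ufo-dataset")           # hypothetical bucket name
key = bucket.new_key("ufo_scrubbed.csv")             # hypothetical object name
key.set_contents_from_filename("ufo_scrubbed.csv")   # local file downloaded earlier

# Create a Spark session (which gives us a Spark context)
spark = SparkSession.builder.appName("ufo-analysis").getOrCreate()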

The often-asked question is: "How can analytics tools like Spark connect to and consume datasets directly from remote object storage systems?"

The answer lies in three magical letters: "S3A".

Hadoop's "S3A" client offers high-performance IO against the Amazon S3 object store and compatible implementations, and Red Hat Ceph Storage is one of them. Hadoop Common supports the S3A filesystem out of the box; it's just a matter of configuring keys and endpoints. (Read more about S3A)
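For example, the notebook can point Spark's S3A connector at the Ceph RGW endpoint through the Hadoop configuration. A sketch, reusing the placeholder endpoint and keys from above:

# Point Hadoop's S3A filesystem at the Ceph RGW endpoint (placeholder values)
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", "http://ceph-rgw.example.com:8080")
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)
hadoop_conf.set("fs.s3a.path.style.access", "true")        # Ceph RGW typically uses path-style URLs
hadoop_conf.set("fs.s3a.connection.ssl.enabled", "false")  # plain HTTP in this test setup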

  • Load the dataset directly from Ceph Object Storage into a Spark DataFrame (see the sketch after this list)
  • Show me the schema, baby!
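A sketch of those two cells, assuming the placeholder bucket and object names used in the upload sketch above:

# Read the UFO CSV straight from Ceph Object Storage via the s3a:// scheme
df = spark.read.csv("s3a://ufo-dataset/ufo_scrubbed.csv", header=True, inferSchema=True)

# Show me the schema
df.printSchema()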

Summary 1: As per the results, Americans reported the most UFO sightings.

Well … I don't know; they must be sitting next to the window forever counting UFOs … or it's the Hollywood mindset of sci-fi movies ;)

Key takeaway: Next time you visit the US … don't do anything, just stare at the sky; you might get lucky ;) (a recommendation based on the data)

  • I drilled down and found that Seattle reported the most UFO sightings (see the aggregation sketch below).
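The drill-down behind these findings is a simple aggregation. A sketch, assuming the dataset exposes country and city columns (the column names are an assumption about the Kaggle dataset, not shown in this post):

# Top countries by number of reported sightings
df.groupBy("country").count().orderBy("count", ascending=False).show(10)

# Drill down: top cities by number of reported sightings
df.groupBy("city").count().orderBy("count", ascending=False).show(10)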

Is it because Seattle has offices in skyscrapers with large glass walls?

So, a UFO looks like a light; well, we know this from the movies, and now the data confirms the belief.

Fun of the day: make sure to read all the x-axis labels; my favourite one is "cigar" ;)
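The shape chart can be produced along these lines; a sketch using plotly's offline mode, assuming a shape column in the dataset:

import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

# Count sightings per reported UFO shape and bring the small result back to the driver
shape_counts = df.groupBy("shape").count().orderBy("count", ascending=False).toPandas()

# Render a bar chart inside the notebook (the x-axis labels are the reported shapes)
init_notebook_mode(connected=True)
iplot(go.Figure(data=[go.Bar(x=shape_counts["shape"], y=shape_counts["count"])]))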

  • The last one is kind of expected.

Summary: The idea was to play around with multiple technologies and configure each of them the hard way (learning by doing).

We used K8s (minikube), Ceph Storage (rook-ceph), OpenShift, Jupyter Notebooks (JupyterHub), Spark (PySpark), Kaggle (the UFO dataset), and time.

I hope this was a fun exercise!

Written by Karan Singh

Co-Founder & CTO @ Scogo AI ♦ I Love to solve problems using Tech