Apache Spark Cluster with Kubernetes and Docker - Part 2

Feb 1, 2016 00:00 · 582 words · 3 minutes read apache spark docker kubernetes zeppelin

In the first part of this tutorial, we deployed a Spark Cluster on Google Compute Engine (GCE) and exposed a Zeppelin notebook service in order to provide access to the cluster and launch jobs via a notebook frontend.

The problem with this configuration is that the notebooks are stored locally in the machine we create (specifically under “${ZEPPELIN_HOME}/notebook”) and are lost the moment we shut the service down. In this second part, we create a GCE persistent disk to persistently store the Zeppelin notebooks. This requires the creation of a new docker image for Zeppelin and a new yaml file for the Zeppelin controller.

Before we begin

Let’s assume that all steps described in Part 1 have been completed. To deploy the new Zeppelin service, we first have to delete the existing one:

kubectl delete services zeppelin-controller
kubectl delete rc zeppelin-controller

Create a GCE persistent disk

Creating a persistent disk is a one-liner:

gcloud compute disks create spark-disk --size 20GB

To get information about this disk run:

gcloud compute disks describe spark-disk

Create a new docker image for Zeppelin

From the kubernetes repository download the following files:

cd KubernetesSpark/examples/spark 
mkdir -p images/zeppelin
cd images/zeppelin
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/v1.2.0-alpha.5/examples/spark/images/zeppelin/Dockerfile
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/v1.2.0-alpha.5/examples/spark/images/zeppelin/docker-zeppelin.sh
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/v1.2.0-alpha.5/examples/spark/images/zeppelin/zeppelin-env.sh
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/v1.2.0-alpha.5/examples/spark/images/zeppelin/zeppelin-log4j.properties

In the Dockerfile change the spark-base version from latest to _1.5.1v2.

#FROM gcr.io/google_containers/spark-base:latest  # before
FROM gcr.io/google_containers/spark-base:1.5.1_v2 # after

Then in the zeppelin-env.sh file modify the environment variable that defines where the Zeppelin notebooks are stored, e.g. export ZEPPELIN_NOTEBOOK_DIR=“/pd/notebook”.

Zeppelin Env Var

Build the new image and tag it appropriately (replace notiv by another username):

docker build -t notiv/zeppelin:1.5.1_v2 .
docker tag notiv/zeppelin:1.5.1_v2 eu.gcr.io/project-spark-k8s-vv2/zeppelin:1.5.1_v2

Note that the Zeppelin image version (1.5.1_v2) should match the versions of the Spark images.

Then upload the image to your registry (this can take a while, as the image is pretty big):

gcloud docker push eu.gcr.io/project-spark-k8s-vv2/zeppelin:1.5.1_v2

Update the yaml file

Now we modify the yaml file for the Zeppelin controller, so that a) it corresponds to the new docker image and b) the persistent disk we created is provisioned when we start the controller.

Zeppelin Yaml

In this gist you can find the code¹.

Start the Zeppelin UI

Now we can start the Zeppelin service, as we did in Part 1. Run:

kubectl create -f examples/spark/zeppelin-controller.yaml

To check whether Zeppelin is running:

kubectl get pods -lcomponent=zeppelin

The output of this last command contains the name of the Zeppelin pod, e.g. zeppelin-controller-xyz12. To port-forward the Zeppelin port run the following command:

kubectl port-forward zeppelin-controller-xyz12 8080:8080

Zeppelin can now be found at http://localhost:8080:

Zeppelin Initial Screen

To access the notebook externally, the service should be exposed:

kubectl expose rc zeppelin-controller --type="LoadBalancer"

Run the following command to find the external IP:

kubectl get services zeppelin-controller

ExternalIP

That means that Zeppelin can also be found at http://104.155.67.27:8080²

Final note

After setting up the cluster, the next thing I did, was to upload a dataset to play with. Here’s the process.

First, make sure you have gsutil installed³. If this is not the case, follow the instructions here. Then create a bucket with a unique name:

gsutil mb -c nearline -l europe-west1 -p project-spark-k8s-vv2 gs://notiv-spark-bucket

To upload a file (e.g. ~/linkage.csv) run the following command:

gsutil cp ~/linkage.csv gs://notiv-spark-bucket/data/linkage/linkage.csv

and to access the file in the Zeppelin notebook:

val linkage = sc.textFile("gs://notiv-spark-bucket/data/linkage/linkage.csv")

Don’t forget to change the ProjectID when you define the image. ^[return]
Important: As noted in the previous post, everyone with this IP address has access to this notebook! ^[return]
If you installed the Google Cloud SDK, then gsutil is already installed. ^[return]