Apache Spark Cluster with Kubernetes and Docker - Part 1

Dec 9, 2015 · apache spark docker kubernetes

If you’re into Data Science and Analytics, you’ve probably heard of Apache Spark. I’ve been playing with this framework on and off for a while and completed two very interesting MOOCs on Spark; recently I decided

First things first

Click this link and follow the instructions in the section “Before you begin”: create a Google account, enable billing [1], install the gcloud command line interface (CLI) and install Docker locally. kubectl should also be installed:

gcloud components update kubectl

As a second step, create a new project in the Google Developers Console, give it an appropriate name (e.g. project-spark-k8s),

Create New Project -1

if necessary, assign a unique ProjectID [2]

Create New Project - 2

and set this project as default using the CLI:

gcloud config set project "project-spark-k8s-vv2"

Then go again to the Before you begin link and enable the Google Container and Google Compute Engine APIs:

Enable Container API

Finally create a local folder where you’ll work:

mkdir -p KubernetesSpark/examples/spark

and download the following files in it:

cd KubernetesSpark/examples/spark
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/v1.2.0-alpha.5/examples/spark/spark-master-controller.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/v1.2.0-alpha.5/examples/spark/spark-master-service.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/v1.2.0-alpha.5/examples/spark/spark-webui.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/v1.2.0-alpha.5/examples/spark/spark-worker-controller.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/v1.2.0-alpha.5/examples/spark/zeppelin-controller.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/v1.2.0-alpha.5/examples/spark/zeppelin-service.yaml
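The same downloads can also be written as a loop, which makes it easier to switch to another tag later. This is only a minimal sketch: it assumes the six distinct file names listed above and the same v1.2.0-alpha.5 tag, and the names BASE_URL and spark_example_urls are mine, not part of the Kubernetes example.

```shell
# Base URL of the Spark example at the tag used in this post.
BASE_URL="https://raw.githubusercontent.com/kubernetes/kubernetes/v1.2.0-alpha.5/examples/spark"

# Hypothetical helper: print one download URL per manifest file.
spark_example_urls() {
  for f in spark-master-controller spark-master-service spark-webui \
           spark-worker-controller zeppelin-controller zeppelin-service; do
    echo "$BASE_URL/$f.yaml"
  done
}
```

To actually fetch the files, the list can be piped into curl, for example with spark_example_urls | xargs -n 1 curl -O.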

Create a cluster

To create a cluster, run the following command, replacing the value of --project with your own ProjectID:

gcloud container --project "project-spark-k8s" clusters create spark-cluster --num-nodes 4 --zone europe-west1-c --machine-type n1-standard-1 --scope "https://www.googleapis.com/auth/compute"

As you can see, the cluster, named “spark-cluster”, consists of four n1-standard-1 nodes. To check the names of the four instances and get basic information about them (e.g. internal IP, external IP), type:

gcloud compute instances list

Start Master Service

NOTE: This documentation is heavily based on https://github.com/kubernetes/kubernetes/tree/master/examples/spark.

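First create a replication controller running the Spark Master service. The kubectl create commands in the rest of this post use paths that start with examples/spark/, so run them from the KubernetesSpark folder: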

kubectl create -f examples/spark/spark-master-controller.yaml

To find out details about the pod started by the replication controller, first find its exact name:

kubectl get pods

This command returns something like this:

NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-xyz12   1/1       Running   0          1h

Then run:

kubectl describe pods spark-master-controller-xyz12

which returns information about the image, node, labels, etc. To check the log of the service, run:

kubectl logs spark-master-controller-xyz12

Here is a sample of the log output:

Master Controller Logs
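Since the suffix of the pod name (xyz12 above) is generated, it changes every time the pod is recreated. Instead of copying it by hand, the name can be pulled out of the kubectl get pods output. A minimal sketch, assuming the column layout shown above (header line first, pod name in the first column); pod_name is a hypothetical helper of mine, not a kubectl feature.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print the
# name of the first pod whose name starts with the given prefix.
pod_name() {
  # NR > 1 skips the header line; index(...) == 1 means "starts with prefix".
  awk -v prefix="$1" 'NR > 1 && index($1, prefix) == 1 { print $1; exit }'
}
```

With this in place, something like MASTER_POD=$(kubectl get pods | pod_name spark-master-controller) followed by kubectl logs "$MASTER_POD" should show the same log without typing the suffix.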

As a second step after the creation of the replication controller, create a logical service endpoint that Spark workers can use to access the Master pod:

kubectl create -f examples/spark/spark-master-service.yaml

and then create a service for the Spark Master WebUI:

kubectl create -f examples/spark/spark-webui.yaml

To get basic information about the running services, run:

kubectl get services

Get Services

For more detailed information about each service, run the corresponding describe command, e.g.

kubectl describe service spark-master

To connect to the Spark WebUI, use the cluster proxy:

kubectl proxy --port=8001

at which point the UI is available at: http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/.

WebUI Init
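The proxy URL above follows a fixed pattern: local port, namespace, then service name. A small sketch that builds it for any service, assuming the URL layout shown in this post (other Kubernetes versions may use a different path); proxy_url is a hypothetical helper name.

```shell
# Hypothetical helper: build the kubectl-proxy URL of a service.
# Arguments: service name, local proxy port (default 8001),
# namespace (default "default").
proxy_url() {
  echo "http://localhost:${2:-8001}/api/v1/proxy/namespaces/${3:-default}/services/$1/"
}
```

For example, proxy_url spark-webui prints the address used above.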

The proxy blocks the terminal while it runs. To keep using the command line, press Ctrl-Z to suspend the process and then type bg to resume it in the background. This way the proxy keeps running while other commands are run in the foreground.
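An alternative to Ctrl-Z and bg is to start the command in the background right away with &. The shell keeps the PID of the last background job in $!, which makes it easy to stop the process later. A minimal sketch; start_bg is a hypothetical helper, and discarding the command's output is my own choice here.

```shell
# Hypothetical helper: run a command in the background, discard its output,
# and print its PID so that it can be stopped later with kill.
start_bg() {
  "$@" >/dev/null 2>&1 &
  echo "$!"
}
```

Usage would look like PROXY_PID=$(start_bg kubectl proxy --port=8001), and later kill "$PROXY_PID" to shut the proxy down.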

Start the Spark Workers

NOTE: The Spark workers need the Master service to be running.

First create a replication controller that manages the worker pods:

kubectl create -f examples/spark/spark-worker-controller.yaml

To see that the workers are running, type again:

kubectl get pods

The output now contains not only the spark-master-controller pod, but also the spark-worker-controller pods. To check the logs, run again:

kubectl logs spark-master-controller-xyz12

in which case the output looks like this:

Master Controller Logs

The WebUI now looks like this:

Image of WebUI

Start the Zeppelin UI

The Zeppelin UI pod will be used to launch jobs into the Spark cluster. To start it, run:

kubectl create -f examples/spark/zeppelin-controller.yaml

To check whether Zeppelin is running:

kubectl get pods -lcomponent=zeppelin

The output of this last command contains the name of the Zeppelin pod, e.g. zeppelin-controller-xyz12. To port-forward the Zeppelin port, run the following command:

kubectl port-forward zeppelin-controller-xyz12 8080:8080

Zeppelin can now be found at http://localhost:8080:

Zeppelin Initial Screen

As noted in https://github.com/kubernetes/kubernetes/tree/master/examples/spark “On GKE, kubectl port-forward may not be stable over long periods of time. If you see Zeppelin go into Disconnected state (there will be a red dot on the top right as well), the port-forward probably failed and needs to be restarted.”
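Since the port-forward may need to be restarted, one option is to wrap it in a loop that starts it again whenever it exits. This is a rough sketch under my own assumptions: rerun is a hypothetical helper, and the run limit and pause are arbitrary safety values, not something the Kubernetes example prescribes.

```shell
# Hypothetical helper: rerun a command each time it exits, up to a maximum
# number of runs, pausing between runs.
# Arguments: max runs, pause in seconds, then the command and its arguments.
rerun() {
  max="$1"; pause="$2"; shift 2
  runs=0
  while [ "$runs" -lt "$max" ]; do
    "$@"
    runs=$((runs + 1))
    echo "command exited (run $runs of $max)" >&2
    sleep "$pause"
  done
}
```

For example, rerun 100 2 kubectl port-forward zeppelin-controller-xyz12 8080:8080 would restart the forward up to 100 times with a two-second pause. Note that this only helps when the port-forward process actually exits; a connection that hangs silently still has to be stopped by hand.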

To access the notebook externally, the service should be exposed:

kubectl expose rc zeppelin-controller --type="LoadBalancer"

Run the following command to find the external IP:

kubectl get services zeppelin-controller

ExternalIP

That means that Zeppelin can also be found at http://104.155.67.27:8080 [3]

In the next part of this tutorial, we’ll create a GCE persistent disk to persistently store the Zeppelin notebooks we create in this cluster. This requires the creation of a new docker image for Zeppelin and a new yaml file for the Zeppelin controller.

For now, enjoy!


  1. The first $300 are on Google.
  2. Note that the ProjectID is a globally unique identifier.
  3. Important: Note that everyone with this IP address has access to this notebook!