Streaming query processing with Apache Kafka and Apache Spark (Python)

Introduction

This application is intended as an example of how to process data from Apache Kafka with Apache Spark on OpenShift.

Graf Zahl will count, as he does, the words on an Apache Kafka topic and display the top-k words on a web page for the user. There isn’t much more to him, as you might expect.

Architecture

Graf Zahl is composed of a single pod that serves both as a stream processor and as a web server, using Flask. A production-ready application would separate the processor from the web UI with an operational data store, e.g. a database or an in-memory data grid.
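To make the single-pod architecture concrete, here is a minimal sketch of one process that runs a Spark Structured Streaming word count and serves the results with Flask. The topic name words, the table name counts, and the top-k limit of 10 are illustrative assumptions, not Graf Zahl’s actual code.

# A minimal sketch, not Graf Zahl's implementation: one process hosting
# both the stream processor and the Flask web server.
from flask import Flask, jsonify
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

spark = SparkSession.builder.appName("grafzahl-sketch").getOrCreate()

# Stream processor: read records from Kafka and split them into words.
# The topic name "words" is an assumption for illustration.
words = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "kafka:9092")
         .option("subscribe", "words")
         .load()
         .selectExpr("CAST(value AS STRING)")
         .select(explode(split(col("value"), r"\s+")).alias("word")))

# Keep the running counts in an in-memory table the web layer can query.
(words.groupBy("word").count()
      .writeStream
      .outputMode("complete")
      .format("memory")
      .queryName("counts")
      .start())

# Web server: serve the current top-k words from the same process.
app = Flask(__name__)

@app.route("/")
def top_k():
    rows = spark.sql("SELECT word, `count` FROM counts "
                     "ORDER BY `count` DESC LIMIT 10").collect()
    return jsonify({r["word"]: r["count"] for r in rows})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

In a production design, the in-memory sink would be replaced by a write to the operational data store mentioned above, decoupling the web UI from the processor.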

For Graf Zahl to have anything to do, he needs some data to consume. To help him out, we also provide the following services.

  1. Apache Kafka is provided by Strimzi. Basic instructions for setting up Strimzi 0.1 are provided in this document. To experiment with the latest Strimzi version, please refer to the official documentation.

  2. A source of some data to count. For this we provide a word fountain, which generates words to the topic that Graf Zahl will consume; a minimal sketch of such a producer follows this list.
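The sketch below shows what a word fountain could look like, assuming the kafka-python client. The SERVERS environment variable matches the one used by the actual deployment in the Installation section below; the topic name and word list are illustrative assumptions.

# A minimal word-fountain sketch, assuming the kafka-python client.
# The topic name "words" and the word list are illustrative assumptions.
import os
import random
import time

from kafka import KafkaProducer

# The real word fountain is configured the same way: a SERVERS environment
# variable pointing at the Kafka bootstrap servers (see Installation below).
producer = KafkaProducer(bootstrap_servers=os.environ.get("SERVERS", "kafka:9092"))

while True:
    word = random.choice(["eins", "zwei", "drei", "vier", "fünf"])
    producer.send("words", word.encode("utf-8"))
    time.sleep(0.1)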

Installation

Installing and deploying Graf Zahl uses Oshinko S2I, specifically the Oshinko pyspark builder. S2I is a technology that takes a source repository with a specific layout and builds it into a container image, which is then deployed as a pod on OpenShift.

The word fountain component also uses S2I, but with the default Python builder provided by OpenShift, since it has no dependency on Apache Spark.

Apache Kafka is infrastructure for the other components rather than an application itself, so instead of being built with S2I it is deployed directly from a template and pre-built container images.

First, make sure you are connected to an OpenShift cluster and are in a project with Oshinko installed. See Get Started if you need help.

Second, load the Apache Kafka infrastructure components into your project and start them. Since the following commands initialize both the Kafka and ZooKeeper servers, you may want to wait a moment for them to become ready before proceeding to the next step.

oc create -f https://raw.githubusercontent.com/strimzi/strimzi-kafka-operator/0.1.0/kafka-inmemory/resources/openshift-template.yaml
oc new-app strimzi
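If you would rather verify readiness from code than by watching the console, the following sketch polls until a broker answers. It assumes the kafka-python client and that it runs somewhere the kafka service name resolves, e.g. from a pod in the same project.

# A minimal readiness check, assuming the kafka-python client and that this
# runs somewhere the "kafka" service name resolves (e.g. a pod in the project).
import time

from kafka import KafkaProducer
from kafka.errors import NoBrokersAvailable

while True:
    try:
        # Constructing a producer bootstraps against the brokers; success
        # means Kafka is answering on the service address.
        KafkaProducer(bootstrap_servers="kafka:9092").close()
        print("Kafka is up")
        break
    except NoBrokersAvailable:
        print("Kafka not ready yet, retrying...")
        time.sleep(5)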

Third, launch the word fountain so Graf Zahl will have something to count. The word fountain uses the SERVERS environment variable to find the Apache Kafka deployment to use; in the second step, creating strimzi also created a service named kafka on port 9092. Note: the first time this step and the next run, you’ll have to wait for the builder images to be pulled down from the internet, so if you’re on a thin pipe you may want to start both at the same time and grab a drink.

oc new-app openshift/python-27-centos7~https://github.com/radanalyticsio/word-fountain -e SERVERS=kafka:9092

Fourth, launch Graf Zahl himself, using the Oshinko pyspark S2I builder.

oc new-app --template=oshinko-python-spark-build-dc \
           -p APPLICATION_NAME=grafzahl \
           -p GIT_URI=https://github.com/radanalyticsio/grafzahl \
           -p APP_ARGS=--servers=kafka:9092 \
           -p SPARK_OPTIONS='--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0'
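The SPARK_OPTIONS value pulls in the spark-sql-kafka package, which provides the Kafka source for Structured Streaming. The APP_ARGS value is handed to the application’s command line; the sketch below shows how an application might consume such a --servers option. It is an illustrative assumption, not Graf Zahl’s actual option handling.

# A sketch of consuming the --servers argument passed through APP_ARGS;
# an illustrative assumption, not Graf Zahl's actual CLI code.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--servers", default="kafka:9092",
                    help="Kafka bootstrap servers as host:port")
args = parser.parse_args()

# The value is then handed to the Structured Streaming Kafka source, e.g.
# .option("kafka.bootstrap.servers", args.servers)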

Finally, expose Graf Zahl’s web UI so you can connect to it with a browser.

oc expose svc/grafzahl

Usage

Once Graf Zahl is installed, running, and exposed, navigate to his web UI via the OpenShift Console.

Expansion

You can fork this application as a starting point for your own stream processing application with Kafka.
