Streaming query processing with Apache Kafka and Apache Spark (Java)

Introduction

This application is an example of how to process data from Apache Kafka with Apache Spark on OpenShift. It is based on the Graf Zahl tutorial, with the primary difference being the choice of language: this version is written in Java.

jGraf Zahl will count, as she does, the words on an Apache Kafka topic and display the top-k on a web page for the user. There isn’t much more to her, as you might expect.

Architecture

jGraf Zahl is composed of a single pod that serves as both stream processor and web server, using Spark Framework (not to be confused with Apache Spark). A production-ready application would separate the processor from the web UI with an operational data store, e.g. a database or an in-memory data grid.
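
To make this concrete, here is a minimal sketch of the single-pod pattern, assuming Spark Structured Streaming with the Kafka source for the processing half and Spark Framework for the web half. The class, endpoint, and query names here are illustrative, not the actual jgrafzahl source; see the repository for the real implementation.

import static spark.Spark.get;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class App {
    public static void main(String[] args) throws Exception {
        String servers = args[0]; // e.g. kafka:9092
        String topic = args[1];   // e.g. word-fountain

        SparkSession session = SparkSession.builder().appName("jgrafzahl").getOrCreate();

        // Stream processor: treat the Kafka topic as an unbounded table of words
        // and maintain a running count per word.
        Dataset<Row> counts = session.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", servers)
                .option("subscribe", topic)
                .load()
                .selectExpr("CAST(value AS STRING) AS word")
                .groupBy("word")
                .count();

        // Keep the running counts in an in-memory sink the web handler can query.
        counts.writeStream()
                .outputMode("complete")
                .format("memory")
                .queryName("counts")
                .start();

        // Web server: Spark Framework serves the current top-k as JSON.
        get("/api/counts", (req, res) -> {
            res.type("application/json");
            return session.sql(
                    "SELECT word, `count` FROM counts ORDER BY `count` DESC LIMIT 10")
                    .toJSON().collectAsList().toString();
        });
    }
}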

For jGraf Zahl to have anything to do, she needs some data to consume. To help her out, we also provide the following services:

  1. Apache Kafka is provided by Strimzi. Basic instructions for setting up Strimzi 0.1 are included in this document; to experiment with the latest Strimzi version, please refer to the official documentation.

  2. A source of data to count. For this we provide a word fountain, which generates words to the topic that jGraf Zahl will consume. (A sketch of the idea follows this list.)
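
The actual word fountain is a small Python service, but the idea fits in a few lines. Here is a hypothetical equivalent in Java using the standard kafka-clients producer API; the word list is illustrative, and the topic name matches the one jGraf Zahl consumes.

import java.util.Properties;
import java.util.Random;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class WordFountain {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        // SERVERS is the same environment variable the real fountain reads, e.g. kafka:9092
        props.put("bootstrap.servers", System.getenv("SERVERS"));
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        String[] words = {"apple", "banana", "cherry", "date"};
        Random rng = new Random();
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                // One word per record; the consumer counts record values as words.
                producer.send(new ProducerRecord<>("word-fountain", words[rng.nextInt(words.length)]));
                Thread.sleep(100); // throttle the fountain
            }
        }
    }
}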

Installation

Installing and deploying jGraf Zahl uses Oshinko S2I, specifically the Oshinko Java builder. S2I (Source-to-Image) is a technology that takes a source repository with a specific layout and builds it into a container image, which is then deployed as a pod on OpenShift.

The word fountain component also uses S2I, but with the default Python builder provided by OpenShift, because it has no dependency on Apache Spark.

Apache Kafka is infrastructure for the other components rather than an application itself, so instead of using S2I, it is deployed directly from a template and pre-built container images.

First, make sure you are connected to an OpenShift cluster and are in a project with Oshinko installed. See Get Started if you need help.

Second, load the Apache Kafka infrastructure components into your project and start them. Since the following commands initialize both the Kafka and ZooKeeper servers, you may want to wait a moment before proceeding to the next step.

oc create -f https://raw.githubusercontent.com/strimzi/strimzi-kafka-operator/0.1.0/kafka-inmemory/resources/openshift-template.yaml
oc new-app strimzi
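
The commands return before the servers are ready, so you can watch the pods and wait until the Kafka and ZooKeeper pods report Running and ready (the exact pod names will vary):

oc get pods -w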

Third, launch the word fountain so jGraf Zahl will have something to count. The word fountain uses the SERVERS environment variable to find the Apache Kafka deployment. In the second step, when you created strimzi, you also created a service named kafka on port 9092. Note: the first time this step and the next run, you’ll have to wait for the builder images to be pulled down from the internet, so if you’re on a thin pipe you may want to start both at the same time and grab a drink.

oc new-app openshift/python-27-centos7~https://github.com/mattf/word-fountain -e SERVERS=kafka:9092
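
To confirm the fountain is producing once its build completes, you can follow its logs. The deployment name below assumes the default, which is derived from the repository name:

oc logs -f dc/word-fountain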

Fourth, launch jGraf Zahl herself, using the Oshinko java S2I builder.

oc new-app --template=oshinko-java-spark-build-dc \
           -p APPLICATION_NAME=jgrafzahl \
           -p GIT_URI=https://github.com/radanalyticsio/jgrafzahl \
           -p APP_MAIN_CLASS=io.radanalytics.jgrafzahl.App \
           -p APP_ARGS='kafka:9092 word-fountain' \
           -p SPARK_OPTIONS='--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0,com.sparkjava:spark-core:2.7.2,org.glassfish:javax.json:1.0.4  --conf spark.jars.ivy=/tmp/.ivy2'
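
The S2I build may take a few minutes. If you want to follow its progress, the build config name matches APPLICATION_NAME:

oc logs -f bc/jgrafzahl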

Finally, expose jGraf Zahl’s web UI so you can connect to it with a browser.

oc expose svc/jgrafzahl
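
The hostname assigned to the route can also be retrieved from the command line:

oc get route jgrafzahl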

Usage

Once the application is installed, running, and exposed, navigate to the jGraf Zahl web UI via the OpenShift Console.

Expansion

You can fork this application as a starting point for your own stream processing application with Kafka.

Videos

jGraf Zahl deployment and usage demonstration