2018

Why Data Scientists Love Kubernetes

Sophie Watson & William Benton

Building an Implicit Recommendation Engine with Spark

Sophie Watson

Extending Structured Streaming Made Easy with Algebra

Erik Erlandson

Apache Spark for Library Developers (Deep Dive Part 2)

Erik Erlandson & William Benton

Apache Spark for Library Developers (Deep Dive Part 1)

Erik Erlandson & William Benton

Spark+AI Summit EU • London, England • October 2018

This is part 1 of a 2-session deep dive, which covers:

Basic considerations for reusable Spark code
Generic functions for parallel collections

As a developer, data engineer, or data scientist, you’ve seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you’re solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark.

You faced a learning curve when you first started using Spark, and you’ll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you’ll need to turn your code into a library that you can share with the world. We’ll cover:

Issues to consider when developing parallel algorithms with Spark
Designing generic, robust functions that operate on data frames and datasets
Extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs)
Best practices around caching and broadcasting, and why these are especially important for library developers
Integrating with ML pipelines
Exposing key functionality in both Python and Scala
How to test, build, and publish your library for the community

We’ll back up our advice with concrete examples from real packages built atop Spark. You’ll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.

Presentation media: Slide deck

From Research to Production: What they didn’t teach you in Grad School

Sophie Watson

Building Streaming Recommendation Engines on Spark

Rui Vieira

Apache Spark from notebook to cloud native application

Rebecca Simmonds

Intelligent applications on OpenShift from prototype to production

Rebecca Simmonds and Michael McCune

Pythonic Apache Spark app patterns for the cloud

Michael McCune

Probabilistic Structures for Scalable Computing

William Benton

Collaborative Filtering Microservices on Spark

Rui Vieira, Sophie Watson

2017

Containerizing TensorFlow Applications on OpenShift

Subin Modeel

One-Pass Data Science in Apache Spark with Generative T-Digests

Erik Erlandson

Fire in the Sky: An Introduction to Monitoring Apache Spark in the Cloud

Michael McCune

Building Machine Learning Algorithms on Apache Spark

William Benton

Analyzing Blockchain transaction graph with Spark

Jirka Kremser

From notebooks to cloud native: a modern path for data driven applications

Michael McCune

The Revolution Will Be Containerized • Architecting the Intelligent Applications of Tomorrow

William Benton

Smart Scalable Feature Reduction With Random Forests

Erik Erlandson

Converging insightful, data-led applications with traditional web applications

Michael McCune, Steve Pousty

Sketching Data with T-Digest In Apache Spark

Erik Erlandson

Optimizing Spark Deployments for Containers: Isolation, Safety, and Performance

William Benton

Teaching Apache Spark Clusters to Manage Their Workers Elastically

Erik Erlandson, Trevor McKay

Big Data In Production: Bare Metal to OpenShift

William Benton

Insightful Apps with Apache Spark and OpenShift

William Benton, Michael McCune

Building My Own Little World with Open Data

Steven Pousty

Building Cloud Native Apache Spark Applications with OpenShift

Michael McCune

2016

Building Apache Spark Application Pipelines for the Kubernetes Ecosystem

Michael McCune

Converging Big Data and Application Infrastructure

Steve Pousty

Running Apache Spark Natively on Kubernetes with OpenShift

Erik Erlandson

Containerized Spark on Kubernetes

William Benton

Big Data and Apache Spark on OpenShift Pt. II

William Benton

Big Data and Apache Spark on OpenShift Pt. I

William Benton

Analyzing Log Data With Apache Spark

William Benton

2015

Diagnosing Open-Source Community Health with Spark

William Benton

2014

Analyzing endurance-sports activity data with Spark

William Benton
