The radanalytics.io community has several ongoing projects with frequent releases. These are all collected in our GitHub organization. Each project addresses a specific concern within the OpenShift realm and provides solid solutions for your own data-driven applications.
The following presentations are about the technologies involved in, or related to, the radanalytics.io projects. We love our community and the passion they have for this technology. If you have a presentation, or know of one that would fit in here, please open a pull request and add it to the list!
In this presentation Michael will demonstrate how to create and deploy Python based Apache Spark applications to cloud native environments. We will explore design patterns to help you integrate your analytics and machine learning algorithms into applications which can take full advantage of cloud native platforms like OpenShift Origin. You will see code samples and live demonstrations of techniques for building and deploying Apache Spark applications written in Python. These samples and techniques will provide a solid basis that you can use to create your own intelligent applications for the cloud.
Collaborative filtering is a well known method to implement recommendation engines. Although modern techniques, such as Alternating Least Squares (ALS), allow us to perform rating predictions with large amounts of observations, typically ALS is implemented as a distributed batch algorithm where retraining must be performed with the entirety of the data. However, when dealing with large amounts of data as a stream, batch retraining might be problematic.
In this talk Rui will guide us in building a streaming ALS implementation using Apache Spark and based on Stochastic Gradient Descent, where training can be performed using observations as they arrive.
The advantages of real-time streaming collaborative filtering will be discussed as well as the scenarios where batch ALS might be preferable.
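The core idea of training on observations as they arrive can be sketched in plain Python, without Spark. This is a minimal illustration, not the implementation from the talk: the class name `StreamingMF`, the hyperparameters, and the toy ratings are all assumptions made for the example. Each incoming rating triggers one stochastic gradient step on the latent factors of the user and item involved.

```python
import random


class StreamingMF:
    """Matrix factorization trained one rating at a time with SGD."""

    def __init__(self, rank=2, lr=0.02, reg=0.01, seed=0):
        self.rank, self.lr, self.reg = rank, lr, reg
        self.rng = random.Random(seed)
        self.users = {}  # user id -> latent factor vector
        self.items = {}  # item id -> latent factor vector

    def _factors(self, table, key):
        # Lazily create small random factors for ids seen for the first time.
        if key not in table:
            table[key] = [self.rng.uniform(0.01, 0.1) for _ in range(self.rank)]
        return table[key]

    def predict(self, user, item):
        p = self._factors(self.users, user)
        q = self._factors(self.items, item)
        return sum(a * b for a, b in zip(p, q))

    def update(self, user, item, rating):
        """Apply one SGD step for a single (user, item, rating) observation."""
        p = self._factors(self.users, user)
        q = self._factors(self.items, item)
        err = rating - sum(a * b for a, b in zip(p, q))
        for k in range(self.rank):
            pk, qk = p[k], q[k]
            p[k] += self.lr * (err * qk - self.reg * pk)
            q[k] += self.lr * (err * pk - self.reg * qk)
        return err


def squared_error(model, ratings):
    return sum((r - model.predict(u, i)) ** 2 for u, i, r in ratings)


# Hypothetical toy stream of (user, item, rating) triples.
stream = [("u1", "a", 5.0), ("u1", "b", 3.0), ("u2", "a", 4.0), ("u2", "b", 1.0)]

model = StreamingMF()
error_before = squared_error(model, stream)
for _ in range(500):  # replay the toy stream to imitate a long-running feed
    for user, item, rating in stream:
        model.update(user, item, rating)
error_after = squared_error(model, stream)
```

Because each update touches only one user vector and one item vector, no pass over the full history is needed, which is the property that separates this approach from batch retraining.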
In this talk you’ll learn about streaming algorithms and approximate data structures to characterize data sources that are too big to keep around or difficult to replay. We’ll start simple, with an algorithm for on-line mean and variance estimates of a stream of samples. Then we’ll look at Bloom filters (for approximate set membership), count-min sketch (for approximate member count in a multiset), and HyperLogLog (for approximate set cardinality). We’ll cover implementing these algorithms, using them for data analysis (and even machine learning), and provide some intuition for why they work at scale. Come with reading knowledge of Python and leave with some cool new options in your scalable data processing toolbox!
Note that the YouTube video for this talk is audio-only; the actual talk was delivered without slides due to a projector malfunction.
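As a rough illustration of two of the techniques named in the abstract, here is a minimal Python sketch of an on-line mean and variance estimator (using Welford's algorithm) and a small Bloom filter. The class names, the filter size, and the choice of SHA-256 for hashing are assumptions made for this example, not details taken from the talk.

```python
import hashlib
import math


class RunningStats:
    """Welford's on-line algorithm for the mean and variance of a stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; divide by (n - 1) instead for the sample variance.
        return self._m2 / self.n if self.n else math.nan


class BloomFilter:
    """Approximate set membership: false positives possible, false negatives not."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


stats = RunningStats()
for sample in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(sample)  # each sample is seen once and then discarded

seen = BloomFilter()
for word in ["spark", "openshift", "kubernetes"]:
    seen.add(word)
```

For the eight samples above, the estimator ends with a mean of about 5 and a population variance of about 4, having stored only three numbers. The Bloom filter answers `"spark" in seen` from a fixed-size bit array, at the cost of occasional false positives for items that were never added.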
Rui Vieira, Sophie Watson
The Alternating Least Squares (ALS) algorithm is still deemed the industry standard in collaborative filtering. In this talk we will focus on Apache Spark’s ALS implementation and discuss the steps we took to build a distributed recommendation engine, focusing on continuous model training and model management.
We show that, by splitting the recommendation engine into microservices, we were able to reduce the system’s complexity and produce a robust collaborative filtering platform with support for continuous model training.
At the end of this talk, you should be equipped with enough tools and ideas to implement your own collaborative algorithm and avoid some common pitfalls.
Deep learning and GPUs have become hot topics in recent times. TensorFlow has become a popular open source project for deep learning applications. But how can we use OpenShift for TensorFlow application development? In this presentation you will learn how to create custom container images with TensorFlow binaries, use Project Jupyter for TensorFlow model development, and deploy those models in OpenShift. You will also learn how to use continuous integration for TensorFlow applications on OpenShift. Learn all of this through examples with MNIST handwriting recognition, application of the Inception model, neural style transfer with GPUs, and transfer learning for celebrity detection.
The T-Digest has earned a reputation as a highly efficient and versatile sketching data structure; however, its applications as a fast generative model are less appreciated. Several common algorithms from machine learning use randomization of feature columns as a building block. Column randomization is an awkward and expensive operation when performed directly, but when implemented with generative T-Digests, it can be accomplished elegantly in a single pass that also parallelizes across Spark data partitions. In this talk Erik will review the principles of T-Digest sketching, and how T-Digests can be applied as generative models. He will explain how generative T-Digests can be used to implement fast randomization of columnar data, and conclude with demonstrations of T-Digest randomization applied to Variable Importance, Random Forest Clustering and Feature Reduction. Attendees will leave this talk with an understanding of T-Digest sketching, how T-Digests can be used as generative models, and insights into applying generative T-Digests to accelerate their own data science projects.
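To make the idea of a generative sketch concrete, the following Python sketch randomizes one column by sampling from a compact summary of its distribution. It is only a stand-in: instead of a real T-Digest it uses a list of evenly spaced empirical quantiles built by sorting, and the function names and the `resolution` parameter are hypothetical. A T-Digest would play the same role while being built in a single pass and merged across partitions.

```python
import random


def build_quantile_sketch(values, resolution=100):
    """Summarize a non-empty column by `resolution + 1` empirical quantiles.

    Stand-in for a T-Digest: it sorts the whole column, which a real
    T-Digest avoids.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / resolution)] for i in range(resolution + 1)]


def sample(sketch, rng):
    """Draw one value by inverse-transform sampling from the sketched CDF."""
    position = rng.random() * (len(sketch) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sketch) - 1)
    fraction = position - lower
    # Interpolate linearly between two neighbouring quantiles.
    return sketch[lower] + fraction * (sketch[upper] - sketch[lower])


def randomize_column(rows, column, rng):
    """Return copies of `rows` with `column` replaced by generated values."""
    sketch = build_quantile_sketch([row[column] for row in rows])
    return [{**row, column: sample(sketch, rng)} for row in rows]


rng = random.Random(42)
rows = [{"x": float(i), "label": i % 2} for i in range(50)]
randomized = randomize_column(rows, "x", rng)
```

The randomized column keeps roughly the same marginal distribution as the original but loses its relationship to the other columns, which is the property that techniques such as variable importance rely on.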
Writing intelligent cloud native applications is hard enough when things go well, but what happens when there are performance and debugging issues that arise during production? Inspecting the logs is a good start, but what if the logs don’t show the whole picture? Now you have to go deeper, examining the live performance metrics that are generated by Spark, or even deploying specialized microservices to monitor and act upon that data. Spark provides several built-in sinks for exposing metrics data about the internal state of its executors and drivers, but getting at that information when your cluster is in the cloud can be a time-consuming and arduous process. In this presentation, Michael McCune will walk through the options available for gaining access to the metrics data even when a Spark cluster lives in a cloud native containerized environment. Attendees will see demonstrations of techniques that will help them to integrate a full-fledged metrics story into their deployments. Michael will also discuss the pain points and challenges around publishing this data outside of the cloud and explain how to overcome them. In this talk you will learn about deploying metrics sinks as microservices, common configuration options, and accessing metrics data through a variety of mechanisms.
There are many reasons why you might want to implement your own machine learning algorithms on Spark: you might want to experiment with a new idea, try to reproduce results from a recent research paper, or simply use an existing technique that isn’t implemented in MLlib. In this talk, we’ll walk through the process of developing a new machine learning model for Spark. We’ll start with the basics, by considering how we’d design a parallel implementation of a particular unsupervised learning technique. The bulk of the talk will focus on the details you need to know to turn an algorithm design into an efficient parallel implementation on Spark: we’ll start by reviewing a simple RDD-based implementation, show some improvements, point out some pitfalls to avoid, and iteratively extend our implementation to support contemporary Spark features like ML Pipelines and structured query processing. You’ll leave this talk with everything you need to build a new machine learning technique that runs on Spark.
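The general shape of such a parallel design can be sketched without Spark. The abstract does not name the unsupervised technique, so this example assumes k-means purely as a stand-in, and every function name here is hypothetical. Each partition computes a partial result independently, and an associative merge combines the partial results, which mirrors the map and reduce steps an RDD-based implementation would use.

```python
from functools import reduce


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])),
    )


def partial_sums(partition, centroids):
    """Map step: per-partition coordinate sums and counts for each centroid."""
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        j = nearest(point, centroids)
        counts[j] += 1
        for d, value in enumerate(point):
            sums[j][d] += value
    return sums, counts


def merge(left, right):
    """Reduce step: associative, commutative merge of two partial results."""
    (sums_l, counts_l), (sums_r, counts_r) = left, right
    sums = [[a + b for a, b in zip(sl, sr)] for sl, sr in zip(sums_l, sums_r)]
    counts = [a + b for a, b in zip(counts_l, counts_r)]
    return sums, counts


def kmeans_step(partitions, centroids):
    """One iteration: map over partitions, reduce, then recompute centroids."""
    sums, counts = reduce(merge, (partial_sums(p, centroids) for p in partitions))
    return [
        [s / n for s in total] if n else old  # keep empty clusters in place
        for total, n, old in zip(sums, counts, centroids)
    ]


# Two hypothetical partitions of one-dimensional points.
partitions = [[[0.0], [10.0]], [[1.0], [11.0]]]
new_centroids = kmeans_step(partitions, [[0.0], [10.0]])
```

Because `merge` is associative and commutative, the partial results can be combined in any order, which is what lets a framework distribute the work across executors.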
Cryptocurrencies attract various groups of people, among them investors, people from retail, tech enthusiasts, and crypto-anarchists. We are going to focus only on the raw technology behind the blockchain, leaving aside all the ideology and hype that comes with Bitcoin.
In this presentation we will show how graph data can be processed in Spark. Blockchain binary data is transformed into a large graph of transactions so that we can work with the graph from Spark using the GraphX and GraphFrames libraries. The demo shows two notebooks with multiple examples of calculating interesting features of the transaction graph.
The GraphX-based notebook uses spark-notebook as the notebook technology, while the second one uses GraphFrames and a Jupyter notebook. The second notebook also connects to an existing Spark cluster that was created by the Oshinko tools.
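To give a flavor of what a feature of the transaction graph might look like, here is a small plain-Python sketch that computes degrees and received totals from an edge list. It assumes a simplified model in which vertices are addresses and each edge is a transfer; the addresses, amounts, and function name are invented for illustration, and the notebooks in the demo use GraphX and GraphFrames rather than this code.

```python
from collections import Counter

# Hypothetical edges: (sending address, receiving address, amount).
transactions = [
    ("addr_a", "addr_b", 1.5),
    ("addr_a", "addr_c", 0.5),
    ("addr_b", "addr_c", 1.0),
    ("addr_d", "addr_c", 2.0),
]


def degree_features(edges):
    """Count outgoing and incoming transfers and total amount received."""
    out_degree = Counter(src for src, _dst, _amount in edges)
    in_degree = Counter(dst for _src, dst, _amount in edges)
    received = Counter()
    for _src, dst, amount in edges:
        received[dst] += amount
    return out_degree, in_degree, received


out_degree, in_degree, received = degree_features(transactions)
busiest_receiver = in_degree.most_common(1)[0][0]
```

On a real blockchain the edge list is far too large for a single process, which is where a distributed graph library becomes useful.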
The world of application development and deployment is changing rapidly with the advent of container-based orchestration platforms. Adjusting to these changes takes an open mind and a willingness to explore new techniques and methodologies. Notebook interfaces like Apache Zeppelin and Project Jupyter are excellent starting points for sketching out ideas and exploring data-driven algorithms, but where does the process lead after the notebook work has been completed? Combining the power and flexibility of notebooks with that of containers presents new opportunities to increase your productivity, such as creating processing clusters on demand, increasing repeatability, and using continuous delivery techniques.
Michael McCune explains how to use notebook interfaces to create insightful data-driven demonstrations, which can then be ported directly into cloud-native applications, as he walks you through evolving an Apache Spark financial services application from a notebook to a microservice to a packaged container before finally deploying it through continuous delivery to a Kubernetes-backed platform. Along the way, Michael discusses the benefits and challenges that exist when migrating Apache Spark-based applications into containerized orchestration platforms.
Linux containers are increasingly popular with application developers: they offer improved elasticity, fault-tolerance, and portability between different public and private clouds, along with an unbeatable development workflow. It’s hard to imagine a technology that has had more impact on application developers in the last decade than containers, with the possible exception of ubiquitous analytics. Indeed, analytics is no longer a separate workload that occasionally generates reports on things that happened yesterday; instead, it pulses beneath the rhythms of contemporary business and supports today’s most interesting and vital applications. Since applications depend on analytic capabilities, it makes good sense to deploy our data-processing frameworks alongside our applications.
In this talk, you’ll learn from our expertise deploying Apache Spark and other data-processing frameworks in Linux containers on Kubernetes. We’ll explain what containers are and why you should care about them. We’ll cover the benefits of containerizing applications, architectures for analytic applications that make sense in containers, and how to handle external data sources. You’ll also get practical advice on how to ensure security and isolation, how to achieve high performance, and how to sidestep and negotiate potential challenges. Throughout the talk, we’ll refer back to concrete lessons we’ve learned about containerized analytic jobs ranging from interactive notebooks to production applications. You’ll leave inspired and enabled to deploy high-performance analytic applications without giving up the security you need or the developer-friendly workflow you want.
Modern datacenters and IoT networks generate a wide variety of telemetry that makes excellent fodder for machine learning algorithms. Combined with feature extraction and expansion techniques such as word2vec or polynomial expansion, these data yield an embarrassment of riches for learning models and the data scientists who train them. However, these extremely rich feature sets come at a cost. High-dimensional feature spaces almost always include many redundant or noisy dimensions. These low-information features waste space and computation, and reduce the quality of learning models by diluting useful features.
In this talk, Erlandson will describe how Random Forest Clustering identifies useful features in data having many low-quality features, and will demonstrate a feature reduction application using Apache Spark to analyze compute infrastructure telemetry data.
Learn the principles of how Random Forest Clustering solves feature reduction problems, and how you can apply Random Forest tools in Apache Spark to improve your model training scalability, the quality of your models, and your understanding of application domains.
Michael McCune, Steve Pousty
Data crunching and web serving have existed in very separate worlds. Access by a web application to analysis required a long process of Extract, Transform, Load (ETL), database work, and imports and exports, as well as getting network and storage assistance. The rise of containers, orchestration, and more cost-effective computing and networking has resulted in a convergence, creating the possibility of using the same hardware and, more importantly, clustering software to converge both types of workloads. In this session, we’ll discuss a high level vision of this approach with containers, Kubernetes, web servers, and Apache Spark. We’ll show a demo of how this convergence helps data analysis move from custom R or Python scripts on an analyst’s desktop to an accessible web app, while letting the analyst simultaneously constrain the analysis to prevent statistical overreach.
Developers love Linux containers, which neatly package up an application and its dependencies and are easy to create and share. However, this unbeatable developer experience hides some deployment challenges for real applications: how do you wire together pieces of a multi-container application? Where do you store your persistent data if your containers are ephemeral? Do containers really contain and isolate your application, or are they merely hiding potential security vulnerabilities? Are your containers scheduled across your compute resources efficiently, or are they trampling on one another?
Container application platforms like Kubernetes provide the answers to some of these questions. We’ll draw on expertise in Linux security, distributed scheduling, and the Java Virtual Machine to dig deep on the performance and security implications of running in containers. This talk will provide a deep dive into tuning and orchestrating containerized Spark applications. You’ll leave this talk with an understanding of the relevant issues, best practices for containerizing data-processing workloads, and tips for taking advantage of the latest features and fixes in Linux Containers, the JDK, and Kubernetes. You’ll leave inspired and enabled to deploy high-performance Spark applications without giving up the security you need or the developer-friendly workflow you want.
Algorithms for sketching probability distributions from large data sets are a fundamental building block of modern data science. Sketching plays a role in diverse applications ranging from visualization, optimizing data encodings, and estimating quantiles to data synthesis and imputation. The T-Digest is a versatile sketching data structure. It operates on any numeric data, models tricky distribution tails with high fidelity, and most crucially it works smoothly with aggregators and map-reduce.
T-Digest is a perfect fit for Apache Spark; it is single-pass and intermediate results can be aggregated across partitions in batch jobs or aggregated across windows in streaming jobs. In this talk I will describe a native Scala implementation of the T-Digest sketching algorithm and demonstrate its use in Spark applications for visualization, quantile estimations and data synthesis.
Attendees of this talk will leave with an understanding of data sketching with T-Digest sketches, and insights about how to apply T-Digest to their own data analysis applications.
Erik Erlandson, Trevor Mckay
DevOps engineers have applied a great deal of creativity and energy to invent tools that automate infrastructure management, in the service of deploying capable and functional applications. For data-driven applications running on Apache Spark, the details of instantiating and managing the backing Spark cluster can be a distraction from focusing on the application logic. In the spirit of DevOps, automating Spark cluster management tasks allows engineers to focus their attention on application code that provides value to end-users.
Using OpenShift Origin as a laboratory, we implemented a platform where Apache Spark applications create their own clusters and then dynamically manage their own scale via host-platform APIs. This makes it possible to launch a fully elastic Spark application with little more than the click of a button.
We will present a live demo of turn-key deployment for elastic Apache Spark applications, and share what we’ve learned about developing Spark applications that manage their own resources dynamically with platform APIs.
The audience for this talk will be anyone looking for ways to streamline their Apache Spark cluster management, reduce the workload for Spark application deployment, or create self-scaling elastic applications. Attendees can expect to learn about leveraging APIs in the Kubernetes ecosystem that enable application deployments to manipulate their own scale elastically.
Apache Spark is one of the most exciting open-source data-processing frameworks today. It features a range of useful capabilities and an unusually developer-friendly programming model. However, the ease of getting a simple Spark application running can hide some of the challenges you might face while going from a proof of concept to a real-world application. This talk will distill our experiences as early adopters of Spark in production, present a case study where using Spark effectively provided huge benefits over legacy solutions, explain why we migrated from a dedicated Spark cluster to OpenShift, and provide concrete advice regarding:
how to integrate Spark with external data sources (including databases, in-memory data grids, and message queues),
how best to deploy and manage Spark in the cloud,
the tradeoffs of various archive storage options for Spark,
how to evaluate predictive models and make sense of the analytic components of insightful applications, and
how to integrate Spark into microservice applications on OpenShift.
This talk assumes some familiarity with Apache Spark but will provide context for attendees who are new to Spark. You’ll learn from a seasoned Red Hat engineer with over three years of experience running Spark in production and contributing to the Spark community.
William Benton, Michael McCune
Nearly all of today’s most exciting applications are insightful applications: they employ machine learning and large-scale data processing to improve with longevity and popularity. It’s an easy bet that the important applications of tomorrow will be insightful as well. It’s also an easy bet that you’ll want to be deploying tomorrow’s applications on a contemporary container platform with a great developer workflow like OpenShift.
Insightful applications pose some new challenges for developers, but this hands-on workshop will show you how to navigate them confidently. You’ll learn how to develop an insightful application on OpenShift with Apache Spark from the ground up. We’ll cover:
architectures for analytic applications and microservices;
a crash course in Apache Spark, some data science techniques, and OpenShift;
how to deploy Apache Spark as part of an OpenShift application; and
building a data-driven application from the ground up.
This workshop is largely self-contained: the only prerequisite is some familiarity with Python. Learn from the experience of Red Hat emerging technology engineers who are focused on bringing data-driven application development to OpenShift!
Everybody cares about a place (where they live, where they grew up, where they had a great vacation, one that is in the news, and so on). With the rise of open data, big data tooling, and new visualisation technology, we can actually now build applications that give people new ways to explore beyond “where is the closest Starbucks”. I have collected Open Data from my home town (Santa Cruz, CA) and compiled it into the beginnings of a visualization and analysis platform. The goal of this talk is to show the process of collecting open data from disparate sources, some of the caveats on being able to put them together, general lessons learned, and some fun visualizations. I want to move past thinking about sources for open data and move on to tools and lessons so you can get cracking! I want to show how we can enable people to gather open data and turn it into open knowledge. Data sources will be from Government (e.g. United States Geological Survey) and Non-Government sources (e.g. Audubon Society eBird Data) while some of the tools covered will be Apache Spark, PostGIS, Leaflet, and various others.
Apache Spark based applications are often comprised of many separate, interconnected components that are a good match for an orchestrated containerized platform like OpenShift, which is built on Kubernetes. But with the increased flexibility afforded by these technologies comes a new set of challenges for building rich data-centric applications. Mike starts off with how to build Apache Spark application pipelines and then walks through a demo of building one on OpenShift. He also gives some great insights into the road ahead for Apache Spark on OpenShift.
Apache Spark based applications are often comprised of many separate, interconnected components that are a good match for an orchestrated containerized platform like Kubernetes. But with the increased flexibility afforded by these technologies comes a new set of challenges for building rich data-centric applications.
In this presentation we will discuss techniques for building multi-component Apache Spark based applications that can be easily deployed and managed on a Kubernetes infrastructure. Building on experiences learned while developing and deploying cloud native applications on an OpenShift platform, we will explore common issues that arise during the engineering process and demonstrate workflows for easing the maintenance factors associated with complex installations.
For most of my lifetime in the computing world, data crunching and web serving were two very separate worlds. If a web app wanted access to the analysis there was a long process of ETL, DB works, imports and exports, and bribing various network and storage people for the resources you needed. With the rise of containers, orchestration, cheap computing and networking, and over 10 years of people tackling large problems at new scales we have finally come to a convergence. It is now possible for us to actually use the same hardware, and more importantly, clustering software to converge both types of workloads. I am going to lay out how this can look with Containers, Kubernetes, web servers, and Apache Spark. This can be considered a germ of what we can look to build in the future. I will demo this in action and show this is actually now achievable for mere mortals such as myself. Finally I will close with some thought experiments on what this can enable for the future. I know this is a keynote but I am hoping we can make it interactive with discussion and experience sharing!
Apache Spark can be made natively aware of Kubernetes by implementing a Spark scheduler back-end that can run Spark application Drivers and bare Executors in Kubernetes pods. In this talk, Erik will explain the design of a native-Kubernetes scheduler back-end in Spark and demonstrate a Spark application submission with OpenShift.
Consider two recent trends in application development: more and more applications are taking advantage of architectures involving containerized microservices in order to enable improved elasticity, fault-tolerance, and scalability — whether in the public cloud or on-premise. In addition, analytic capabilities and scalable data processing have increasingly become a basic requirement for contemporary applications. The confluence of these trends suggests that there are a lot of good reasons to want to manage Spark with a container orchestration platform, but it’s not quite as simple as packaging up a standalone cluster in containers. This talk will present our team’s experiences migrating a production Spark cluster from a multi-tenant Mesos cluster to a shared compute resource managed by Kubernetes. We’ll explain the motivation behind microservices and containers and identify the architectures that make sense for containerized applications that depend on Spark. We’ll pay special attention to practical concerns of running Spark in containers, including networking, access control, persistent storage, and multitenancy. You’ll leave this talk with a better understanding of why you might want to run Spark in containers and some concrete ideas for how to get started doing it.
The first meeting of the Big Data Special Interest Group expanded on a previous Commons session entitled Big Data and Apache Spark on OpenShift (Part 1), which kicked off the Big Data SIG.
In the previous session, Red Hat’s Will Benton gave us a vocabulary for talking about data-driven applications and outlined some example architectures for building data-driven applications with microservices. In this SIG session, he gave us an introduction to using Apache Spark on OpenShift and walked through an example data-driven application.
In this introductory Big Data briefing session, Red Hat’s Will Benton gave an overview of Big Data architecture and concepts to help level the playing field. This video will give us a better understanding of what a data-intensive application should actually look like on a modern container orchestration platform, and help kick off the OpenShift Commons Big Data SIG.
In this recording, you’ll learn about the anatomy of data-intensive applications, how they come to life, and what they have to accomplish. We walked through a few applications and explored their responsibilities, saw how they use data, discussed trade-offs they must negotiate, and pointed to some example architectures that make sense for realizing data-intensive applications on OpenShift.
Contemporary applications and infrastructure software leave behind a tremendous volume of metric and log data. This aggregated “digital exhaust” is inscrutable to humans and difficult for computers to analyze, since it is vast, complex, and not explicitly structured. This session will introduce the log processing domain and provide practical advice for analyzing log data with Apache Spark, including:
how to impose a uniform structure on disparate log sources;
machine-learning techniques to detect infrastructure failures automatically and characterize the text of log messages;
best practices for tuning Spark, training models against structured data, and ingesting data from external sources like ElasticSearch; and
a few relatively painless ways to visualize your results.
You’ll have a better understanding of the unique challenges posed by infrastructure log data after this session. You’ll also learn the most important lessons from our efforts both to develop analytic capabilities for an open-source log aggregation service and to evaluate these at enterprise scale.
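The first item in the list above, imposing a uniform structure on disparate log sources, can be illustrated with a short Python sketch. The two input formats, the field names, and the `normalize` function are all hypothetical and chosen only for this example; real log sources would need their own patterns, and at scale the same per-line function would typically be applied inside a Spark job.

```python
import re
from datetime import datetime

# Hypothetical format 1: "2017-03-01T12:00:00 [error] scheduler - message"
BRACKETED = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>\w+)\] (?P<source>\S+) - (?P<message>.*)$"
)
# Hypothetical format 2: key=value pairs, with optional double quotes.
KEY_VALUE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def normalize(line):
    """Map one raw log line onto a common record, or None if unrecognized."""
    match = BRACKETED.match(line)
    if match:
        fields = match.groupdict()
        timestamp = datetime.strptime(fields["ts"], "%Y-%m-%dT%H:%M:%S")
    else:
        pairs = {k: quoted or bare for k, quoted, bare in KEY_VALUE.findall(line)}
        if not {"time", "level", "service", "msg"} <= pairs.keys():
            return None
        timestamp = datetime.strptime(pairs["time"], "%Y-%m-%d %H:%M:%S")
        fields = {
            "level": pairs["level"],
            "source": pairs["service"],
            "message": pairs["msg"],
        }
    return {
        "timestamp": timestamp,
        "level": fields["level"].upper(),
        "source": fields["source"],
        "message": fields["message"],
    }


lines = [
    "2017-03-01T12:00:00 [error] scheduler - lost executor 3",
    'time="2017-03-01 12:00:05" level=warn service=router msg="backend timed out"',
]
records = [normalize(line) for line in lines]
```

Once every source maps onto the same record shape, downstream steps such as feature extraction and model training can treat the data as a single table.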
Successful companies use analytic measures to identify and reward their best projects and contributors. Successful open source developers often make similar decisions when they evaluate whether or not to reward a project or community by investing their time. This talk will show how Spark enables a data-driven understanding of the dynamics of open source communities, using operational data from the Fedora Project as an example. With thousands of contributors and millions of users, Fedora is one of the world’s largest open-source communities. Notably, Fedora also has completely open infrastructure: every event related to the project’s daily operation is logged to a public messaging bus, and historical event data are available in bulk. We’ll demonstrate best practices for using Spark SQL to ingest bulk data with rich, nested structure, using ML pipelines to make sense of software community data, and keeping insights current by processing streaming updates.
Spark’s support for efficient execution and rapid interactive prototyping enables novel approaches to understanding data-rich domains that have historically been underserved by analytical techniques. One such field is endurance sports, where athletes are faced with GPS and elevation traces as well as samples from heart rate, cadence, temperature, and wattage sensors. These data streams can be somewhat comprehensible at any given moment, when looking at a small window of samples on one’s watch or cycle computer, but are overwhelming in the aggregate.
In this talk, I’ll present my recent efforts using Spark and MLlib to mine my personal cycling training data for deeper insights and help me design workouts to meet particular fitness goals. This work incorporates analysis of geographic and time-series data, computational geometry, visualization, and domain knowledge of exercise physiology. I’ll show how Spark made this work possible, demonstrate some novel techniques for analyzing fitness data, and discuss how these approaches could be applied to make sense of data from an entire community of cyclists.