Apache Spark has been a great driver not only of Scala adoption, but also of introducing a new generation of developers to functional programming concepts. As Spark places more emphasis on its newer DataFrame & Dataset APIs, it’s important to ask ourselves how we can benefit from them while still keeping our fun functional roots. We will explore the cases where the Dataset APIs empower us to do cool things we couldn’t before, what the different approaches to serialization mean, and how to figure out when the shiny new API is actually just trying to steal your lunch money (aka CPU cycles).
Building Recoverable (and optionally async) Pipelines with Apache Spark (+ s... - Holden Karau
This document summarizes Holden Karau's presentation on building recoverable pipelines with Apache Spark. The presentation explored ways that Spark jobs can fail late, presented initial attempts to make a WordCount job recoverable, and discussed improvements to the approach using non-blocking saves and the Spark DAG. The presentation concluded with recommendations to replace WordCount with a real pipeline and clean up files, as well as links for learning more about Spark.
Big Data Beyond the JVM - Strata San Jose 2018 - Holden Karau
The document discusses accelerating big data processing beyond just the Java Virtual Machine (JVM). It introduces Rachel Warren and Holden Karau, the presenters. It then covers the current state of PySpark and its performance limitations due to serialization between Python and the JVM. Future improvements discussed include using Apache Arrow to accelerate UDFs, Dask for pure Python processing, and Apache Beam for additional languages. The presenters promote their new book on high performance Spark and take questions at the end.
Testing and validating distributed systems with Apache Spark and Apache Beam ... - Holden Karau
As distributed data parallel systems, like Spark, are used for more mission-critical tasks, it is important to have effective tools for testing and validation. This talk explores the general considerations and challenges of testing systems like Spark through spark-testing-base and other related libraries.
With over 40% of folks automatically deploying the results of their Spark jobs to production, testing is especially important. Many of the tools for working with big data systems (like notebooks) are great for exploratory work, and can give a false sense of security (as well as additional excuses not to test). This talk explores why testing these systems is hard, special considerations for simulating "bad" partitioning, figuring out when your stream tests are stopped, and solutions to these challenges.
Validating big data jobs - Spark AI Summit EU - Holden Karau
As big data jobs move from the proof-of-concept phase into powering real production services, we have to start considering what will happen when everything eventually goes wrong (such as recommending inappropriate products or other decisions taken on bad data). This talk will attempt to convince you that we will all eventually get aboard the failboat (especially with ~40% of respondents automatically deploying their Spark jobs results to production), and it's important to automatically recognize when things have gone wrong so we can stop deployment before we have to update our resumes.
Figuring out when things have gone terribly wrong is trickier than it first appears, since we want to catch the errors before our users notice them (or failing that before CNN notices them). We will explore general techniques for validation, look at responses from people validating big data jobs in production environments, and libraries that can assist us in writing relative validation rules based on historical data.
For folks working in streaming, we will talk about the unique challenges of attempting to validate in a real-time system, and what we can do besides keeping an up-to-date resume on file for when things go wrong. To keep the talk interesting, real-world examples (with company names removed) will be presented, as well as several Creative Commons-licensed cat pictures and an adorable panda GIF.
If you’ve seen Holden’s previous testing Spark talks, this can be viewed as a deep dive on the second half, focused on what else we need to do besides good testing practices to create production-quality pipelines. If you haven’t seen the testing talks, watch those on YouTube after you come see this one.
The magic of (data parallel) distributed systems and where it all breaks - Re... - Holden Karau
Distributed systems can seem magical, and sometimes all of the magic works and our job succeeds. However, if you've worked with them for long enough, you've found a few places where the magic starts to break down, and discovered that it's actually a collection of several hundred garden gnomes* rather than a single large garden gnome.
This talk will use Apache Spark, Beam, Flink, Kafka, and Map Reduce to explore the world of data parallel distributed systems. We'll start with some happy pieces of magic, like how we can combine different transformations into a single pass over the data, working between different languages, data partitioning, and lambda serialization. After each new piece of magic is introduced we'll look at how it breaks in one (or two) of the systems.
Come to be told it's not your fault everything is broken, or, if your distributed software still works, for an exciting preview of everything that's going to go wrong. Don't work with distributed systems? Come to be reassured you've made good life choices.
A fast introduction to PySpark with a quick look at Arrow based UDFs - Holden Karau
This talk will introduce Apache Spark (one of the most popular big data tools), the different built ins (from SQL to ML), and, of course, everyone's favorite wordcount example. Once we've got the nice parts out of the way, we'll talk about some of the limitations and the work being undertaken to improve those limitations. We'll also look at the cases where Spark is more like trying to hammer a screw. Since we want to finish on a happy note, we will close out with looking at the new vectorized UDFs in PySpark 2.3.
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in PySpark, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as a look to the future at data property type accumulators which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Debuggers are a wonderful tool; however, when you have 100 computers, the “wonder” can be a bit more like “pain”. This talk will look at how to connect remote debuggers, but also remind you that it’s probably not the easiest path forward.
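As a rough, hedged illustration of the accumulator-style debugging described above (the input path, record format, and validation rule are invented for the example, not taken from the talk):

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

object AccumulatorDebugExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("accumulator-debug").getOrCreate()
    val sc = spark.sparkContext

    // Count records we had to skip, without failing the whole job.
    val badRecords = sc.longAccumulator("badRecords")

    val parsed = sc.textFile("input.txt").flatMap { line =>
      line.split(",") match {
        case Array(id, value) if Try(value.toDouble).isSuccess =>
          Some((id, value.toDouble))
        case _ =>
          badRecords.add(1)
          None
      }
    }

    // Accumulator values are only dependable after an action has run, and
    // cached/recomputed partitions can inflate them - treat them as a debugging aid.
    println(s"parsed ${parsed.count()} records, skipped ${badRecords.value} bad ones")
    spark.stop()
  }
}
```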
Validating big data pipelines - FOSDEM 2019 - Holden Karau
This document discusses validating data pipelines built with Apache Spark and Apache Airflow. It emphasizes that tests are not perfect and failures will occur, so validation is important to minimize impacts. Simple validation rules can check for invalid records, changes in data distributions, and schema mismatches. Validation rules can run as separate Spark jobs and metrics from jobs can be compared against expected values. Airflow can coordinate validation jobs and check for anomalies before publishing results. Overall, the key is to have validation rules that alert infrequently but catch meaningful issues.
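A minimal sketch of what one such rule might look like, assuming a simple row-count metric compared against the last known-good run (the paths and the 25% threshold are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession

object RowCountValidation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("validate-output").getOrCreate()

    // Hypothetical paths: today's pipeline output and a stored copy of the last good run.
    val todayCount    = spark.read.parquet("/data/output/today").count()
    val previousCount = spark.read.parquet("/data/output/last_good").count()

    // Relative rule: alert only on large swings so the check fires infrequently.
    val change = math.abs(todayCount - previousCount).toDouble / math.max(previousCount, 1L)
    if (change > 0.25) {
      // In a real pipeline this would fail the Airflow task instead of just erroring out.
      sys.error(f"Row count moved by ${change * 100}%.1f%% - refusing to publish")
    }
    spark.stop()
  }
}
```

In practice the job's own metrics (such as records read and written) can be collected the same way and compared relatively, so the rule alerts only on large, meaningful swings.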
Validating big data pipelines - Scala eXchange 2018 - Holden Karau
Note: the link to the resource page should have been http://bit.ly/2QRVw0S
As big data jobs move from the proof-of-concept phase into powering real production services, you will need to consider what will happen when everything eventually goes wrong (such as recommending inappropriate products or other decisions taken on bad data).
During this talk, you will discover that you will eventually get aboard the failboat (especially with ~40% of respondents automatically deploying their Spark jobs results to production). It's important to automatically recognise when things have gone wrong, so you can stop deployment before you have to update your resume.
Figuring out when things have gone terribly wrong is trickier than it first appears, since you want to catch the errors before your users notice them (or failing that before CNN notices them). We will explore general techniques for validation, look at responses from people validating big data jobs in production environments, and libraries that can assist you in writing relative validation rules based on historical data. For folks working in streaming, you will learn about the unique challenges of attempting to validate in a real-time system, and what you can do besides keeping an up-to-date resume on file for when things go wrong.
You will discover code examples in Apache Spark, as well as learn about similar concepts in Apache BEAM (a cross platform tool), but the techniques should be applicable across systems.
Real-world examples (with company names removed) will be presented, as well as several Creative Commons-licensed cat pictures.
This document discusses contributing to Apache Spark. It provides an overview of finding issues to work on, the different components of Spark one could contribute to, and the process for contributing code changes through pull requests and code reviews. Key steps include searching Spark's JIRA issue tracker for starter issues, choosing a component to work in, making code and test changes, submitting a pull request for review, addressing review feedback, and getting the change merged once approved.
Intro - End to end ML with Kubeflow @ SignalConf 2018 - Holden Karau
There are many great tools for training machine learning models, ranging from scikit-learn to Apache Spark and TensorFlow. However, many of these systems largely leave open the question of how to use our models outside of the batch world (like in a reactive application). Different options exist for persisting the results and using them for live training, and we will explore the trade-offs of the different formats and their corresponding serving/prediction layers.
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018 - Holden Karau
Big Data applications are increasingly being run on Kubernetes. Data scientists commonly use python-based workflows, with tools like PySpark and Jupyter for wrangling large amounts of data. The Kubernetes community over the past year has been actively investing in tools and support for frameworks such as Apache Spark, Jupyter and Apache Airflow. Attendees will learn how these tools can be used together to build a scalable self-service platform for data science on Kubernetes as well as the benefits that Kubernetes can provide over traditional options.
Getting started contributing to Apache Spark - Holden Karau
Are you interested in contributing to Apache Spark? This workshop and associated slides walk through the basics of contributing to Apache Spark as a developer. This advice is based on my 3 years of contributing to Apache Spark but should not be considered official in any way.
Validating Big Data Pipelines - Big Data Spain 2018 - Holden Karau
As big data jobs move from the proof-of-concept phase into powering real production services, we have to start considering what will happen when everything eventually goes wrong (such as recommending inappropriate products or other decisions taken on bad data). This talk will attempt to convince you that we will all eventually get aboard the failboat (especially with ~40% of respondents automatically deploying their Spark jobs results to production), and it’s important to automatically recognize when things have gone wrong so we can stop deployment before we have to update our resumes.
Extending spark ML for custom models now with python! - Holden Karau
Are you interested in adding your own custom algorithms to Spark ML? This is the talk for you! See the companion examples in High Performance Spark and the Sparkling ML project.
Introduction to and Extending Spark ML - Holden Karau
This document discusses extending Spark ML pipelines with custom estimators and transformers. It begins with an overview of Spark ML and the pipeline API. Then it demonstrates how to build a simple hardcoded word count transformer and configurable transformer. It discusses important aspects like transforming the input schema, parameters, and model fitting. The document provides guidance on configuration, persistence, serving models, and resources for learning more about custom Spark ML components.
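As a hedged sketch of the kind of hardcoded word count transformer the summary mentions (the column names and behaviour are assumptions for illustration, not the talk's exact code), a minimal Spark ML Transformer could look roughly like this:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, size, split}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical minimal transformer: counts words in a hardcoded "text" column.
class HardcodedWordCount(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("hardcodedWordCount"))

  override def transformSchema(schema: StructType): StructType = {
    // Validate the input column exists and describe the added output column.
    require(schema.fieldNames.contains("text"), "Input must have a 'text' column")
    schema.add(StructField("wordCount", IntegerType, nullable = false))
  }

  override def transform(dataset: Dataset[_]): DataFrame = {
    // Split on whitespace and count the resulting tokens.
    dataset.withColumn("wordCount", size(split(col("text"), "\\s+")))
  }

  override def copy(extra: ParamMap): HardcodedWordCount = defaultCopy(extra)
}
```

A configurable version would replace the hardcoded "text" and "wordCount" names with input/output column Params, which is the next step the document describes.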
Debugging Spark: Scala and Python - Super Happy Fun Times @ Data Day Texas 2018 - Holden Karau
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in your job.
The talk will wrap up with Holden trying to get everyone to buy several copies of her new book, High Performance Spark.
Powering tensor flow with big data using apache beam, flink, and spark cern... - Holden Karau
This document summarizes Holden Karau's presentation on powering TensorFlow with big data using Apache Beam, Apache Spark, and Apache Flink. The presentation covers why deep learning requires large datasets for training, how to prepare features from big data for TensorFlow using TensorFlow Transform, and how TensorFlow Transform can run on Apache Beam and integrate feature preparation into model serving. It also discusses challenges in integrating Python and big data systems beyond the Java Virtual Machine and efforts to improve cross-language interoperability.
Using Spark ML on Spark Errors - What do the clusters tell us? - Holden Karau
If you’re subscribed to [email protected], or work in a large company, you may see some common Spark error messages. Even attending Spark Summit over the past few years you have seen talks like the “Top K Mistakes in Spark.” While cool non-machine-learning-based tools do exist to examine Spark’s logs, they don’t use machine learning and therefore are not as cool, but they are also limited by the amount of effort humans can put into writing rules for them. This talk will look at what happens when we train “regular” clustering models on stack traces, and explore DL models for classifying user messages to the Spark list. Come for the reassurance that the robots are not yet able to fix themselves, and stay to learn how to work better with the help of our robot friends. The tl;dr of this talk is Spark ML on Spark output, plus a little bit of TensorFlow, is fun for the whole family, but probably shouldn’t automatically respond to user list posts just yet.
Debugging PySpark: Spark Summit East talk by Holden Karau - Spark Summit
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging, as well as a look to the future at data property type accumulators which may be coming to Spark in a future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Apache Spark Super Happy Funtimes - CHUG 2016 - Holden Karau
This document provides an introduction to Apache Spark, including:
- An overview of what Spark is and the types of problems it can solve
- A brief look at the Spark API through the word count example
- Details on Spark's core abstractions of RDDs and how transformations and actions work
- Potential pitfalls of using groupByKey and how reduceByKey is preferable (see the sketch after this list)
- Resources for learning more about Spark including books and video tutorials
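For the groupByKey/reduceByKey bullet above, a minimal word count sketch (the input path is invented) of why reduceByKey is usually preferable:

```scala
import org.apache.spark.sql.SparkSession

object WordCountShuffleExample {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().appName("wordcount").getOrCreate().sparkContext
    val words = sc.textFile("input.txt").flatMap(_.split("\\s+")).map((_, 1))

    // groupByKey ships every individual (word, 1) pair across the network
    // before summing, which can blow up on skewed keys.
    val withGroupBy = words.groupByKey().mapValues(_.sum)

    // reduceByKey sums within each partition first (map-side combine),
    // so far less data is shuffled.
    val withReduceBy = words.reduceByKey(_ + _)

    withReduceBy.take(10).foreach(println)
  }
}
```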
Spark Autotuning Talk - Strata New York - Holden Karau
This document discusses how to tune Apache Spark jobs for optimal performance. It begins with introductions of the presenters and an overview of what will be covered, including the most important Spark settings, using the auto tuner, examples of common errors that can be addressed by tuning, and collecting historical data. Examples are provided of how to address errors like out of memory issues by increasing resources or adjusting partitioning. While tuning can help with many issues, some problems like unnecessary shuffles or unbalanced data cannot be addressed without code changes.
Many of the recent big data systems, like Hadoop, Spark, and Kafka, are written primarily in JVM languages. At the same time, there is a wealth of tools for data science and data analytics that exist outside of the JVM. Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Holden and Rachel detail how to bridge the gap using PySpark and discuss other solutions like Kafka Streams as well. They also outline the challenges of pure Python solutions like dask. Holden and Rachel start with the current architecture of PySpark and its evolution. They then turn to the future, covering Arrow-accelerated interchange for Python functions, how to expose Python machine learning models into Spark, and how to use systems like Spark to accelerate training of traditional Python models. They also dive into what other similar systems are doing as well as what the options are for (almost) completely ignoring the JVM in the big data space.
Python users will learn how to more effectively use systems like Spark and understand how the design is changing. JVM developers will gain an understanding of how to work with Python code from data scientists and Python developers while avoiding the traditional trap of needing to rewrite everything.
Streaming & Scaling Spark - London Spark Meetup 2016 - Holden Karau
This talk walks through a number of common mistakes which can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production. It covers topics like effective RDD re-use, considerations for working with key/value data, and finishes up with an introduction to Datasets with Structured Streaming (new in Spark 2.0) and how to do weird things with them.
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018 - Holden Karau
The document discusses Apache Spark Datasets and how they compare to RDDs and DataFrames. Some key points:
- Datasets provide better performance than RDDs due to a smarter optimizer, more efficient storage formats, and faster serialization. They also offer simplicity advantages over RDDs for things like windowed operations and multi-column aggregates.
- Datasets allow mixing of functional and relational styles more easily than RDDs or DataFrames. The optimizer has more information from Datasets' schemas and can perform optimizations like partial aggregation.
- Datasets address some of the limitations of DataFrames, making it easier to write UDFs and handle iterative algorithms. They provide a typed API compared to the untyped DataFrame API (see the sketch after this list).
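A small sketch of the mixed functional/relational style described in these bullets (the Purchase schema and values are invented for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

case class Purchase(user: String, amount: Double)

object MixedStyleExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("datasets-mixed").getOrCreate()
    import spark.implicits._

    val purchases = Seq(Purchase("ada", 10.0), Purchase("grace", 42.0)).toDS()

    // Functional style: a typed filter over the case class...
    val large = purchases.filter(p => p.amount > 20.0)

    // ...mixed with relational style: Catalyst-optimized grouping and aggregation.
    val perUser = large.groupBy($"user").agg(avg($"amount").as("avg_amount"))

    perUser.show()
    spark.stop()
  }
}
```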
Introduction to Spark Datasets - Functional and relational together at last - Holden Karau
Spark Datasets are an evolution of Spark DataFrames which allow us to work with both functional and relational transformations on big data with the speed of Spark.
Beyond Shuffling and Streaming Preview - Salt Lake City Spark Meetup - Holden Karau
Spark is a general purpose distributed system for large-scale data processing. The presentation covers techniques for scaling Apache Spark jobs including caching and persisting RDDs, avoiding shuffle explosions using reduceByKey instead of groupByKey, and using Datasets for strongly typed operations. It also introduces structured streaming, a new feature in Spark 2.0 for building continuous data pipelines on streaming data.
Beyond shuffling - Scala Days Berlin 2016 - Holden Karau
This session will cover our & the community's experiences scaling Spark jobs to large datasets and the resulting best practices, along with code snippets to illustrate.
The planned topics are:
Using Spark counters for performance investigation
Spark collects a large number of statistics about our code, but how often do we really look at them? We will cover how to investigate performance issues and figure out where to best spend our time using both counters and the UI.
Working with Key/Value Data
Replacing groupByKey for awesomeness
groupByKey makes it too easy to accidentally collect individual records which are too large to process. We will talk about how to replace it in different common cases with more memory efficient operations.
Effective caching & checkpointing
Being able to reuse previously computed RDDs without recomputing can substantially reduce execution time. Choosing when to cache, checkpoint, or what storage level to use can have a huge performance impact (see the sketch after this list).
Considerations for noisy clusters
Functional transformations with Spark Datasets
How to have some of the benefits of Spark’s DataFrames while still having the ability to work with arbitrary Scala code
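A brief sketch of the caching and checkpointing topic above (paths are invented; when to persist or checkpoint depends on how often the data is reused and how expensive the lineage is to replay):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("caching").getOrCreate()
    val sc = spark.sparkContext
    sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path

    val cleaned = sc.textFile("events.txt")
      .filter(_.nonEmpty)
      .map(_.toLowerCase)

    // Persist because we reuse `cleaned` for two different actions below;
    // MEMORY_AND_DISK spills to disk instead of recomputing when memory is tight.
    cleaned.persist(StorageLevel.MEMORY_AND_DISK)

    // Checkpointing truncates the lineage, which helps long jobs recover
    // without replaying every upstream transformation.
    cleaned.checkpoint()

    val total  = cleaned.count()
    val errors = cleaned.filter(_.contains("error")).count()
    println(s"$errors errors out of $total events")
    spark.stop()
  }
}
```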
A super fast introduction to Spark and glance at BEAM - Holden Karau
Apache Spark is one of the most popular general purpose distributed systems, with built in libraries to support everything from ML to SQL. Spark has APIs across languages including Scala, Java, Python, and R -- with more 3rd party language support (like Julia & C#). Apache BEAM is a cross-platform tool for building on top of different distributed systems, but it's in its early stages. This talk will introduce the core concepts of Apache Spark, and look to the potential future of Apache BEAM.
Apache Spark has two core abstractions for representing distributed data and computations. This talk will introduce the basics of RDDs and Spark DataFrames & Datasets, and Spark’s method for achieving resiliency. Since it’s a big data talk, we will include the almost-required wordcount example, and end the Spark part with follow-up pointers on Spark’s new ML APIs. For folks who are interested, we’ll then talk a bit about portability, and how Apache BEAM aims to improve portability (as well as its unique approach to cross-language support).
Slides from Holden's talk at https://www.meetup.com/Wellington-Data-Scaling-Chats/events/mdcsdpyxcbxb/
Holden Karau walks attendees through a number of common mistakes that can keep your Spark programs from scaling and examines solutions and general techniques useful for moving beyond a proof of concept to production.
Topics include:
Working with key/value data
Replacing groupByKey for awesomeness
Key skew: your data probably has it and how to survive
Effective caching and checkpointing
Considerations for noisy clusters
Functional transformations with Spark Datasets: getting the benefits of Catalyst with the ease of functional development
How to make our code testable
Apache spark as a gateway drug to FP concepts taught and broken - Curry On 2018 - Holden Karau
Apache Spark has driven a lot of adoption of both Scala and functional programming concepts in non-traditional industries. Many programmers in the big data world come looking for a solution to scaling their code, quickly find themselves dealing with immutable data structures and lambdas, and those who love it stay. However, there is a dark side (of escape): much of Spark’s functional programming is changing, and even though it encourages functional programming, it does so in a variety of languages with different expectations (in-line XML as a valid part of your language is fun!). This talk will look at how Spark does a good job of introducing folks to concepts like immutability, but also places where we maybe don’t do a great job of setting up developers for a life of functional programming: things like accumulators, our three different models for streaming data, and an “interesting” approach to closures (come to find out what the ClosureCleaner does, stay to find out why). The talk will close out with a look at how the functional-inspired API is exposed in the different languages, and how this impacts the kind of code written (Scala, Java, and Python -- other languages are supported by Spark but I don’t want to re-learn Javascript or learn R just for this talk). Pictures of cute animals will be included in the slides to distract from the sad parts.
Video: https://www.youtube.com/watch?v=EDJfpkDpoE4
Getting The Best Performance With PySpark - Spark Summit
This document provides an overview of techniques for getting the best performance with PySpark. It discusses RDD reuse through caching and checkpointing. It explains how to avoid issues with groupByKey by using reduceByKey or aggregateByKey instead. Spark SQL and DataFrames are presented as alternatives that can improve performance by avoiding serialization costs for Python users. The document also covers mixing Python and Scala code by exposing Scala functions to be callable from Python.
Beyond Shuffling, Tips and Tricks for Scaling Apache Spark updated for Spark ... - Holden Karau
Beyond Shuffling - Tips & Tricks for scaling your Apache Spark programs. This talk walks through a number of common mistakes which can keep our Spark programs from scaling and examines the solutions, as well as general techniques useful for moving beyond a proof of concept to production. It covers topics like effective RDD re-use, considerations for working with key/value data, and finishes up with an introduction to one of Spark's newest features: Datasets.
Scaling with apache spark (a lesson in unintended consequences) strange loo... - Holden Karau
This document discusses scaling Apache Spark applications and some of the unintended consequences that can arise. It covers Spark's core abstractions of RDDs and DataFrames for distributed data and computation. It explains how Spark's lazy evaluation model and use of deterministic partitioning can impact reusing data and operations like groupByKey. It also discusses challenges that can arise from Spark's support for arbitrary functions and working with non-JVM languages like Python.
Strata NYC 2015 - What's coming for the Spark community - Databricks
In the last year Spark has seen substantial growth in adoption as well as the pace and scope of development. This talk will look forward and discuss both technical initiatives and the evolution of the Spark community.
On the technical side, I’ll discuss two key initiatives ahead for Spark. The first is a tighter integration of Spark’s libraries through shared primitives such as the data frame API. The second is across-the-board performance optimizations that exploit schema information embedded in Spark’s newer APIs. These initiatives are both designed to make Spark applications easier to write and faster to run.
On the community side, this talk will focus on the growing ecosystem of extensions, tools, and integrations evolving around Spark. I’ll survey popular language bindings, data sources, notebooks, visualization libraries, statistics libraries, and other community projects. Extensions will be a major point of growth in the future, and this talk will discuss how we can position the upstream project to help encourage and foster this growth.
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016 - Holden Karau
Description
This talk assumes you have a basic understanding of Spark (if not check out one of the intro videos on youtube - http://bit.ly/hkPySpark ) and takes us beyond the standard intro to explore what makes PySpark fast and how to best scale our PySpark jobs. If you are using Python and Spark together and want to get faster jobs - this is the talk for you.
Abstract
This talk covers a number of important topics for making scalable Apache Spark programs - from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames and traditional RDDs with Python. Looking at Spark 2.0; we examine how to mix functional transformations with relational queries for performance using the new (to PySpark) Dataset API. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
Introducing Apache Spark's Data Frames and Dataset APIs workshop series - Holden Karau
This session of the workshop introduces Spark SQL along with DataFrames and Datasets. Datasets give us the ability to easily intermix relational and functional style programming. So that we can explore the new Dataset API, this iteration will be focused on Scala.
This document summarizes the upcoming features in Spark 2.0, including major performance improvements from Tungsten optimizations, unifying DataFrames and Datasets into a single API, and new capabilities for streaming data with Structured Streaming. Spark 2.0 aims to further simplify programming models while delivering up to 10x speedups for queries through compiler techniques that generate efficient low-level execution plans.
Apache Spark is a fast, general-purpose, and easy-to-use cluster computing system for large-scale data processing. It provides APIs in Scala, Java, Python, and R. Spark is versatile and can run on YARN/HDFS, standalone, or Mesos. It leverages in-memory computing to be faster than Hadoop MapReduce. Resilient Distributed Datasets (RDDs) are Spark's abstraction for distributed data. RDDs support transformations like map and filter, which are lazily evaluated, and actions like count and collect, which trigger computation. Caching RDDs in memory improves performance of subsequent jobs on the same data.
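A minimal sketch of the lazy evaluation and caching behaviour described above (the numbers are arbitrary): transformations only build up the plan, actions execute it, and caching pays off on the second action.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationExample {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().appName("lazy").getOrCreate().sparkContext

    // Transformations only record the computation; nothing runs yet.
    val numbers = sc.parallelize(1 to 1000000)
    val squares = numbers.map(n => n.toLong * n).filter(_ % 2 == 0)

    squares.cache() // mark for reuse; still nothing has executed

    // Actions trigger the actual distributed computation.
    println(squares.count())               // first action: computes and populates the cache
    println(squares.take(5).mkString(", ")) // second action: served from the cache
  }
}
```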
SparkSQL: A Compiler from Queries to RDDs - Databricks
SparkSQL, a module for processing structured data in Spark, is one of the fastest SQL on Hadoop systems in the world. This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will walk away with a deeper understanding of how Spark analyzes, optimizes, plans and executes a user’s query.
Speaker: Sameer Agarwal
This talk was originally presented at Spark Summit East 2017.
The document provides an introduction to Apache Spark and Scala. It discusses that Apache Spark is a fast and general-purpose cluster computing system that provides high-level APIs for Scala, Java, Python and R. It supports structured data processing using Spark SQL, graph processing with GraphX, and machine learning using MLlib. Scala is a modern programming language that is object-oriented, functional, and type-safe. The document then discusses Resilient Distributed Datasets (RDDs), DataFrames, and Datasets in Spark and how they provide different levels of abstraction and functionality. It also covers Spark operations and transformations, and how the Spark logical query plan is optimized into a physical execution plan.
Alpine academy apache spark series #1 introduction to cluster computing wit... - Holden Karau
Alpine Academy Apache Spark series #1: introduction to cluster computing with Python & a wee bit of Scala. This is the first in the series and is aimed at the intro level; the next one will cover MLlib & ML.
Getting the best performance with PySpark - Spark Summit West 2016 - Holden Karau
This talk assumes you have a basic understanding of Spark and takes us beyond the standard intro to explore what makes PySpark fast and how to best scale our PySpark jobs. If you are using Python and Spark together and want to get faster jobs – this is the talk for you. This talk covers a number of important topics for making scalable Apache Spark programs – from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames/Datasets and traditional RDDs with Python. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
Improving PySpark performance: Spark Performance Beyond the JVM - Holden Karau
This talk covers a number of important topics for making scalable Apache Spark programs - from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames/Datasets and traditional RDDs with Python. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
6. Who is Boo?
● Boo uses she/her pronouns (as I told the Texas house committee)
● Best doge
● Lots of experience barking at computers to make them go faster
● Author of “Learning to Bark” & “High Performance Barking”
○ Currently out of print, discussing a reprint re-run with my wife
● On twitter @BooProgrammer
7. Why does Google Cloud care about Spark?
● Lots of data!
○ We mostly use different, although similar FP-inspired, tools internally
● We have two hosted solutions for using Spark (Dataproc & GKE)
○ I have a blog post on how to try out custom/new versions of Spark if you want to help us test the next RCs (2.1.3 / 2.4 probably) - https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc
8. Who do I think y’all are?
● Friendly[ish] people
● Don’t mind pictures of cats or stuffed animals
● May or may not know some Scala
○ If you’re new to Scala welcome to the community!
● Might know some Spark
● Want to keep things functional
● Ok with things getting a little bit silly
Lori Erickson
9. What will be covered?
● What is Spark (super brief) & how it’s helped drive FP to the enterprise
● What Datasets mean for Spark instead of RDDs
● Current limitations of Datasets (and the sad implications as a result)
● What Datasets let us accomplish that we couldn’t* before
● What we can do to make this more awesome for future generations
● We’re going to talk about a lot of things we need to fix, but please remember everything has lots of things that need fixing too.
10. What is Spark?
● General purpose distributed system
○ Built in Scala with an FP-inspired API
● Apache project (one of the most active)
● Much faster than Hadoop Map/Reduce
● Good when the data is too big for a single machine
● Built on top of two abstractions for distributed data: RDDs & Datasets
11. Why people come to Spark:
“Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark?”
dougwoods
12. Why people come to Spark:
“My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that...”
brownpau
14. What is the “magic” of Spark?
● Automatically distributed functional programming :)
● DAG / “query plan” is the root of much of it
● Optimizer to combine steps
● Resiliency: recover from failures rather than protecting from failures
● “In-memory” + “spill-to-disk”
● Functional programming to build the DAG for “free” (see the sketch below)
● Select operations without deserialization
● The best way to trick people into learning functional programming
Richard Gillin
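A rough sketch of that “free” DAG building (not from the slides; the session setup and input path are assumptions) - transformations only record the plan, and nothing runs until an action:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dag-sketch").getOrCreate()
import spark.implicits._

val lines = spark.read.textFile("/tmp/input.txt") // Dataset[String]

// Each transformation just extends the DAG / query plan; nothing executes yet.
val words = lines.flatMap(_.split(" "))
val nonEmpty = words.filter(_.nonEmpty)

// Only an action triggers execution, after the optimizer has combined the steps.
val total = nonEmpty.count()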
15. The different pieces of Spark
[Diagram of the Spark stack: Apache Spark core; SQL, DataFrames & Datasets; Structured Streaming; Spark ML; GraphFrames (Scala, Java, Python, & R); plus the older MLlib, Streaming, and Bagel & GraphX (Scala, Java, Python)]
Paul Hudson
16. What Spark got right (for Scala/FP):
● Strong enforced[ish] requirement for immutable data
○ Recompute is used for failure recovery, so immutability is a core part of the logic
● Functional operators (map, filter, flatMap, etc.) - see the sketch below
● Lambdas for everyone!
○ Sometimes too many….
● Solved a “business need”
○ Even if that need was imaginary
● Made it hard to have side effects against external variables without being very explicit & verbose
○ Even then strongly discouraged
Stuart
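A minimal sketch of those functional operators on an RDD (the numbers are made up; assumes a SparkSession `spark` is already in scope) - every call returns a new, immutable RDD:

val sc = spark.sparkContext

val nums = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2) // map: transform each element
val evens = doubled.filter(_ % 4 == 0) // filter: keep a subset
val digits = evens.flatMap(n => n.toString.map(_.asDigit)) // flatMap: 0..n outputs per input

digits.collect() // only the action actually runs the job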
17. What Spark got … less right (for Scala/FP):
● Serialization… complications
○ Makes people think closures are more limited than they can be
● Lots of Map[String, String] (equivalent) settings
○ Hey buddy, can you spare a type checker?
● Hard to debug - easily conflated with Scala being hard to debug
○ Not completely unjustified sometimes
● New ML & SQL APIs without “any” types (initially)
indamage
18. What are these “new” APIs?
● First, what is “new” - replaces an old, not yet removed, working thing with something that might work
● DataFrames - not that new, kind of superseded-ish by Datasets (yay)
● “New” ML API (called ML) - Look ma, no types :(
○ We “forgot” to add a serving layer. We started, but then got bored.
● Structured Streaming
○ Hey buddy, want to try a new execution engine? It might not lose your data. Don’t pay any attention to the missing/broken windows, self-joins, changing APIs, and…. yeah, maybe give it a few months
Susanne Nilsson
19. DataFrames/Datasets
● DataFrames: Everything is a Row. Even case classes are Rows.
● Datasets: Oh shit, types were useful, let’s add those back (see the sketch below)
● More SQL inspired than functional inspired
○ select etc.
● Started out with no functional operations or types, added later (and it shows)
● Schema (not type) inference
○ “How many people know the types of their JSON data?” / eskati everyone say “fuck json”
○ If you don’t get that reference listen to lil’ pump (or not)
● No automatic tuple magic on read, instead a “Row” of pretty much anything
● Overhead to apply strict types
● Many many operations throw away types
● Required for much of Spark’s new functionality
○ RDDs will still be around, but… the cool new toys are in Datasets :(
Paul Harrison
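A small sketch of the Row-vs-typed split (RawPanda is borrowed from the later slides; the JSON path is an assumption):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class RawPanda(id: Long, zip: String, happy: Boolean, attributes: Array[Double])

val spark = SparkSession.builder().appName("df-vs-ds").getOrCreate()
import spark.implicits._

// DataFrame: everything is a Row; column names & types are only checked at runtime.
val df: DataFrame = spark.read.json("/tmp/pandas.json")
df.select("happy") // a typo'd column name fails at runtime, not compile time

// Dataset: attach the type back with as[T]; lambdas are checked at compile time.
val ds: Dataset[RawPanda] = df.as[RawPanda]
ds.filter(_.happy) // the compiler knows RawPanda has happy: Boolean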
20. Why are Datasets so awesome?
● Easier to mix functional style and relational style
○ No more hive UDFs!
● Nice performance of Spark SQL, flexibility of RDDs
○ Tungsten (better serialization)
○ Equivalent of Sortable trait
● Strongly typed
● The future (ML, Graph, etc.)
● Potential for better language interop
○ Something like Arrow has a much better chance with Datasets
○ Cross-platform libraries are easier to make & use
Will Folsom
23. What about compared to Kryo?
● Depends who you listen to
○ According to the people who wrote it, still better
● Nominally also allows sort operations directly on serialized data
○ Some restrictions do apply
● Custom classes with complex types require custom work :( (a Kryo config sketch follows below)
laurenbeth93
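For contrast, a minimal sketch of wiring up Kryo on the RDD side (the class and app names here are placeholders, not anything from the slides):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class RawPanda(id: Long, zip: String, happy: Boolean, attributes: Array[Double])

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[RawPanda])) // avoids writing full class names into every record

val spark = SparkSession.builder().config(conf).appName("kryo-sketch").getOrCreate()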
24. Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.filter($"happy" === true).
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
25. So what was that?
ds.toDF().filter($"happy" === true).as[RawPanda].
  select($"attributes"(0).as[Double]).
  reduce((x, y) => x + y)
● toDF(): convert the Dataset to a DataFrame to access more DataFrame functions (pre-2.0)
● as[RawPanda]: convert the DataFrame back to a Dataset
● select($"attributes"(0).as[Double]): a typed query (specifies the return type)
● reduce((x, y) => x + y): traditional functional reduction - arbitrary scala code :)
26. And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Chris Isherwood
27. A Word count w/Datasets (ish)
val df = spark.read.load(src).select("text")
val ds = df.as[String]
// Returns a Dataset!
val words = ds.flatMap(x => x.split(" "))
val grouped = words.groupBy("value")
val word_count = grouped.agg(count("*") as "count")
word_count.write.format("parquet").save("wc")
● Can’t push down filters from here
● If it’s a simple type we don’t have to define a case class
● Lose type information (see the typed variant sketched below)
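A typed variant that stays in the Dataset API throughout (a sketch, not from the slides; the input and output paths are assumptions) - groupByKey keeps the key type, at the cost of the grouping function being opaque to the optimizer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("typed-wc").getOrCreate()
import spark.implicits._

val ds = spark.read.load("/tmp/input.parquet").select("text").as[String]

// groupByKey keeps the type -- the result is a Dataset[(String, Long)].
val wordCount = ds
  .flatMap(_.split(" "))
  .groupByKey(identity)
  .count()

wordCount.write.format("parquet").save("/tmp/typed-wc")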
33. UDFS: Adding custom code
sqlContext.udf.register("strLen", (s: String) => s.length())

sqlCtx.registerFunction("strLen", lambda x: len(x), IntegerType())
Yağmur Adam
34. Using UDF on a table:
First register the table:
df.registerTempTable("myTable")
sqlContext.sql("SELECT firstCol, strLen(stringCol) from myTable")
35. Aggregates - Classes are fun right?
abstract class UserDefinedAggregateFunction {
  def initialize(buffer: MutableAggregationBuffer): Unit
  def update(buffer: MutableAggregationBuffer, input: Row): Unit
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit
  def evaluate(buffer: Row): Any
}
(a filled-in sketch follows below)
Sil Silv
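A minimal sketch of filling that class in (the column and class names are made up for illustration):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class SumHappiness extends UserDefinedAggregateFunction {
  // The real trait also needs these schema/type members, which the slide elides.
  def inputSchema: StructType = StructType(StructField("happiness", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("total", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0)
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
  def evaluate(buffer: Row): Double = buffer.getDouble(0)
}

// Usage: df.agg(new SumHappiness()(df("happiness")))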
36. Spark SQL Aggregates
● We could make a functional version, but we haven’t yet
● Maybe a simple, good PR for someone looking to help us keep it functional :p
○ Although to be fair there might be pushback
● Hint hint :)
37. Using UDFs Programmatically
def dateTimeFunction(format: String): UserDefinedFunction = {
  import org.apache.spark.sql.functions.udf
  udf((time: Long) => new Timestamp(time * 1000))
}
val format = "dd-mm-yyyy"
df.select(df(firstCol),
  dateTimeFunction(format)(df(unixTimeStamp).cast(TimestampType)))
38. Functions.scala: Everything is a string (or column)
● Lots of operators, yay!
● Mini sadness
● Frameless brings typed columns! (rough sketch below) - https://github.com/typelevel/frameless/blob/master/dataset/src/main/scala/frameless/TypedColumn.scala
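Very roughly what that gives you, loosely based on the frameless README - treat the exact imports and method names as assumptions, since they shift between frameless versions:

import frameless.TypedDataset
import frameless.syntax._

// Assumes RawPanda, rawPandas: Dataset[RawPanda], and the needed implicits
// (SparkSession / TypedEncoder derivation) are already in scope.
val typedPandas: TypedDataset[RawPanda] = TypedDataset.create(rawPandas)

// Column references are checked at compile time:
// typedPandas('happyy) would fail to compile rather than blow up at runtime.
val happyCols = typedPandas.select(typedPandas('happy))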
39. Spark ML pipelines
● Scikit inspired
● No types :(
○ Instead kind of hokey runtime schema checking that isn’t always correct
○ When it fails you can have a job fail after 8+ hours :(
● Frameless to the (optional) rescue - https://github.com/typelevel/frameless/tree/master/ml/src/main/scala/frameless/ml/feature
● Also similar efforts exist inside of certain companies
○ Which I wish they would open source
george erws
40. Basic Dataprep pipeline for “ML”
// Combines a list of double input features into a vector
val assembler = new VectorAssembler().setInputCols(Array("age", "education-num"))
  .setOutputCol("features")
// String indexer converts a set of strings into doubles
val indexer = new StringIndexer().setInputCol("category")
  .setOutputCol("category-index")
// Can be used to combine pipeline components together
val pipeline = new Pipeline().setStages(Array(assembler, indexer))
Huang Yun Chung
41. So a bit more about that pipeline
● Each of our previous components has a “fit” & “transform” stage
● Constructing the pipeline this way makes it easier to work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data (sketch below)
val model = pipeline.fit(df)
val prepared = model.transform(df)
Andrey
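Since the fitted model can be re-used, a small sketch of what that looks like (newDf and the save path are assumptions, not from the slides):

import org.apache.spark.ml.PipelineModel

// Apply the already-fitted stages to new data -- no re-fitting needed.
val preparedNew = model.transform(newDf)

// Fitted pipelines are writable, so they can be reloaded later for batch or serving jobs.
model.write.overwrite().save("/tmp/prep-pipeline-model")
val reloaded = PipelineModel.load("/tmp/prep-pipeline-model")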
42. What does our pipeline look like so far?
[Diagram: Input Data → Assembler → Input Data + Vectors → StringIndexer → Input Data + Cat ID + Vectors]
● The StringIndexer, while not an ML learning algorithm, still needs to be fit
● The Assembler is a regular transformer - no fitting required
43. Adding some ML (no longer cool -- DL)
// Specify model
val dt = new DecisionTreeClassifier()
  .setLabelCol("category-index")
  .setFeaturesCol("features")
// Add it to the pipeline
val pipeline_and_model = new Pipeline().setStages(
  Array(assembler, indexer, dt))
val pipeline_model = pipeline_and_model.fit(df)
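To close the loop, a short sketch of scoring with the fitted pipeline (testDf is an assumption):

// Runs the assembler, indexer, and tree on new data in one call.
val predictions = pipeline_model.transform(testDf)
predictions.select("category-index", "prediction").show()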
45. What does the future look like?*
*Source: https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
*Vendor benchmark. Trust but verify.
46. Arrow powered magic (numeric :p):
add = pandas_udf(lambda x, y: x + y, IntegerType())
James Willamor
47. And now we can use it for streaming too!
● Structured Streaming - new in Spark 2.0
○ Emphasis on new - be cautious when using
● New execution engine option in 2.3
● Extends the Dataset & DataFrame APIs to represent continuous tables
● Still early stages - but we now have flexibility to change engines (sort of)
48. Get a streaming dataset
// Read a streaming dataframe
val schema = new StructType()
  .add("happiness", "double")
  .add("coffees", "integer")
val streamingDS = spark
  .readStream
  .schema(schema)
  .format("parquet")
  .load(path)
[Diagram: a Dataset with isStreaming = true, backed by a streaming source]
49. Build the recipe for each query
val happinessByCoffee = streamingDS
  .groupBy($"coffees")
  .agg(avg($"happiness"))
[Diagram: the streaming Dataset feeding an Aggregate node with groupBy = “coffees”, expr = avg(“happiness”)]
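The slides stop at the recipe; a hedged sketch of actually starting the query (the sink, output mode, and checkpoint location are assumptions):

val query = happinessByCoffee
  .writeStream
  .outputMode("complete") // aggregations need complete/update output modes
  .format("console") // any supported sink works; console is handy for demos
  .option("checkpointLocation", "/tmp/happiness-checkpoint")
  .start()

query.awaitTermination()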
50. Scala might matter “less”
● I float between Python & Scala so I’ll still have a job
● But I _like_ functional programming & types
● Traditionally (for better or worse) large overhead to work in Python on distributed data
○ The overhead is quickly going down
○ As Kelly mentioned in her talk this morning, PySpark folks sometimes used to learn (some) Scala for performance -- we’ll have to offer new shiny things instead
KLMircea
51. Key takeaways
● Datasets are a functional API
○ With easier “support” for window operations and similar compared to RDDs
○ We can still sell enterprise support contracts and training to banks.
● Spark ML still uses DataFrames (no types)
○ Frameless has types for (some of) it!
○ Yes you can use deep learning with it. No, I didn’t talk about that, it’s extra.
● We have some important work to do to keep functional programming competitive with SQL in Spark.
○ And with Python, seriously.
jeffreyw
52. Learning Spark
● Fast Data Processing with Spark (out of date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Spark in Action
● High Performance Spark
● Learning PySpark
53. High Performance Spark!
Available today!
You can buy it from that scrappy Seattle bookstore; Jeff Bezos needs another newspaper and I want a cup of coffee.
http://bit.ly/hkHighPerfSpark
54. And some upcoming talks:
● June
○ Live streams (this Friday & weekly*) - follow me on twitch & YouTube
● July
○ Possible PyData Meetup in Amsterdam (tentative)
○ Curry on Amsterdam
○ OSCON Portland
● August
○ JupyterCon NYC
● September
○ Strata NYC
○ Strangeloop STL
55. k thnx bye :)
If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark
Will tweet results “eventually” @holdenkarau
Pssst: Have feedback on the presentation? Give me a shout ([email protected] or http://bit.ly/holdenTalkFeedback) if you feel comfortable doing so :)
Feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback