Scaling Spark

Dec 10, 2015

This document discusses how a media platform scaled their use of Spark across AWS to process terabytes of data daily. They moved from two on-premise clusters to running analytics and streaming workloads on AWS while keeping their core workload on-premise, initially using Spark on EMR but then self-managing Spark on EC2 for more flexibility. They implemented auto-scaling of the AWS clusters to maintain utilization targets and handle fluctuating workload demands.

Alex Rovner, Director of Data Engineering
Media Platform
Processing Terabytes Daily

PRIOR STATE
TWO CLUSTERS
CORE & ANALYTICS
BOTH IN COLO

CHALLENGES
SCALABILITY
ELASTICITY
AGILITY

SPARK
SCALABLE
FRIENDLY API
PYTHON SUPPORT

INSTANCES
D2.8XLARGE
48TB OF EPHEMERAL STORAGE
244 GB RAM
36 V-CPU

INSTANCES
WAIT, WHAT ABOUT DATA LOCALITY?

HADOOP
RUN THE LATEST VERSION!
TECH.MAGNETIC.COM

AUTO SCALE
CALCULATE CLUSTER UTILIZATION
QUERY CM API
V-CORES AVAILABLE, USED & PENDING

AUTO SCALE
CALCULATE TARGET CAPACITY
TARGET 80% UTILIZATION
LIMIT DOWNSIZING
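The two auto-scale slides above can be read as a small sizing calculation. The sketch below is one possible interpretation, not the presenter's implementation: it assumes "available" means unallocated v-cores (so total capacity is used plus available), treats pending v-cores as unmet demand, and uses hypothetical names (`target_nodes`, `max_shrink_nodes`, `vcores_per_node`) that do not appear in the deck. Only the 80% target and the idea of limiting downsizing come from the slides.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def utilization(used, pending, available):
    """Demand (used + pending v-cores) as a fraction of total v-cores.

    Assumption: 'available' is unallocated capacity, so total = used + available.
    """
    total = used + available
    if total <= 0:
        raise ValueError("cluster reports no v-cores")
    return (used + pending) / total


def target_nodes(used, pending, available, vcores_per_node,
                 target_pct=80, max_shrink_nodes=2, min_nodes=1):
    """Node count at which current demand would sit at the target utilization."""
    current_nodes = ceil_div(used + available, vcores_per_node)
    demand = used + pending
    # Capacity needed so that demand / capacity equals target_pct percent.
    wanted_vcores = ceil_div(demand * 100, target_pct)
    wanted_nodes = max(min_nodes, ceil_div(wanted_vcores, vcores_per_node))
    if wanted_nodes < current_nodes:
        # Limit downsizing: shrink by at most max_shrink_nodes per pass
        # (a hypothetical policy; the slide does not say how the limit works).
        wanted_nodes = max(wanted_nodes, current_nodes - max_shrink_nodes)
    return wanted_nodes


if __name__ == "__main__":
    # Busy cluster: 288 used, 144 pending, 72 free v-cores on 36-v-core nodes.
    print(target_nodes(288, 144, 72, vcores_per_node=36))
    # Idle cluster: the same 10 nodes with only 72 v-cores in use.
    print(target_nodes(72, 0, 288, vcores_per_node=36))
```

Scaling up is applied in full while scaling down is capped per pass, which is one way to avoid removing nodes that a bursty workload would need again a few minutes later.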

SPEED BUMPS
APPLICATION MASTER ON SPOT
YARN LABELS
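One plausible reading of this slide is that the YARN application master should not run on spot instances, since losing that node ends the whole application, and that YARN node labels are the way to keep it off them. The sketch below builds `spark-submit` arguments using the Spark-on-YARN properties `spark.yarn.am.nodeLabelExpression` and `spark.yarn.executor.nodeLabelExpression`. The label names `ondemand` and `spot` and the helper names are assumptions for illustration; the deck does not say which labels were used.

```python
def label_conf(am_label, executor_label):
    """Spark-on-YARN settings that restrict where the AM and executors may run."""
    return {
        "spark.yarn.am.nodeLabelExpression": am_label,
        "spark.yarn.executor.nodeLabelExpression": executor_label,
    }


def to_submit_args(conf):
    """Flatten a config dict into spark-submit '--conf key=value' arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", "{}={}".format(key, conf[key])]
    return args


if __name__ == "__main__":
    # Hypothetical labels: AM on on-demand nodes, executors on spot nodes.
    print(" ".join(["spark-submit"] + to_submit_args(label_conf("ondemand", "spot"))))
```

The labels themselves must already exist in YARN and be assigned to nodes; this snippet only shows the Spark side of the configuration.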

SPEED BUMPS
USERS ARE IMPATIENT
IT'S NEVER ENOUGH

SPEED BUMPS
I AM LOST!
YARN LOGS
SET YARN OVERHEAD
CHECK GC TIME
INCREASE EXECUTOR MEMORY
TRY AGAIN
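The checklist above can be sketched as a retry rule. This is an illustrative interpretation rather than the presenter's procedure: the doubling and 1.5x growth factors, the 10% GC threshold, and the function names are all assumptions. The default overhead of max(384 MB, 10% of executor memory) follows the Spark-on-YARN documentation for the memory overhead setting (`spark.yarn.executor.memoryOverhead` in Spark 1.x, later renamed `spark.executor.memoryOverhead`).

```python
def container_mb(executor_memory_mb, overhead_mb=None):
    """Memory YARN must grant per executor: heap plus off-heap overhead."""
    if overhead_mb is None:
        # Documented default: 10% of executor memory, with a 384 MB floor.
        overhead_mb = max(384, executor_memory_mb // 10)
    return executor_memory_mb + overhead_mb


def next_attempt(executor_memory_mb, overhead_mb, killed_by_yarn, gc_fraction,
                 gc_limit=0.10):
    """Pick settings for the next run from what the YARN logs and GC time show.

    killed_by_yarn: the logs show the container was killed for exceeding its
                    memory limit, which points at too little overhead.
    gc_fraction:    share of task time spent in garbage collection.
    """
    if killed_by_yarn:
        overhead_mb *= 2  # assumed growth factor
    elif gc_fraction > gc_limit:
        executor_memory_mb = executor_memory_mb * 3 // 2  # assumed growth factor
    return executor_memory_mb, overhead_mb


if __name__ == "__main__":
    memory, overhead = 8192, 819
    print(container_mb(memory))
    # Container killed by YARN: raise the overhead and try again.
    print(next_attempt(memory, overhead, killed_by_yarn=True, gc_fraction=0.0))
    # No kill, but heavy GC: raise executor memory and try again.
    print(next_attempt(memory, overhead, killed_by_yarn=False, gc_fraction=0.25))
```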

SPEED BUMPS
BROADCASTING “LARGE” DATASETS IS EVIL
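A simple guard against this speed bump is to check the estimated size of the smaller join side before broadcasting it. The sketch below is a minimal illustration with hypothetical names (`join_strategy`, `small_side_bytes`); it mirrors the Spark SQL setting `spark.sql.autoBroadcastJoinThreshold`, whose documented default is 10 MB and which disables automatic broadcasting when set to -1. What counts as "large" for a given cluster is not stated in the deck.

```python
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024  # Spark SQL's documented default


def join_strategy(small_side_bytes, threshold=DEFAULT_THRESHOLD_BYTES):
    """Return 'broadcast' only when the smaller side fits under the threshold.

    A negative threshold means broadcasting is disabled, as with
    spark.sql.autoBroadcastJoinThreshold=-1.
    """
    if threshold >= 0 and small_side_bytes <= threshold:
        return "broadcast"
    return "shuffle"


if __name__ == "__main__":
    print(join_strategy(5 * 1024 * 1024))    # small lookup table
    print(join_strategy(2 * 1024 ** 3))      # multi-gigabyte dataset
    print(join_strategy(1024, threshold=-1)) # broadcasting disabled
```

A broadcast copy is held by the driver and by every executor, so a dataset that looks small on disk can still exhaust memory once it is deserialized on each node.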

CURRENT STATE
THREE CLUSTERS
ANALYTICS & STREAMING (AWS)
CORE (COLO - MOVING SOON!)

