Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark - Bo Yang
The slides explain how shuffle works in Spark and help people understand Spark internals in more detail. They show how the major classes are implemented, including ShuffleManager (SortShuffleManager), ShuffleWriter (SortShuffleWriter, BypassMergeSortShuffleWriter, UnsafeShuffleWriter), and ShuffleReader (BlockStoreShuffleReader).
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro - Databricks
Zstandard is a fast compression algorithm that you can use in Apache Spark in various ways. In this talk, I briefly summarize how Apache Spark has evolved in this area, four main use cases, their benefits, and the next steps (a configuration sketch for the shuffle and event-log cases follows the list):
1) ZStandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It’s beneficial not only when you use `emptyDir` with `memory` medium, but it also maximizes the OS cache benefit when you use shared SSDs or container local storage. In Spark 3.2, SPARK-34390 takes advantage of ZStandard's buffer pool feature, and its performance gain is impressive, too.
2) Event log compression is another area where you can save storage cost on cloud storage like S3 and improve usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 already supports Zstandard, and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There is more community work underway to utilize Zstandard to improve Spark. For example, the Apache Avro community also supports Zstandard, and SPARK-34479 aims to support Zstandard in Spark’s Avro data source in Spark 3.2.0.
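As referenced above, here is a minimal configuration sketch for the shuffle and event-log use cases. The property names are standard Spark settings, but availability of the event-log codec key depends on your Spark version, and the values are illustrative only:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("zstd-config-sketch")
    # Compress shuffle spill/output blocks with Zstandard (use case 1).
    .config("spark.io.compression.codec", "zstd")
    # Compress event logs with Zstandard (use case 2); LZ4 was the older default.
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.compress", "true")
    .config("spark.eventLog.compression.codec", "zstd")
    .getOrCreate()
)
```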
Cosco: An Efficient Facebook-Scale Shuffle Service - Databricks
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... - Databricks
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL, spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia - Databricks
Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you’ll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workloads. You’ll also learn about Facebook’s new efforts toward automatically tuning several important configurations based on the nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.
Magnet Shuffle Service: Push-based Shuffle at LinkedIn - Databricks
The number of daily Apache Spark applications at LinkedIn has increased by 3X in the past year. The shuffle process alone, which is one of the most costly operators in batch computation, is processing PBs of data and billions of blocks daily in our clusters. With such a rapid increase of Apache Spark workloads, we quickly realized that the shuffle process can become a severe bottleneck for both infrastructure scalability and workloads efficiency. In our production clusters, we have observed both reliability issues due to shuffle fetch connection failures and efficiency issues due to the random reads of small shuffle blocks on HDDs.
To tackle those challenges and optimize shuffle performance in Apache Spark, we have developed Magnet shuffle service, a push-based shuffle mechanism that works natively with Apache Spark. Our paper on Magnet has been accepted by VLDB 2020. In this talk, we will introduce how push-based shuffle can drastically increase shuffle efficiency when compared with the existing pull-based shuffle. In addition, by combining push-based shuffle and pull-based shuffle, we show how Magnet shuffle service helps to harden shuffle infrastructure at LinkedIn scale by both reducing shuffle related failures and removing scaling bottlenecks. Furthermore, we will share our experiences of productionizing Magnet at LinkedIn to process close to 10 PB of daily shuffle data.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
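For orientation, here is a hedged sketch of a few commonly tuned shuffle-related settings of the kind such material covers; the values are placeholders, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-config-sketch")
    # Serve shuffle files from the external shuffle service instead of executors.
    .config("spark.shuffle.service.enabled", "true")
    # Compress map output files written to disk.
    .config("spark.shuffle.compress", "true")
    # Buffer size used by shuffle file writers.
    .config("spark.shuffle.file.buffer", "64k")
    # Amount of data each reducer fetches concurrently.
    .config("spark.reducer.maxSizeInFlight", "96m")
    # Number of reduce-side partitions for DataFrame/SQL shuffles.
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)
```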
The Parquet Format and Performance Optimization Opportunities - Databricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
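To make the partitioning and pushdown ideas concrete, here is a small, hedged PySpark sketch; the paths and column names are invented for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parquet-sketch").getOrCreate()
events = (spark.range(1_000_000)
          .withColumn("country", F.lit("NL"))
          .withColumn("amount", F.rand()))

# Partition the output by a low-cardinality column so readers can prune directories.
events.write.mode("overwrite").partitionBy("country").parquet("/tmp/events_parquet")

# The filter below can be answered with partition pruning plus row-group min/max
# statistics, so only a fraction of the files and pages need to be read.
nl_big = (
    spark.read.parquet("/tmp/events_parquet")
    .where((F.col("country") == "NL") & (F.col("amount") > 0.9))
)
nl_big.explain()  # look for PartitionFilters and PushedFilters in the scan node
```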
Deep Dive: Memory Management in Apache Spark - Databricks
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
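For orientation, here is a hedged sketch of the unified-memory-manager knobs such talks discuss; the values are illustrative, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-config-sketch")
    # Fraction of (heap - 300MB) shared by execution and storage.
    .config("spark.memory.fraction", "0.6")
    # Portion of the unified region protected for cached blocks before eviction.
    .config("spark.memory.storageFraction", "0.5")
    # Optional off-heap execution memory (Tungsten).
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "2g")
    .getOrCreate()
)
```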
In Spark SQL the physical plan provides the fundamental information about the execution of the query. The objective of this talk is to convey understanding and familiarity of query plans in Spark SQL, and use that knowledge to achieve better performance of Apache Spark queries. We will walk you through the most common operators you might find in the query plan and explain some relevant information that can be useful in order to understand some details about the execution. If you understand the query plan, you can look for the weak spot and try to rewrite the query to achieve a more optimal plan that leads to more efficient execution.
The main content of this talk is based on Spark source code but it will reflect some real-life queries that we run while processing data. We will show some examples of query plans and explain how to interpret them and what information can be taken from them. We will also describe what is happening under the hood when the plan is generated focusing mainly on the phase of physical planning. In general, in this talk we want to share what we have learned from both Spark source code and real-life queries that we run in our daily data processing.
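A minimal example of inspecting a physical plan in PySpark; the DataFrames are stand-ins, and the "formatted" mode requires Spark 3.0 or later:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("explain-sketch").getOrCreate()

orders = (spark.range(10_000)
          .withColumnRenamed("id", "order_id")
          .withColumn("cust_id", F.col("order_id") % 100))
customers = spark.range(100).withColumnRenamed("id", "cust_id")

joined = orders.join(customers, "cust_id").groupBy("cust_id").count()

# Physical plan only (look for Exchange, SortMergeJoin/BroadcastHashJoin, HashAggregate).
joined.explain()
# Spark 3.0+: a more readable, sectioned view of the same plan.
joined.explain(mode="formatted")
```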
Hive Bucketing in Apache Spark with Tejas Patil - Databricks
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to improve Spark performance by 3-5x when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in 2-3x savings compared to Hive. You’ll also hear about real-world applications of bucketing, like loading of cumulative tables with daily deltas, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
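A hedged sketch of writing a bucketed, sorted table from Spark follows; the table and column names are invented, and bucketed output must go through saveAsTable so the bucketing metadata lands in the catalog:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bucketing-sketch").getOrCreate()

clicks = spark.range(1_000_000).withColumn("user_id", F.col("id") % 50_000)

# One-time cost: cluster and sort the data by the join key while writing it out.
(clicks.write
    .bucketBy(128, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("clicks_bucketed"))

# If another table is bucketed and sorted the same way on user_id, a join between
# the two can skip the shuffle and sort of a regular sort-merge join.
```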
Parquet Strata/Hadoop World, New York 2013 - Julien Le Dem
Parquet is a columnar storage format for Hadoop data. It was developed collaboratively by Twitter and Cloudera to address the need for efficient analytics on large datasets. Parquet provides more efficient compression and I/O compared to row-based formats by only reading and decompressing the columns needed by a query. It has been adopted by many companies for analytics workloads involving terabytes to petabytes of data. Parquet is language-independent and supports integration with frameworks like Hive, Pig, and Impala. It provides significant performance improvements and storage savings compared to traditional row-based formats.
This talk will break down merge in Delta Lake—what is actually happening under the hood—and then explain how you can optimize a merge. Some code snippets and sample configs will also be shared.
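For readers unfamiliar with the operation, here is a minimal upsert sketch using the open-source Delta Lake Python API; the delta-spark package must be on the classpath, the paths and columns are placeholders, and this is not the talk's own code:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# These two settings enable Delta's SQL extensions and catalog support.
spark = (
    SparkSession.builder.appName("delta-merge-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

target = DeltaTable.forPath(spark, "/tmp/delta/customers")      # placeholder path
updates = spark.read.parquet("/tmp/staging/customer_updates")   # placeholder path

# Upsert: update rows whose key matches, insert the rest.
(target.alias("t")
    .merge(updates.alias("u"), "t.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```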
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P... - Databricks
The document discusses optimizations made to Spark SQL performance when working with Parquet files at ByteDance. It describes how Spark originally reads Parquet files and identifies two main areas for optimization: Parquet filter pushdown and the Parquet reader. For filter pushdown, sorting the data on frequently filtered columns improved min/max statistics and reduced data reads by 30%. For the reader, splitting it so that filter columns are read and evaluated first, with the remaining columns read only for matching rows, prevented loading unnecessary data. These changes improved Spark SQL performance at ByteDance without requiring changes to user jobs.
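The "sort to improve statistics" idea can be approximated in plain Spark as follows; this is a generic, hedged sketch with invented column names, not ByteDance's implementation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sorted-parquet-sketch").getOrCreate()
logs = spark.read.parquet("/tmp/raw_logs")  # placeholder input

# Sorting within each output file keeps the frequently filtered column's values
# clustered, so Parquet row-group min/max statistics become far more selective
# and filter pushdown can skip most row groups.
(logs.repartition("app_id")
     .sortWithinPartitions("app_id", "event_time")
     .write.mode("overwrite")
     .parquet("/tmp/sorted_logs"))
```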
Adaptive Query Execution: Speeding Up Spark SQL at Runtime - Databricks
Over the years, there has been extensive and continuous effort on improving Spark SQL’s query optimizer and planner, in order to generate high quality query execution plans. One of the biggest improvements is the cost-based optimization framework that collects and leverages a variety of data statistics (e.g., row count, number of distinct values, NULL values, max/min values, etc.) to help Spark make better decisions in picking the most optimal query plan.
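A hedged configuration sketch for enabling AQE and its main sub-features in Spark 3.x; the names are standard settings, but defaults and behavior vary by version:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-config-sketch")
    # Re-optimize query plans at runtime using shuffle statistics.
    .config("spark.sql.adaptive.enabled", "true")
    # Coalesce small post-shuffle partitions into fewer, larger ones.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split skewed partitions in sort-merge joins.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # Rough target size for post-shuffle partitions (illustrative value).
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
    .getOrCreate()
)
```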
Apache Spark in Depth: Core Concepts, Architecture & Internals - Anton Kirillov
The slides cover core concepts of Apache Spark such as RDDs, the DAG, the execution workflow, how stages of tasks are formed, and the shuffle implementation, and also describe the architecture and main components of the Spark driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo that contains example Spark applications and a dockerized Hadoop environment to experiment with.
Spark supports four cluster managers: Local, Standalone, YARN, and Mesos. YARN is highly recommended for production use. When running Spark on YARN, careful tuning of configuration settings like the number of executors, executor memory and cores, and dynamic allocation is important to optimize performance and resource utilization. Configuring queues also allows separating different applications by priority and resource needs.
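As a rough, hedged sketch of the kind of sizing the summary refers to, expressed as session configs; the numbers are placeholders that must be derived from your node sizes and workload, not copied:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("yarn-sizing-sketch")
    .master("yarn")
    .config("spark.executor.instances", "50")      # or rely on dynamic allocation
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "12g")
    .config("spark.executor.memoryOverhead", "2g")
    # Dynamic allocation (typically paired with the external shuffle service).
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "200")
    # Submit to a dedicated YARN queue to isolate this workload.
    .config("spark.yarn.queue", "analytics")
    .getOrCreate()
)
```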
Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.
You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.
By using the Spark UI and simple metrics, explore how to diagnose and remedy issues in jobs (a combined configuration sketch follows this list):
Sizing the cluster based on your dataset (shuffle partitions)
Ingestion challenges – well begun is half done (globbing S3, small files)
Managing memory (sorting GC – when to go parallel, when to go G1, when offheap can help you)
Shuffle (give a little to get a lot – configs for better out of box shuffle) – Spill (partitioning for the win)
Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
Caching and persistence (it’s the cost of doing business, so what are your options?)
Fault tolerance (blacklisting, speculation, task reaping)
Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)
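As promised above, here is a combined, hedged configuration sketch touching a few of the list items (shuffle partitions, GC, scheduling, fault tolerance); the values are placeholders, and spark.blacklist.* was renamed to spark.excludeOnFailure.* in Spark 3.1:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cloud-tuning-sketch")
    # Shuffle partitions sized to the dataset (placeholder value).
    .config("spark.sql.shuffle.partitions", "2000")
    # G1 garbage collector for large executor heaps.
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    # FAIR scheduling if multiple jobs share the application.
    .config("spark.scheduler.mode", "FAIR")
    # Fault tolerance: retry stragglers and avoid repeatedly failing nodes.
    .config("spark.speculation", "true")
    .config("spark.blacklist.enabled", "true")
    .getOrCreate()
)
```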
Implementing an efficient Spark application with the goal of maximal performance often requires knowledge that goes beyond the official documentation. Understanding Spark’s internal processes and features may help you design queries in alignment with internal optimizations and thus achieve high efficiency during execution. In this talk we will focus on some internal features of Spark SQL that are not well described in the official documentation, with a strong emphasis on explaining these features through some basic examples while sharing some performance tips along the way.
Apache Spark on K8S Best Practice and Performance in the Cloud - Databricks
As of Spark 2.3, Spark can run on clusters managed by Kubernetes. We will describe best practices for running Spark SQL on Kubernetes on Tencent Cloud, including how to deploy Kubernetes on a public cloud platform to maximize resource utilization and how to tune Spark configurations to take advantage of the Kubernetes resource manager for best performance. To evaluate performance, the TPC-DS benchmark will be used to analyze the performance impact of queries across configuration sets.
Speakers: Junjie Chen, Junping Du
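A minimal, hedged sketch of pointing a Spark session at a Kubernetes cluster in client mode; the API server address, image, namespace, and service account are placeholders, and production jobs are usually submitted with spark-submit in cluster mode instead:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-on-k8s-sketch")
    # Kubernetes API server (placeholder address).
    .master("k8s://https://kube-apiserver.example.com:6443")
    .config("spark.kubernetes.container.image", "example.registry/spark:3.1.1")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .config("spark.executor.instances", "10")
    .getOrCreate()
)
```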
What’s New in the Upcoming Apache Spark 3.0 - Databricks
Learn about the latest developments in the open-source community with Apache Spark 3.0 and DBR 7.0. The upcoming Apache Spark™ 3.0 release brings new capabilities and features to the Spark ecosystem. In this online tech talk from Databricks, we will walk through updates in the Apache Spark 3.0.0-preview2 release as part of our new Databricks Runtime 7.0 Beta, which is now available.
Learn how to properly shape partitions and jobs to enable powerful optimizations, eliminate skew, and maximize cluster utilization. We will explore various Spark partition-shaping methods along with several optimization strategies, including join optimizations, aggregate optimizations, salting, and multi-dimensional parallelism.
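A small, hedged DataFrame-level illustration of partition shaping; the paths and column names are invented for the example:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-shaping-sketch").getOrCreate()
df = spark.read.parquet("/tmp/wide_table")  # placeholder input

# Shape partitions before an expensive aggregate: co-locate rows by the group key
# so the aggregation shuffle produces evenly sized, right-sized partitions.
shaped = df.repartition(800, "customer_id")
result = shaped.groupBy("customer_id").agg(F.sum("amount").alias("total"))

# Shrink the partition count before writing to avoid a flood of tiny output files.
result.coalesce(64).write.mode("overwrite").parquet("/tmp/totals")
```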
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro... - Spark Summit
This document summarizes Project Tungsten, an effort by Databricks to substantially improve the memory and CPU efficiency of Spark applications. It discusses how Tungsten optimizes memory and CPU usage through techniques like explicit memory management, cache-aware algorithms, and code generation. It provides examples of how these optimizations improve performance for aggregation queries and record sorting. The roadmap outlines expanding Tungsten's optimizations in Spark 1.4 through 1.6 to support more workloads and achieve end-to-end processing using binary data representations.
Parquet performance tuning: the missing guide - Ryan Blue
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
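A hedged sketch of adjusting row-group and dictionary sizes when writing Parquet from Spark; parquet.block.size and parquet.dictionary.page.size are parquet-mr writer properties passed through the Hadoop configuration, and the values are illustrative only:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-writer-tuning-sketch")
    # Smaller row groups give finer-grained min/max skipping.
    .config("spark.hadoop.parquet.block.size", str(64 * 1024 * 1024))
    # Larger dictionary pages reduce fallback from dictionary encoding.
    .config("spark.hadoop.parquet.dictionary.page.size", str(4 * 1024 * 1024))
    .getOrCreate()
)

df = spark.read.parquet("/tmp/input")  # placeholder
df.write.mode("overwrite").parquet("/tmp/tuned_output")
```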
The document provides an overview of Apache Spark internals and Resilient Distributed Datasets (RDDs). It discusses the following (a short RDD sketch follows the list):
- RDDs are Spark's fundamental data structure - they are immutable distributed collections that allow transformations like map and filter to be applied.
- RDDs track their lineage or dependency graph to support fault tolerance. Transformations create new RDDs while actions trigger computation.
- Operations on RDDs include narrow transformations like map that don't require data shuffling, and wide transformations like join that do require shuffling.
- The RDD abstraction allows Spark's scheduler to optimize execution through techniques like pipelining and cache reuse.
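A minimal RDD sketch of the narrow-versus-wide distinction described above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["a b a", "b c", "a c c"])

# Narrow transformations: each output partition depends on one input partition,
# so no shuffle is needed and they pipeline within a single stage.
words = lines.flatMap(lambda line: line.split()).map(lambda w: (w, 1))

# Wide transformation: grouping by key requires a shuffle and starts a new stage.
counts = words.reduceByKey(lambda a, b: a + b)

print(counts.collect())  # e.g. [('b', 2), ('c', 3), ('a', 3)]
# Lineage (the dependency graph used for fault tolerance):
print(counts.toDebugString().decode())
```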
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... - Databricks
Nowadays, people are creating, sharing, and storing data at a faster pace than ever before, so effective data compression/decompression can significantly reduce the cost of data usage. Apache Spark is a general distributed computing engine for big data analytics that stores and shuffles large amounts of data across the cluster at runtime, so the compression/decompression codecs can impact end-to-end application performance in many ways.
However, there’s a trade-off between storage size and compression/decompression throughput (CPU computation). Balancing compression speed and ratio is a very interesting topic, particularly while both software algorithms and the CPU instruction set keep evolving. Apache Spark provides a flexible compression codec interface with default implementations like GZip, Snappy, LZ4, and ZSTD, and the Intel Big Data Technologies team has also implemented more codecs for Apache Spark based on the latest Intel platforms, such as ISA-L (igzip), LZ4-IPP, Zlib-IPP, and ZSTD. In this session, we compare the characteristics of those algorithms and implementations by running both micro workloads and end-to-end workloads on different generations of Intel x86 platforms and disks.
The goal is to help big data software engineers choose the proper compression/decompression codecs for their applications, and we will also present methodologies for measuring and tuning performance bottlenecks in typical Apache Spark workloads.
An Optimized Diffusion Depth Of Field Solver - Holger Gruen
The document summarizes an optimized diffusion depth of field (DDOF) solver that is faster and uses less memory than previous solvers. It recaps DDOF and earlier solvers, describes optimizations to a vanilla cyclic reduction solver including reducing resolution and reusing data between passes, and shows final results demonstrating improved performance and reduced memory usage compared to prior work.
CS 542 Putting it all together -- Storage Management - J Singh
The document provides an overview and plan for a lecture on database management systems. Key points include:
- By the second break, the lecture will cover storage hierarchies, secondary storage management, and system catalogs.
- After the second break, the topics will include data modeling and storage hierarchies.
- Storage hierarchies involve multiple storage levels from main memory to disk and beyond. The cost and performance of each level differs.
- Techniques like caching aim to keep frequently used data in faster storage levels like memory.
This document summarizes key Hadoop configuration parameters that affect MapReduce job performance and provides suggestions for optimizing these parameters under different conditions. It describes the MapReduce workflow and phases, defines important parameters like dfs.block.size, mapred.compress.map.output, and mapred.tasktracker.map/reduce.tasks.maximum. It explains how to configure these parameters based on factors like cluster size, data and task complexity, and available resources. The document also discusses other performance aspects like temporary space, JVM tuning, and reducing reducer initialization overhead.
This document provides performance optimization tips for Hadoop jobs, including recommendations around compression, speculative execution, number of maps/reducers, block size, sort size, JVM tuning, and more. It suggests how to configure properties like mapred.compress.map.output, mapred.map/reduce.tasks.speculative.execution, and dfs.block.size based on factors like cluster size, job characteristics, and data size. It also identifies antipatterns to avoid like processing thousands of small files or using many maps with very short runtimes.
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ... - Databricks
This document summarizes a presentation about using the Crail distributed storage system to improve Spark performance on high-performance computing clusters with RDMA networking and NVMe flash storage. The key points are:
1) Traditional Spark storage and networking APIs do not bypass the operating system kernel, limiting performance on modern hardware.
2) The Crail system provides user-level APIs for RDMA networking and NVMe flash to improve Spark shuffle, join, and sorting workloads by 2-10x on a 128-node cluster.
3) Crail allows Spark workloads to fully utilize high-speed networks and disaggregate memory and flash storage across nodes without performance penalties.
Programming Languages & Tools for Higher Performance & Productivity - Linaro
By Hitoshi Murai, RIKEN AICS
For higher performance and productivity of HPC systems, it is important to provide users with good programming environment including languages, compilers, and tools. In this talk, the programming model of the post-K supercomputer will be shown.
Hitoshi Murai Bio
Hitoshi Murai received a master's degree in information science from Kyoto University in 1996. He worked as a software developer in NEC from 1996 to 2010. He received a Ph.D degree in computer science from University of Tsukuba in 2010. He is currently a research scientist of the programming environment research team and the Flagship 2020 project in Advanced Institute for Computational Science, RIKEN. His research interests include compilers and parallel programming languages.
Email
[email protected]
For more info on The Linaro High Performance Computing (HPC) visit https://www.linaro.org/sig/hpc/
This document summarizes Chris Fregly's presentation on how Apache Spark beat Hadoop at sorting 100 TB of data. Key points include:
- Spark set a new record in the Daytona GraySort benchmark by sorting 100 TB of data in 23 minutes using 250,000 partitions on EC2.
- Optimizations that contributed to Spark's win included using CPU cache locality with (Key, Pointer) pairs, an optimized sorting algorithm, reducing network overhead with Netty, and reducing OS resources with a sort-based shuffle.
- The sort-based shuffle merges mapper outputs into a single file per partition to minimize disk seeks during the shuffle.
The document discusses computer storage media, including definitions, units of measurement, and examples. It begins with an activity asking students to classify common peripherals as input, output, or I/O devices. It then defines media storage and lists examples like USB flash drives, CDs, and DVDs. The document explains that the basic unit of digital storage is the bit and the common unit is the byte. Conversions between KB, MB, GB, and TB are provided. Finally, common storage media are described, such as hard drives, floppy disks, CDs, DVDs, flash drives, and memory cards. Exercises ask students to complete tables, choose appropriate storage media for files, and transfer folders between computers.
HDT for Mainframe Considerations: Simplified Tiered Storage - Hitachi Vantara
Hitachi Dynamic Tiering for Mainframe (HDT) allows data to be automatically spread across storage tiers to optimize performance and capacity. With HDT, existing SMS provisioning can be aligned to tiered storage pools, reducing storage group complexities. HDT also improves flexibility by dynamically placing application data sets across physical disks based on performance needs without requiring storage administrators to manually migrate data.
Storeconfigs is not a popular feature among Puppet admins, because most don’t know how to use it or fear performance issues. Attend this talk to know how to enhance your Puppet deployments with easy cross-nodes interactions and collaborations, while conserving system efficiency.
These slides are from a recent talk I gave at Lawrence Livermore Labs.
The talk gives an architectural outline of the MapR system and then discusses how this architecture facilitates large scale machine learning algorithms.
This document discusses tuning Hadoop performance for a case study on a recommendation engine model trainer using maximum entropy. It presents the baseline performance of the model trainer on Hadoop and then evaluates various tuning techniques including changing the number of maps and reduces, adding a combiner, increasing memory buffers to reduce disk spill, and compressing map output. The results show that these tuning techniques can significantly improve the performance and reduce the execution time of the model trainer application on Hadoop.
SOS: Optimizing Shuffle I/O with Brian Cho and Ergin Seyfe - Databricks
The document summarizes the SOS technique for optimizing shuffle I/O in distributed computing frameworks. SOS merges small intermediate data files from map tasks into larger files to reduce the number and fragmentation of shuffle fetch requests. When deployed at Facebook scale, SOS reduced shuffle I/O by 7.5x and disk service time by 2x, while increasing average I/O size by 2.5x. These I/O optimizations translated to an overall 10% reduction in reserved CPU time for jobs.
This document discusses two methods for measuring Firebird disk I/O: 1) Using the MON$IO_STATS tables within Firebird to track page reads, writes, fetches, and marks, and 2) Using the host operating system's performance monitoring tools like Windows Performance Monitor. It notes some limitations with MON$IO_STATS and provides examples of specific disk and process counters to log. The document also covers estimating required IOPS based on potential disk throughput and accounting for factors like RAID write penalties.
Processing Big Data Quickly and Efficiently with Apache Nemo
- Wonook Song, Youngseok Yang (Software Platform Lab, Department of Computer Science and Engineering, Seoul National University)
Overview #
Apache Nemo is a system that optimizes how big data applications are executed in a distributed manner, adapting to diverse resource environments and data characteristics. In scenarios involving geo-distributed resources, transient resources, large data shuffle, and skewed data, Apache Nemo shows significantly higher performance than Apache Spark.
Contents #
Apache Nemo optimization case studies
Apache Nemo's distributed execution process
Future research directions
Skew Mitigation For Facebook PetabyteScale Joins - Databricks
Uneven distribution of input (or intermediate) data can often cause skew in joins. In Spark, this leads to very slow join stages where a few straggling tasks may take forever to finish. At Facebook, where Spark jobs shuffle hundreds of petabytes of aggregate data per day, skew in data exacerbates runtime latencies further to the order of multiple hours and even days. Over the course of last year, we introduced several state-of-art skew mitigation techniques from traditional databases that reduced query runtimes by more than 40%, and expanded Spark adoption for numerous latency sensitive pipelines. In this talk, we’ll take a deep dive into Spark’s execution engine and share how we’re gradually solving the data skew problem at scale. To this end, we’ll discuss several catalyst optimizations around implementing a hybrid skew join in Spark (that broadcasts uncorrelated skewed keys and shuffles non-skewed keys), describe our approach of extending this idea to efficiently identify (and broadcast) skewed keys adaptively at runtime, and discuss CPU vs. IOPS trade-offs around how these techniques interact with Cosco: Facebook’s petabyte-scale shuffle service (https://maxmind-databricks.pantheonsite.io/session/cosco-an-efficient-facebook-scale-shuffle-service).
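The talk is about adaptive, Catalyst-level skew handling; as a much simpler generic illustration of the underlying salting idea (not the implementation described in the talk), a manual DataFrame-level sketch might look like the following, with invented table paths and a join_key column:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting-sketch").getOrCreate()
SALT_BUCKETS = 16

facts = spark.read.parquet("/tmp/facts")   # large table, skewed on join_key (placeholder)
dims = spark.read.parquet("/tmp/dims")     # smaller dimension table (placeholder)

# Spread each hot key across SALT_BUCKETS sub-keys on the large side...
salted_facts = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# ...and replicate every dimension row once per salt value so the join still matches.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
salted_dims = dims.crossJoin(salts)

joined = salted_facts.join(salted_dims, ["join_key", "salt"]).drop("salt")
```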
London Spark Meetup Project Tungsten Oct 12 2015 - Chris Fregly
Building on a previous talk about how Spark beat Hadoop @ 100TB Daytona GraySort, we present low-level details of Project Tungsten which includes many CPU and Memory optimizations.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 - Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 - Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop - Databricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform - Databricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark (a minimal sketch follows this list)
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
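As a generic illustration of the kind of Spark-based validation the list refers to (not Zillow's platform code), a minimal null-rate expectation might look like this, with a placeholder dataset and column:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-check-sketch").getOrCreate()
df = spark.read.parquet("/tmp/listings")  # placeholder dataset

# Expectation (illustrative): 'price' must be at least 99% non-null.
total = df.count()
non_null = df.filter(F.col("price").isNotNull()).count()
non_null_rate = non_null / total if total else 0.0

if non_null_rate < 0.99:
    # In a real platform this would surface as a failed expectation/metric for the producer.
    raise ValueError(f"price non-null rate {non_null_rate:.4f} is below the expected 0.99")
```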
Learn to Use Databricks for Data Science - Databricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring - Databricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, I led a team that built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM not to be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix - Databricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration - Databricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference, and then send the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training with the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
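As a rough sketch of the API this talk covers, here is what stage level scheduling looks like from PySpark 3.1; the resource amounts and the per-partition training function are placeholders, and the feature additionally assumes dynamic allocation with a cluster manager (YARN or Kubernetes) that can provide GPU containers:

    from pyspark.sql import SparkSession
    from pyspark.resource import (ExecutorResourceRequests,
                                  TaskResourceRequests,
                                  ResourceProfileBuilder)

    spark = SparkSession.builder.appName("stage-level-scheduling-sketch").getOrCreate()

    # ETL stage: runs with the application's default (CPU-only) resource profile.
    etl_rdd = spark.sparkContext.parallelize(range(1_000_000)).map(lambda x: (x % 10, x))

    # Ask for GPU-equipped executors only for the training stage.
    exec_reqs = ExecutorResourceRequests().cores(4).memory("16g").resource("gpu", 1)
    task_reqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
    gpu_profile = ResourceProfileBuilder().require(exec_reqs).require(task_reqs).build

    def train_partition(rows):
        # placeholder for the per-partition Deep Learning step
        yield sum(v for _, v in rows)

    # Stages produced from this point on run on containers that match gpu_profile.
    result = etl_rdd.withResources(gpu_profile).mapPartitions(train_partition).collect()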
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in Parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduce data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify this tedious data conversion process. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
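For context, here is a minimal sketch of the converter flow the talk describes, using Petastorm's Spark Dataset Converter; the toy columns, cache location, and the commented-out training call are placeholders:

    from pyspark.sql import SparkSession
    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    spark = SparkSession.builder.getOrCreate()

    # Petastorm materializes the DataFrame under this cache directory.
    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
                   "file:///tmp/petastorm_cache")

    df = spark.range(10_000).selectExpr("CAST(id AS FLOAT) AS feature",
                                        "CAST(id % 2 AS BIGINT) AS label")
    converter = make_spark_converter(df)

    # The converter hands back a tf.data.Dataset of named batches.
    with converter.make_tf_dataset(batch_size=64) as dataset:
        dataset = dataset.map(lambda batch: (batch.feature, batch.label))
        # model.fit(dataset, steps_per_epoch=10, epochs=1)  # placeholder Keras model

    converter.delete()  # remove the cached copy when done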
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both machine learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes with Apache Spark's scalable data processing, you can run data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered: understanding key traits of Apache Spark on Kubernetes; things to know when running Apache Spark on Kubernetes, such as autoscaling; and a demonstration of analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster.
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
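The pipeline abstraction described in the talk is its own library and is not reproduced here; purely to illustrate how fit/transform stages map onto Ray's compute model, here is a minimal sketch with plain Ray tasks and trivial placeholder stage bodies:

    import ray

    ray.init(ignore_reinit_error=True)

    @ray.remote
    def fit(stage_name, data):
        # placeholder "fit": learn a trivial statistic for this stage
        return {"stage": stage_name, "mean": sum(data) / len(data)}

    @ray.remote
    def transform(model, data):
        # placeholder "transform": apply the fitted statistic
        return [x - model["mean"] for x in data]

    data = list(range(100))
    model_ref = fit.remote("scaler", data)        # runs as a Ray task
    out_ref = transform.remote(model_ref, data)   # Ray resolves the dependency before running
    print(ray.get(out_ref)[:5])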
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not abelian groups and that operate over change data.
We want to present multiple anti-patterns for utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark. A minimal distributed-counter sketch follows the outline below.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming?
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
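As flagged above, here is a minimal sketch of the distributed-counter niche using redis-py inside foreachPartition; the Redis endpoint, key names, and the first-attempt-wins guard against retries and speculative execution are illustrative choices, not Adobe's implementation:

    import redis
    from pyspark import TaskContext
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)

    REDIS_HOST = "redis.example.internal"   # hypothetical endpoint
    COUNTER_HASH = "job:12345:counters"     # hypothetical key

    def count_partition(rows):
        ctx = TaskContext.get()
        r = redis.Redis(host=REDIS_HOST, port=6379)
        n = sum(1 for _ in rows)
        # Guard against retries / speculative execution: only the first attempt
        # of each partition publishes its count (SET NX acts as the lock).
        if r.set(f"{COUNTER_HASH}:done:{ctx.partitionId()}", "1", nx=True, ex=3600):
            pipe = r.pipeline()             # pipelining cuts round trips
            pipe.hincrby(COUNTER_HASH, "rows", n)
            pipe.execute()

    df.rdd.foreachPartition(count_partition)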
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
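For a feel of what a whylogs profile looks like, here is a minimal single-batch sketch, assuming whylogs 1.x and pandas; the Spark integration discussed in the talk has its own entry points and is not shown here:

    import pandas as pd
    import whylogs as why   # assumes whylogs >= 1.0

    # Profile one batch; in a pipeline you would profile each micro-batch or
    # partition and merge the resulting profiles for monitoring over time.
    df = pd.DataFrame({"amount": [10.5, 3.2, 99.0], "country": ["US", "DE", "US"]})
    profile_view = why.log(df).view()

    # Lightweight statistical summary (counts, types, distribution sketches) per column.
    print(profile_view.to_pandas())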
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
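To make the operator-transformation idea concrete (this is a toy illustration, not Raven's actual rewrite rules), here is a hypothetical depth-two decision tree expressed as a Spark SQL expression, so a filter on the prediction becomes an ordinary Catalyst predicate:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(25.0, 0), (60.0, 2)], ["age", "num_claims"])

    # Hypothetical tree: age < 40 -> 0.1; else num_claims < 1 -> 0.3; else 0.8
    prediction = (F.when(F.col("age") < 40, F.lit(0.1))
                   .when(F.col("num_claims") < 1, F.lit(0.3))
                   .otherwise(F.lit(0.8)))

    # The "model" is now a relational expression, so predicates over it can be
    # optimized and pushed down like any other part of the query plan.
    df.withColumn("prediction", prediction).filter(F.col("prediction") > 0.5).show()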
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a complex ingestion pipeline handling a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi-Source, Multi-Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade-offs with Various Formats
Anti-patterns used (String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe Them Away
Staging Tables FTW
Data Lake Replication Lag Tracking
Performance Time!
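As a generic illustration of the staging-table pattern named above (not Adobe's actual pipeline), here is a minimal Delta Lake sketch assuming the delta-spark package is configured; the paths and merge key are hypothetical:

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    MAIN, STAGING = "/delta/profiles", "/delta/profiles_staging"   # hypothetical paths

    updates = spark.read.format("delta").load(STAGING)   # records landed in staging
    target = DeltaTable.forPath(spark, MAIN)

    # Merge the staged records into the main table, keyed on the profile id.
    (target.alias("t")
           .merge(updates.alias("s"), "t.profile_id = s.profile_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())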
5. Why should you care?
▪ IO efficiency
  ▪ Cosco is a service that improves IO efficiency (disk service time) by 3x for shuffle data
▪ Compute efficiency
  ▪ Flash supports more workload with less Cosco hardware
▪ Query latency is less of a focus
  ▪ Cosco helps shuffle-heavy queries, but query latency has not been our focus. We have been focused on batch workloads.
  ▪ Flash unlocks new possibilities to improve query latency, but that is future work
▪ Techniques for development and analysis
  ▪ Hopefully, some of these are applicable outside of Cosco
7. Spark Shuffle Recap
[Diagram: Mappers (Map 0 ... Map m) write partitioned map output files to disk/DFS; Reducers (Reduce 0 ... Reduce r) read them]
Map output files are written to local storage or a distributed filesystem.
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
8. Spark Shuffle Recap
[Diagram: the same mappers / map output files / reducers picture]
Reducers pull from the map output files.
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
9. Spark Shuffle Recap
[Diagram: the same picture; each reducer sorts the pulled data by key and exposes it as an iterator]
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
10. Spark Shuffle Recap
[Diagram: the same picture, highlighting the write amplification problem]
Write amplification is ~3x.
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
11. Spark Shuffle Recap
[Diagram: the same picture, highlighting the small-IOs problem]
Write amplification is ~3x, and there is also a small-IOs problem: M x R reads with an average IO size of ~200 KiB.
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
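To make the write-amplification and small-IO numbers concrete, here is a toy calculation; the mapper/reducer counts and shuffle volume below are invented for illustration, and only the ~3x and ~200 KiB figures come from the slides:

    # Hypothetical job, for illustration only (not production numbers).
    mappers, reducers = 10_000, 10_000
    shuffle_bytes = 20 * 2**40                 # 20 TiB of shuffle data

    reads = mappers * reducers                 # every reducer reads from every mapper
    avg_io = shuffle_bytes / reads             # bytes per read
    print(f"{reads:,} reads, avg IO ~ {avg_io / 2**10:.0f} KiB")   # ~215 KiB

    # With ~3x write amplification, each shuffled byte hits the drive about three times.
    print("bytes written ~", 3 * shuffle_bytes)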
12-15. Spark Shuffle Recap
[Simplified drawing, built up over several slides: Mappers (Map 1 ... Map m) write map output files to disk/DFS; Reducers (Reduce 1 ... Reduce r) pull from the map output files and sort by key]
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
16. Cosco Shuffle for Spark
[Diagram: Mappers (Map 1 ... Map m) stream output to Shuffle Services (Shuffle Service 1 ... N, N = thousands), which hold in-memory buffers for Partition 1 ... Partition r; Reducers (Reduce 1 ... Reduce r) sit downstream]
Mappers stream their output to Cosco Shuffle Services, which buffer it in memory.
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
17-20. Cosco Shuffle for Spark
[Diagram sequence: each per-partition buffer is sorted (if required by the query) and flushed to the distributed filesystem (HDFS/Warm Storage) when full; over time each partition accumulates flushed files (File 0, File 1, File 2, ...) on the DFS while mappers keep streaming and the shuffle services keep buffering in memory]
Sort and flush buffers to DFS when full.
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
21. Cosco Shuffle for Spark
[Diagram: the same picture; each reducer opens iterators over its partition's flushed DFS files]
Reducers do a streaming merge after the map stage completes.
Adapted from Cosco: An Efficient Facebook-Scale Shuffle Service, Spark Summit 2019
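A minimal sketch of the reducer-side streaming merge: each flushed chunk is already sorted by key, so the reducer can lazily merge the sorted streams and hold only one record per chunk in memory. The chunks are shown as in-memory lists here; in Cosco they would be files on the DFS:

    import heapq

    chunk_0 = [("a", 1), ("c", 3), ("f", 6)]     # flushed File 0, sorted by key
    chunk_1 = [("b", 2), ("c", 30), ("g", 7)]    # flushed File 1
    chunk_2 = [("a", 10), ("d", 4)]              # flushed File 2

    # heapq.merge lazily merges already-sorted iterators by key.
    merged = heapq.merge(chunk_0, chunk_1, chunk_2, key=lambda kv: kv[0])
    for key, value in merged:
        print(key, value)   # records arrive in key order, ready for the reduce function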
23-27. Buffering Is Appending
[Diagram sequence: Mappers (Map 1 ... Map m) send packages to Shuffle Services (1 ... N, N = thousands); each arriving package is appended to the in-memory buffer for Partition r]
Each package is a few 10s of KiB.
28. Replace DRAM with Flash for Buffering
[Diagram: the same appending picture, but the Partition r buffer now lives on flash]
Simply buffer to flash instead of memory; each package is a few 10s of KiB.
29. Replace DRAM with Flash for Buffering
▪ Appending is a friendly pattern for flash
  ▪ Minimizes flash write amplification -> minimizes wear on the drive
30. Replace DRAM with Flash for Buffering
▪ Appending is a friendly pattern for flash
  ▪ Minimizes flash write amplification -> minimizes wear on the drive
▪ Buffered data is read back to main memory for sorting
31. Replace DRAM with Flash for Buffering
▪ Appending is a friendly pattern for flash
  ▪ Minimizes flash write amplification -> minimizes wear on the drive
▪ Buffered data is read back to main memory for sorting
▪ Flash write/read latency is negligible
  ▪ Generally non-blocking
  ▪ Latency is much less than buffering time
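Putting slides 28-31 together, here is a minimal sketch of flash-backed buffering: packages are appended to a per-partition file, then read back into main memory, sorted if the query requires it, and emitted as one chunk. The directory, record framing, and flush threshold are all illustrative:

    import os, tempfile

    class FlashPartitionBuffer:
        def __init__(self, partition_id, flash_dir=None, chunk_bytes=64 * 2**20):
            flash_dir = flash_dir or tempfile.mkdtemp(prefix="cosco_flash_")
            self.path = os.path.join(flash_dir, f"partition_{partition_id}.buf")
            self.chunk_bytes = chunk_bytes
            self.bytes_buffered = 0

        def append(self, package: bytes):
            # sequential appends only -> minimal flash write amplification and wear
            with open(self.path, "ab") as f:
                f.write(package + b"\n")
            self.bytes_buffered += len(package)

        def full(self):
            return self.bytes_buffered >= self.chunk_bytes

        def flush(self, write_to_dfs, sort=True):
            # read back to main memory, sort if required, emit a single DFS chunk
            with open(self.path, "rb") as f:
                records = f.read().splitlines()
            if sort:
                records.sort()
            write_to_dfs(b"\n".join(records))
            os.remove(self.path)
            self.bytes_buffered = 0

    buf = FlashPartitionBuffer(partition_id=7, chunk_bytes=8)
    for pkg in [b"k2 v", b"k1 v", b"k3 v"]:   # real packages are a few tens of KiB
        buf.append(pkg)
    if buf.full():
        buf.flush(lambda chunk: print(chunk.decode()))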
32. Example Rule of Thumb
Would you rather consume 1 GB DRAM or flash that can endure 100 GB/day of write throughput?
▪ Hypothetical example numbers
  ▪ Assume 1 GB of flash can endure ~10 GB of writes per day for the lifetime of the device
  ▪ Assume you are indifferent between consuming 1 GB DRAM vs ~10 GB flash with write throughput at the endurance limit
  ▪ Then, you would be indifferent between consuming 1 GB DRAM vs ~100 GB/day of flash write throughput
▪ Notes
  ▪ These numbers are chosen entirely because they are round -> easier to illustrate the math on slides
  ▪ DRAM consumes more power than flash
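The arithmetic behind the rule of thumb, using the slide's round numbers (these are illustrative figures, not measured endurance ratings):

    endurance_per_gb = 10   # GB of writes per day that 1 GB of flash endures (hypothetical)
    flash_per_dram = 10     # GB of flash you would trade for 1 GB of DRAM (hypothetical)

    # 1 GB DRAM <=> 10 GB flash <=> 10 GB * 10 (GB/day)/GB = 100 GB/day of write throughput
    writes_per_day_per_gb_dram = flash_per_dram * endurance_per_gb
    print(writes_per_day_per_gb_dram, "GB/day per GB of DRAM replaced")

    # Applied to the example cluster on the next slide: 10 nodes * 100 GB buffering DRAM
    cluster_dram_gb = 10 * 100
    print(cluster_dram_gb * writes_per_day_per_gb_dram / 1000, "TB/day break-even shuffle volume")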
33. Basic Evaluation
▪ Example Cosco cluster
  ▪ 10 nodes
  ▪ Each node uses 100 GB DRAM for buffering
  ▪ And has additional DRAM for sorting, RPCs, etc.
  ▪ So, 1 TB DRAM for buffering in total
  ▪ Again, numbers are chosen for illustration only
▪ Apply the example rule of thumb
  ▪ Indifferent between consuming 1 TB DRAM vs 100 TB/day of flash endurance
▪ If this cluster shuffles less than 100 TB/day, then it is efficient to replace DRAM with flash
  ▪ Each node replaces 100 GB DRAM with ~1 TB flash for buffering
  ▪ Nodes keep some DRAM for sorting, RPCs, etc.
34. Basic Evaluation
Summary for a cluster shuffling 100 TB/day
[Diagram: ten shuffle services (Shuffle Service 1 ... 10). Before: each node has CPU, DRAM for sorting/RPCs/etc., and 100 GB of DRAM for buffering. After: each node keeps the CPU and the DRAM for sorting/RPCs/etc., but replaces the 100 GB of buffering DRAM with 1 TB of flash for buffering]
36. Two Hybrid Techniques
Two ways to use both DRAM and flash for buffering:
1. Buffer in DRAM first, flush to flash only under memory pressure
2. Buffer the fastest-filling partitions in DRAM, send the slowest-filling partitions to flash
37. Hybrid Technique #1
Take advantage of variation in shuffle workload over time
[Plot: bytes buffered in a Cosco Shuffle Service vs. time]
38. Hybrid Technique #1
Take advantage of variation in shuffle workload over time
[Plot: the bytes-buffered-over-time curve with two all-or-nothing options marked: buffer only in DRAM (1 TB) or buffer only in flash (100 TB written/day)]
39. Hybrid Technique #1
Take advantage of variation in shuffle workload over time
[Plot: buffer only in DRAM -> 1 TB; buffer only in flash -> 100 TB written/day; hybrid (buffer in DRAM and flash) -> 250 GB of DRAM plus 25 TB written/day to flash]
40. Hybrid Technique #1
Buffer in DRAM first, flush to flash only under memory pressure (250 GB DRAM, 25 TB written/day to flash)
▪ Example: 25% RAM + 25% flash supports 100% throughput
▪ Spikier workload -> more win
▪ Safer to push the system to its limits
  ▪ Run out of memory -> immediate bad consequences
  ▪ But exceed flash endurance guidelines -> okay if you make up for it by writing less in the future
41. Hybrid Technique #1
Implementation requires balancing; flash adds another dimension. How do we adapt the balancing logic?
[Side-by-side flowcharts. Pure-DRAM Cosco: when a Shuffle Service is out of DRAM, the balancing logic chooses between redirecting to another shuffle service, flushing to DFS, and backpressuring mappers. Buffer in DRAM first, flush to flash: the same options exist plus a new "flush to flash" action, and the open question (???) is how it plugs into the balancing logic]
42. Hybrid Technique #1
Buffer in DRAM first, flush to flash: plug into the pre-existing balancing logic
[Side-by-side flowcharts. Pure-DRAM Cosco: Shuffle Service is out of DRAM -> balancing logic -> redirect to another shuffle service / flush to DFS / backpressure mappers. With flash: Shuffle Service is out of DRAM -> is the flash working set smaller than THRESHOLD? If yes, flush to flash; if no, fall through to the same balancing logic]
43. Hybrid Technique #1
Plug into the pre-existing balancing logic
[Flowchart: Shuffle Service is out of DRAM -> if the flash working set is smaller than THRESHOLD, flush to flash; otherwise apply the balancing logic (redirect to another shuffle service, flush to DFS, or backpressure mappers)]
▪ THRESHOLD limits the flash working set size
▪ Configure THRESHOLD to stay under flash endurance limits
▪ Then predict cluster performance as if the working-set flash were DRAM
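A minimal sketch of the adapted balancing decision from slides 42-43; the function shape, option names, and the threshold value are illustrative, not Cosco internals:

    FLASH_WORKING_SET_THRESHOLD = 900 * 2**30   # bytes, tuned to stay under flash endurance

    def balancing_decision(out_of_dram, flash_working_set_bytes,
                           can_redirect, can_flush_to_dfs):
        if not out_of_dram:
            return "keep buffering in DRAM"
        # New branch: use flash while its working set stays under THRESHOLD,
        # which keeps the write rate inside the drive's endurance guidelines.
        if flash_working_set_bytes < FLASH_WORKING_SET_THRESHOLD:
            return "flush buffer to flash"
        # Otherwise fall back to the pre-existing pure-DRAM balancing options.
        if can_redirect:
            return "redirect partition to another shuffle service"
        if can_flush_to_dfs:
            return "flush an (undersized) chunk to DFS"
        return "backpressure mappers"

    print(balancing_decision(True, 100 * 2**30, True, True))   # -> flush buffer to flash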
44. Hybrid Technique #1
Summary
▪ Take advantage of variation in total shuffle workload over time
▪ Buffer in DRAM first, flush to flash only under memory pressure
▪ Adapt the balancing logic
45. Hybrid Technique #2
Take advantage of variation in partition fill rate
▪ Some partitions fill more slowly than others
▪ Slower partitions wear out flash less quickly
▪ So, use flash to buffer slower partitions, and use DRAM to buffer faster partitions
46. Hybrid Technique #2
Take advantage of variation in partition fill rate: illustrated with numbers
▪ DRAM: 1 TB, supporting 100K streams each buffering up to 10 MB
▪ Flash: 10 TB with 100 TB written/day
  ▪ 100K streams each writing 1 GB/day, which is 12 KB/second (sanity check: a 5-minute map stage -> a 3.6 MB partition)
  ▪ Or 200K streams each writing 6 KB/second -> these streams are better on flash
  ▪ Or 50K streams each writing 24 KB/second -> these streams would be better on DRAM
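A quick sanity check of the per-stream numbers above (the slide rounds 1 GB/day to 12 KB/second, which gives the 3.6 MB partition for a 5-minute map stage):

    stream_gb_per_day = 1
    kb_per_sec = stream_gb_per_day * 1_000_000 / 86_400   # ~11.6, rounded to ~12 KB/s
    partition_mb = 12 * 5 * 60 / 1_000                    # 12 KB/s over a 5-minute map stage
    print(round(kb_per_sec, 1), "KB/s per stream;", partition_mb, "MB per partition")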
47. Hybrid Technique #2
Buffer the fastest-filling partitions in DRAM and the slowest-filling partitions in flash
▪ Technique
  ▪ Periodically measure each partition's fill rate
  ▪ If the fill rate is less than a threshold in KB/s, then buffer the partition's data in flash
  ▪ Else, buffer the partition's data in DRAM
▪ Evaluation
  ▪ Assume the "break-even" threshold of 12 KB/s from the previous slide
  ▪ Suppose that 50% of buffer time is spent on partitions that are slower than 12 KB/s
  ▪ Suppose these slow partitions write an average of 3 KB/s
  ▪ Then, you can replace half of your buffering DRAM with 25% as much flash
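A minimal sketch of the routing rule plus the back-of-envelope evaluation above; the threshold and rates are the slide's illustrative figures:

    BREAK_EVEN_KB_S = 12   # fill rates below this are cheaper to buffer on flash

    def choose_buffer(measured_fill_rate_kb_s):
        # periodically re-measured per partition; slow fillers go to flash
        return "flash" if measured_fill_rate_kb_s < BREAK_EVEN_KB_S else "dram"

    # Evaluation: 50% of buffer time is on partitions slower than 12 KB/s, and those
    # slow partitions average 3 KB/s, so the flash replacing that DRAM only needs
    # 3/12 = 25% of the displaced capacity.
    dram_fraction_replaced = 0.5
    flash_needed_vs_displaced_dram = 3 / BREAK_EVEN_KB_S
    print(choose_buffer(6), dram_fraction_replaced, flash_needed_vs_displaced_dram)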
48. Hybrid Technique #2
Real-world partition fill rates
[Plots: partition fill rate vs. percentile of partitions, shown on linear and log scales; fill rates range from ~0 KiB/sec at the 1st percentile to MiB's/sec at the 99th percentile]
49. Hybrid Technique #2
Real-world partition fill rates
[Plots: fill rate vs. percentile of partitions and vs. percentile of partitions weighted by buffering time, on linear and log scales; again ranging from ~0 KiB/sec at the 1st percentile to MiB's/sec at the 99th percentile]
50. Combine both hybrid techniques
Buffer in DRAM first, then send the slowest partitions to flash when under memory pressure
▪ Evaluation
  ▪ Difficult theoretical estimation
  ▪ Or, do a discrete-event simulation -> later in this presentation
52. Lower-Latency Queries
Made possible by flash
▪ Serve shuffle data directly from flash for some jobs
  ▪ This is "free" until the flash drive gets so full that its write amplification factor increases (~80% full)
  ▪ Prioritize interactive/low-latency queries to be served from flash
▪ Buffer bigger chunks to decrease reducer-side merging
  ▪ Fewer chunks -> less chance that a reducer needs to do an external merge
53. Further Efficiency Wins
Made possible by flash
▪ Decrease the Cosco replication factor, since flash is non-volatile
  ▪ Currently Cosco replication is R2: each map output byte is stored on two shuffle services until it is flushed to the durable DFS
  ▪ Most Shuffle Service crashes in production are resolved in a few minutes with a process restart
  ▪ Decrease Cosco replication to R1 for some queries, and attempt to automatically recover map output data from flash after a restart
▪ Buffer bigger chunks to allow more efficient Reed-Solomon encodings on DFS
55. Practical Evaluation Techniques
▪ Discrete event simulation
▪ Synthetic load generation on a test cluster
▪ Shadow testing on a test cluster
▪ Special canary in a production cluster
57. Discrete Event Simulation
Example
[Diagram: a Shuffle Service Model buffering Partition 3 and Partition 42, feeding a DFS Model. Simulation state at time 00h:01m:30.000s: total KB written to flash 9,000; overall avg file size written to DFS: NaN]
58-65. Discrete Event Simulation (continued)
[The example steps through discrete events roughly every 250-500 ms: by 00h:01m:32.000s the total written to flash has grown from 9,000 KB to 9,300 KB while the overall avg file size written to DFS stays NaN; Partition 3 is then sorted and flushed, producing File 0 in the DFS model and changing the overall avg DFS file size from NaN to 9,200; by 00h:01m:32.500s buffering continues with 9,350 KB written to flash]
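A minimal discrete-event simulation sketch in the spirit of slides 57-65; the event format, chunk threshold, and toy trace are invented for illustration:

    import heapq

    CHUNK_BYTES = 9_200 * 1024   # illustrative flush threshold

    def simulate(events):
        # events are (time, partition, bytes_appended); a min-heap keeps them in time order
        buffers, flash_kb_written, dfs_file_sizes = {}, 0, []
        heapq.heapify(events)
        while events:
            _time, partition, nbytes = heapq.heappop(events)
            flash_kb_written += nbytes // 1024
            buffers[partition] = buffers.get(partition, 0) + nbytes
            if buffers[partition] >= CHUNK_BYTES:          # sort & flush to the DFS model
                dfs_file_sizes.append(buffers.pop(partition))
        avg_dfs_file = (sum(dfs_file_sizes) / len(dfs_file_sizes)
                        if dfs_file_sizes else float("nan"))
        return flash_kb_written, avg_dfs_file

    # Toy trace: partition 3 fills quickly, partition 42 trickles in.
    trace = ([(t, 3, 50 * 1024) for t in range(200)] +
             [(t, 42, 10 * 1024) for t in range(200)])
    print(simulate(trace))   # (total KB written to flash, avg DFS file size in bytes)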
66. Discrete Event Simulation
Drive the simulation based on production data: the cosco_chunks dataset
Partition | Shuffle Service ID | Chunk (DFS file) number | Chunk Start Time | Chunk Size | Chunk Buffering Time | Chunk Fill Rate (derived from size and buffering time)
3 | 10 | 5 | 2020-05-19 00:00:00.000 | 10 MiB | 5000 ms | 2 MiB/s
42 | 10 | 2 | 2020-05-19 00:01:00.000 | 31 MiB | 10000 ms | 3.1 MiB/s
...
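A small sketch of how the derived fill-rate column can be computed from such a trace, using two rows mirroring the table above (the column names are mine, not the production schema):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    chunks = spark.createDataFrame(
        [(3, 10, 5, "2020-05-19 00:00:00.000", 10.0, 5000),
         (42, 10, 2, "2020-05-19 00:01:00.000", 31.0, 10000)],
        ["partition", "shuffle_service_id", "chunk_number",
         "chunk_start_time", "chunk_size_mib", "chunk_buffering_ms"])

    # Fill rate (MiB/s) = chunk size / buffering time, as on the slide: 2 and 3.1 MiB/s.
    chunks.withColumn("fill_rate_mib_s",
                      F.col("chunk_size_mib") / (F.col("chunk_buffering_ms") / 1000)).show()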
67. Canary on a Production Cluster
▪ Many important metrics are observed on mappers
▪ Example: “percentage of task time spent shuffling”
▪ Example: “map task success rate”
▪ Problem: Mappers talk to many Shuffle Services
▪ Simultaneously
▪ Dynamic balancing can re-route to different Shuffle Services
▪ Solution: Subclusters
▪ Pre-existing feature for large clusters
▪ Each Shuffle Service belongs to one subcluster
▪ Each mapper is assigned to one subcluster, and only uses Shuffle Services in that subcluster
▪ Compare performance of subclusters that contain flash machines vs subclusters that don’t