In this presentation, we discuss the internals of the Spark DataFrame API. All the code discussed in this presentation is available at https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
- Topics, partitions and segments
- The commit log and streams
- Brokers and broker replication
- Producer basics
- Consumers, consumer groups and offsets
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
Introduction to Apache Flink - Fast and reliable big data processing - Till Rohrmann
This presentation introduces Apache Flink, a massively parallel data processing engine which is currently undergoing the incubation process at the Apache Software Foundation. Flink's programming primitives are presented, and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it a unique system in the world of Big Data processing.
Deep Dive into the New Features of Apache Spark 3.0 - Databricks
Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.0 extends its scope with more than 3000 resolved JIRAs. We will talk about the exciting new developments in Spark 3.0 as well as some other major initiatives that are coming in the future.
Apache Spark is a fast, general engine for large-scale data processing. It provides a unified analytics engine for batch, interactive, and stream processing using an in-memory abstraction called resilient distributed datasets (RDDs). Spark's speed comes from its ability to run computations directly on data stored in cluster memory and optimize performance through caching. It also integrates well with other big data technologies like HDFS, Hive, and HBase. Many large companies are using Spark for its speed, ease of use, and support for multiple workloads and languages.
Apache Beam is a unified programming model for batch and streaming data processing. It defines concepts for describing what computations to perform (the transformations), where the data is located in time (windowing), when to emit results (triggering), and how to accumulate results over time (accumulation mode). Beam aims to provide portable pipelines across multiple execution engines, including Apache Flink, Apache Spark, and Google Cloud Dataflow. The talk will cover the key concepts of the Beam model and how it provides unified, efficient, and portable data processing pipelines.
Test strategies for data processing pipelines - Lars Albertsson
This document discusses strategies for testing data processing pipelines. It begins by introducing various companies and speakers working with data applications and pipelines. It then covers topics like the anatomy of streaming and batch data pipelines, suitable test seams, test scopes from unit to integration, and strategies for testing streaming jobs, batch pipelines, and data quality. Anti-patterns for data pipeline testing are also discussed.
Spark SQL Deep Dive @ Melbourne Spark Meetup - Databricks
This document summarizes a presentation on Spark SQL and its capabilities. Spark SQL allows users to run SQL queries on Spark, including HiveQL queries with UDFs, UDAFs, and SerDes. It provides a unified interface for reading and writing data in various formats. Spark SQL also allows users to express common operations like selecting columns, joining data, and aggregation concisely through its DataFrame API. This reduces the amount of code users need to write compared to lower-level APIs like RDDs.
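As a rough illustration of the "less code" point, here is a minimal Scala sketch of the DataFrame DSL (filter, group, aggregate). The JSON path and column names are hypothetical, it assumes a Spark 2.x+ SparkSession, and it is not code from the talk itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataFrameDslExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-dsl").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical input: a JSON file with name, age and city columns
    val people = spark.read.json("/tmp/people.json")

    // Select, filter and aggregate in a few lines; the optimizer plans the execution
    people
      .filter($"age" > 21)
      .groupBy($"city")
      .agg(count(lit(1)).as("adults"), avg($"age").as("avg_age"))
      .show()

    spark.stop()
  }
}
```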
This document provides an introduction to Apache Flink. It begins with an overview of the presenters and structure of the presentation. It then discusses Flink's APIs, architecture, and execution model. Key concepts are explained like streaming vs batch processing, scaling, the job manager and task managers. It provides a demo of Flink's DataSet API for batch processing and explains a WordCount example program. The goal is to get attendees started with Apache Flink.
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture - Kai Wähner
Apache Kafka in conjunction with Apache Spark became the de facto standard for processing and analyzing data. Both frameworks are open, flexible, and scalable.
Unfortunately, the latter makes operations a challenge for many teams. Ideally, teams can use serverless SaaS offerings to focus on business logic. However, hybrid and multi-cloud scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden.
This session explores different architectures to build serverless Apache Kafka and Apache Spark multi-cloud architectures across regions and continents.
We start from the analytics perspective of a data lake and explore its relation to a fully integrated data streaming layer with Kafka to build a modern Data Lakehouse.
Real-world use cases show the joint value and explore the benefit of the "delta lake" integration.
Performant Streaming in Production: Preventing Common Pitfalls when Productio... - Databricks
Running a stream in a development environment is relatively easy. However, some topics can cause serious issues in production when they are not addressed properly.
Kafka is an open-source distributed commit log service that provides high-throughput messaging functionality. It is designed to handle large volumes of data and different use cases like online and offline processing more efficiently than alternatives like RabbitMQ. Kafka works by partitioning topics into segments spread across clusters of machines, and replicates across these partitions for fault tolerance. It can be used as a central data hub or pipeline for collecting, transforming, and streaming data between systems and applications.
Common issues with Apache Kafka® Producer - confluent
Badai Aqrandista, Confluent, Senior Technical Support Engineer
This session will be about a common issue in the Kafka Producer: producer batch expiry. We will be discussing the Kafka Producer internals, its common causes, such as a slow network or small batching, and how to overcome them. We will also be sharing some examples along the way!
https://www.meetup.com/apache-kafka-sydney/events/279651982/
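A hedged Scala sketch of the producer settings that usually drive batch expiry (batch size, linger, delivery timeout). The broker address, topic and values are illustrative only, `delivery.timeout.ms` assumes Kafka clients 2.1+, and none of this is a recommendation from the session itself.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object ProducerBatchingSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // illustrative broker
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

    // Larger batches plus a small linger reduce tiny batches on a slow network;
    // delivery.timeout.ms bounds how long a batch may wait before it expires.
    props.put(ProducerConfig.BATCH_SIZE_CONFIG, (64 * 1024).toString)
    props.put(ProducerConfig.LINGER_MS_CONFIG, "20")
    props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000")

    val producer = new KafkaProducer[String, String](props)
    producer.send(new ProducerRecord[String, String]("demo-topic", "key", "value"))
    producer.flush()
    producer.close()
  }
}
```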
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai - Databricks
Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees.
In this talk, Yin explores a modular compiler frontend for Spark based on this library that includes a query analyzer, optimizer, and an execution planner. Yin offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how developers can extend it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.
Solving Enterprise Data Challenges with Apache Arrow - Wes McKinney
This document discusses Apache Arrow, an open-source library that enables fast and efficient data interchange and processing. It summarizes the growth of Arrow and its ecosystem, including new features like the Arrow C++ query engine and Arrow Rust DataFusion. It also highlights how enterprises are using Arrow to solve challenges around data interoperability, access speed, query performance, and embeddable analytics. Case studies describe how companies like Microsoft, Google Cloud, Snowflake, and Meta leverage Arrow in their products and platforms. The presenter promotes Voltron Data's enterprise subscription and upcoming conference to support business use of Apache Arrow.
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha... - Databricks
The document discusses code optimization techniques in Spark SQL's Catalyst optimizer. It describes how function outlining can improve performance of generated Java code by splitting large routines into smaller ones. The document outlines a Spark SQL query optimization case study where outlining a 300+ line routine from Catalyst code generation improved query performance by up to 19% on a Power8 cluster. Overall, the document examines how function outlining and other code generation optimizations in Catalyst can help the Java JIT compiler better optimize Spark SQL queries.
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by Robert Metzger
Apache Flink is an open source platform: a streaming dataflow engine that provides communication, fault tolerance, and data distribution for distributed computations over data streams. Flink is a top-level Apache project. Flink is a scalable data analytics framework that is fully compatible with Hadoop, and it can execute both stream processing and batch processing easily.
Apache Iceberg - A Table Format for Huge Analytic Datasets - Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
A Deep Dive into Query Execution Engine of Spark SQL - Databricks
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. The relational queries are compiled to the executable physical plans consisting of transformations and actions on RDDs with the generated Java code. The code is compiled to Java bytecode, executed at runtime by JVM and optimized by JIT to native machine code at runtime. This talk will take a deep dive into Spark SQL execution engine. The talk includes pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, lineage based RDD transformation and action.
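A small sketch of how whole-stage code generation shows up in practice: operators that are fused into a single generated Java function appear under WholeStageCodegen (marked with `*`) in the physical plan. This assumes a plain local SparkSession and is not taken from the talk.

```scala
import org.apache.spark.sql.SparkSession

object WholeStageCodegenPeek {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("codegen").master("local[*]").getOrCreate()

    val df = spark.range(0, 1000000)
      .selectExpr("id", "id % 10 AS bucket")
      .filter("bucket = 3")
      .groupBy("bucket")
      .count()

    // The physical plan marks code-generated operators; on recent Spark versions
    // df.explain("codegen") also prints the generated Java source.
    df.explain(true)

    spark.stop()
  }
}
```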
Building a large-scale transactional data lake using Apache Hudi - Bill Liu
Data is a critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business-critical data pipelines at low latency and high efficiency; it helps distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what Apache Hudi is and its architectural design, and then take a deep dive into how it improves data operations by providing features such as data versioning and time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
Hortonworks Technical Workshop: Interactive Query with Apache Hive - Hortonworks
Apache Hive is the de facto standard for SQL queries over petabytes of data in Hadoop. It is a comprehensive and compliant engine that offers the broadest range of SQL semantics for Hadoop, providing a powerful set of tools for analysts and developers to access Hadoop data. The session will cover the latest advancements in Hive and provide practical tips for maximizing Hive performance.
Audience: Developers, Architects and System Engineers from the Hortonworks Technology Partner community.
Recording: https://hortonworks.webex.com/hortonworks/lsr.php?RCID=7c8f800cbbef256680db14c78b871f97
Kafka Tutorial - introduction to the Kafka streaming platform - Jean-Paul Azar
The document discusses Kafka, an open-source distributed event streaming platform. It provides an introduction to Kafka and describes how it is used by many large companies to process streaming data in real-time. Key aspects of Kafka explained include topics, partitions, producers, consumers, consumer groups, and how Kafka is able to achieve high performance through its architecture and design.
Data Security at Scale through Spark and Parquet Encryption - Databricks
Apple logo is a trademark of Apple Inc. This presentation discusses Parquet encryption at scale using Spark and Parquet. It covers goals of Parquet modular encryption including data privacy, integrity, and performance. It demonstrates writing and reading encrypted Parquet files in Spark and discusses the Apache community roadmap for further integration of Parquet encryption.
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro - Databricks
Zstandard is a fast compression algorithm which you can use in Apache Spark in various ways. In this talk, I briefly summarize the evolution of Apache Spark in this area, four main use cases, their benefits, and the next steps:
1) ZStandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It’s beneficial not only when you use `emptyDir` with the `memory` medium; it also maximizes the OS cache benefit when you use shared SSDs or container-local storage. In Spark 3.2, SPARK-34390 takes advantage of the ZStandard buffer pool feature, and its performance gain is impressive, too.
2) Event log compression is another area to save storage cost on cloud storage like S3 and to improve usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 supports Zstandard already, and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There are more community works to utilize Zstandard to improve Spark. For example, Apache Avro community also supports Zstandard and SPARK-34479 aims to support Zstandard in Spark’s avro file format in Spark 3.2.0.
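A hedged configuration sketch that pulls these uses together in one place; the options require sufficiently recent Spark/ORC/Parquet versions as described above, event log compression only applies when event logging is enabled, and the paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object ZstdUsageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("zstd")
      .master("local[*]")
      // Shuffle and other local IO compression
      .config("spark.io.compression.codec", "zstd")
      // Event log compression (takes effect when spark.eventLog.enabled is set)
      .config("spark.eventLog.compress", "true")
      .getOrCreate()

    val df = spark.range(0, 1000).toDF("id")

    // Data file compression; needs ORC 1.6+ / Parquet 1.12+ support as noted above
    df.write.option("compression", "zstd").mode("overwrite").orc("/tmp/zstd_orc")
    df.write.option("compression", "zstd").mode("overwrite").parquet("/tmp/zstd_parquet")

    spark.stop()
  }
}
```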
The document introduces the ELK stack, which consists of Elasticsearch, Logstash, Kibana, and Beats. Beats ship log and operational data to Elasticsearch. Logstash ingests, transforms, and sends data to Elasticsearch. Elasticsearch stores and indexes the data. Kibana allows users to visualize and interact with data stored in Elasticsearch. The document provides descriptions of each component and their roles. It also includes configuration examples and demonstrates how to access Elasticsearch via REST.
High Performance Data Lake with Apache Hudi and Alluxio at T3Go - Alluxio, Inc.
Data Orchestration Summit 2020 organized by Alluxio
https://www.alluxio.io/data-orchestration-summit-2020/
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
Trevor Zhang & Vino Yang (T3Go)
About Alluxio: alluxio.io
Engage with the open source community on slack: alluxio.io/slack
Catalyst is a framework that defines relational operators and expressions for Spark SQL. It converts DataFrame queries into logical and physical query plans. The logical plans are optimized through rules before being converted to Spark plans by strategies. These strategies implement the query planner to transform logical plans into physical plans that can be executed on RDDs. Key components include the analyzer, optimizer, query planner and strategies like filtering that generate Spark plans from logical plans.
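The plan objects described above can be inspected directly from a DataFrame's `queryExecution`; a minimal sketch with made-up data and column names:

```scala
import org.apache.spark.sql.SparkSession

object CatalystPlanInspection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("catalyst-plans").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "tag")
      .filter($"id" > 1)
      .groupBy($"tag")
      .count()

    val qe = df.queryExecution
    println(qe.logical)       // logical plan built from the DataFrame query
    println(qe.analyzed)      // after the analyzer resolves attributes
    println(qe.optimizedPlan) // after optimizer rules have run
    println(qe.sparkPlan)     // physical plan chosen by the planner strategies
    println(qe.executedPlan)  // physical plan prepared for execution on RDDs

    spark.stop()
  }
}
```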
Introducing DataFrames in Spark for Large Scale Data Science - Databricks
View video of this presentation here: https://www.youtube.com/watch?v=vxeLcoELaP4
Introducing DataFrames in Spark for Large-scale Data Science
This document provides an overview of Spark Catalyst including:
- Catalyst trees and expressions represent logical and physical query plans
- Expressions have datatypes and operate on Row objects
- Custom expressions can be defined
- Code generation improves expression evaluation performance by generating Java code via Janino compiler
- Key concepts like trees, expressions, datatypes, rows, code generation and the Janino compiler are explained through examples (a short inspection sketch follows)
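To make the code-generation point concrete, Spark ships a debug helper that prints the Java source generated for a plan (which Janino then compiles). A minimal sketch, assuming Spark 2.x or later and a local session:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._ // adds debugCodegen() to Datasets

object CodegenInspection {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("codegen-inspect").master("local[*]").getOrCreate()

    val df = spark.range(0, 100)
      .selectExpr("id * 2 AS doubled")
      .filter("doubled > 10")

    // Prints the whole-stage generated Java code that Janino compiles at runtime
    df.debugCodegen()

    spark.stop()
  }
}
```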
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer - Sachin Aggarwal
RDD recap
Spark SQL library
Architecture of Spark SQL
Comparison with Pig and Hive Pipeline
DataFrames
Definition of a DataFrames API
DataFrames Operations
DataFrames features
Data cleansing
Diagram for logical plan container
Plan Optimization & Execution
Catalyst Analyzer
Catalyst Optimizer
Generating Physical Plan
Code Generation
Extensions
Anatomy of Data Source API : A deep dive into Spark Data source API - datamantra
In this presentation, we discuss how to build a data source from scratch using the Spark Data Source API. All the code discussed in this presentation is available at https://github.com/phatak-dev/anatomy_of_spark_datasource_api
Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer - Databricks
Catalyst is becoming one of the most important components in Apache Spark, as it underpins all the major new APIs in Spark 2.0, from DataFrames, Datasets, to streaming. At its core, Catalyst is a general library for manipulating trees. Based on this library, we have built a modular compiler frontend for Spark, including a query analyzer, optimizer, and an execution planner. In this talk, I will first introduce the concepts of Catalyst trees, followed by major features that were added in order to support Spark’s powerful API abstractions. Audience will walk away with a deeper understanding of how Spark 2.0 works under the hood.
Beyond SQL: Speeding up Spark with DataFrames - Databricks
This document summarizes Spark SQL and DataFrames in Spark. It notes that Spark SQL is part of the core Spark distribution and allows running SQL and HiveQL queries. DataFrames provide a way to select, filter, aggregate and plot structured data like in R and Pandas. DataFrames allow writing less code through a high-level API and reading less data by using optimized formats and partitioning. The optimizer can optimize queries across functions and push down predicates to read less data. This allows creating and running Spark programs faster.
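A small sketch of the "read less data" point: with a columnar format such as Parquet, column pruning and filter pushdown are visible in the scan node of the physical plan (ReadSchema and PushedFilters). Paths and column names are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object PushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pushdown").master("local[*]").getOrCreate()
    import spark.implicits._

    // Write a tiny Parquet dataset so the example is self-contained
    Seq((1, "US", 10.0), (2, "IN", 20.0), (3, "US", 30.0))
      .toDF("id", "country", "amount")
      .write.mode("overwrite").parquet("/tmp/sales_parquet")

    val result = spark.read.parquet("/tmp/sales_parquet")
      .select($"country", $"amount")  // column pruning: only two columns are read
      .filter($"country" === "US")    // predicate pushed down to the Parquet reader

    result.explain(true)              // look for ReadSchema and PushedFilters in the scan
    result.show()

    spark.stop()
  }
}
```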
This document provides an overview of functional programming concepts in Scala. It discusses the history and advantages of functional programming. It then covers the basics of Scala including its support for object oriented and functional programming. Key functional programming aspects of Scala like immutable data, higher order functions, and implicit parameters are explained with examples.
The document provides an overview of Apache Spark internals and Resilient Distributed Datasets (RDDs). It discusses:
- RDDs are Spark's fundamental data structure - they are immutable distributed collections that allow transformations like map and filter to be applied.
- RDDs track their lineage or dependency graph to support fault tolerance. Transformations create new RDDs while actions trigger computation.
- Operations on RDDs include narrow transformations like map that don't require data shuffling, and wide transformations like join that do require shuffling.
- The RDD abstraction allows Spark's scheduler to optimize execution through techniques like pipelining and cache reuse (see the sketch below).
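A short, self-contained sketch of those ideas: narrow vs wide transformations, the lineage string, and the fact that nothing runs until an action is called. It uses the classic RDD API with made-up data.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddLineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-lineage").setMaster("local[*]"))

    val words = sc.parallelize(Seq("spark", "rdd", "spark", "lineage"))

    // Narrow transformations: map/filter work partition by partition, no shuffle
    val pairs = words.map(w => (w, 1)).filter(_._1.nonEmpty)

    // Wide transformation: reduceByKey shuffles data across partitions
    val counts = pairs.reduceByKey(_ + _)

    // Lineage (dependency graph) that supports fault tolerance
    println(counts.toDebugString)

    // Nothing has executed yet; the action below triggers the computation
    counts.collect().foreach(println)

    sc.stop()
  }
}
```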
1) Spark 1.0 was released in 2014 as the first production-ready version containing Spark batch, streaming, Shark, and machine learning libraries.
2) By 2014, most big data processing used higher-level tools like Hive and Pig on structured data rather than the original MapReduce assumption of only unstructured data.
3) Spark evolved to support structured data through the DataFrame API in versions 1.2-1.3, providing a unified way to read from structured sources.
This document provides an introduction and overview of Spark's Dataset API. It discusses how Dataset combines the best aspects of RDDs and DataFrames into a single API, providing strongly typed transformations on structured data. The document also covers how Dataset moves Spark away from RDDs towards a more SQL-like programming model and optimized data handling. Key topics include the Spark Session entry point, differences between DataFrames and Datasets, and examples of Dataset operations like word count.
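A minimal word-count sketch in the typed Dataset API, assuming Spark 2.x+ and a placeholder input path; it is only meant to illustrate the strongly typed transformations mentioned above.

```scala
import org.apache.spark.sql.SparkSession

object DatasetWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ds-wordcount").master("local[*]").getOrCreate()
    import spark.implicits._

    val lines = spark.read.textFile("/tmp/input.txt") // Dataset[String]

    val counts = lines
      .flatMap(_.toLowerCase.split("\\s+"))
      .filter(_.nonEmpty)
      .groupByKey(identity) // typed grouping
      .count()              // Dataset[(String, Long)]

    counts.show()
    spark.stop()
  }
}
```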
Deep Dive Into Catalyst: Apache Spark 2.0's Optimizer - Spark Summit
This document discusses Catalyst, the query optimizer in Apache Spark. It begins by explaining how Catalyst works at a high level, including how it abstracts user programs as trees and uses transformations and strategies to optimize logical and physical plans. It then provides more details on specific aspects like rule execution, ensuring requirements, and examples of optimizations. The document aims to help users understand how Catalyst optimizes queries automatically and provides tips on exploring its code and writing optimizations.
Spark uses Resilient Distributed Datasets (RDDs) as its fundamental data structure. RDDs are immutable, lazy evaluated collections of data that can be operated on in parallel. This allows RDD transformations to be computed lazily and combined for better performance. RDDs also support type inference and caching to improve efficiency. Spark programs run by submitting jobs to a cluster manager like YARN or Mesos, which then schedule tasks across worker nodes where the lazy transformations are executed.
Video of the presentation can be seen here: https://www.youtube.com/watch?v=uxuLRiNoDio
The Data Source API in Spark is a convenient feature that enables developers to write libraries to connect to data stored in various sources with Spark. Equipped with the Data Source API, users can load/save data from/to different data formats and systems with minimal setup and configuration. In this talk, we introduce the Data Source API and the unified load/save functions built on top of it. Then, we show examples to demonstrate how to build a data source library.
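A minimal sketch of the unified load/save functions built on top of the Data Source API; the formats shown are built-in and the paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object LoadSaveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("load-save").master("local[*]").getOrCreate()

    // Unified load: choose a format and a path
    val df = spark.read.format("json").load("/tmp/events.json")

    // Unified save: same API, different format and options
    df.write
      .format("parquet")
      .mode("overwrite")
      .save("/tmp/events_parquet")

    // Shorthand methods delegate to the same mechanism
    spark.read.parquet("/tmp/events_parquet").show(5)

    spark.stop()
  }
}
```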
Spark SQL is a component of Apache Spark that introduces SQL support. It includes a DataFrame API that allows users to write SQL queries on Spark, a Catalyst optimizer that converts logical queries to physical plans, and data source APIs that provide a unified way to read/write data in various formats. Spark SQL aims to make SQL queries on Spark more efficient and extensible.
Spark Streaming 2.0 introduces Structured Streaming which addresses some areas for improvement in Spark Streaming 1.X. Structured Streaming builds streaming queries on the Spark SQL engine, providing implicit benefits like extending the primary batch API to streaming and gaining an optimizer. It introduces a more seamless API between batch and stream processing, supports event time semantics, and provides end-to-end fault tolerance guarantees through checkpointing. Structured Streaming also aims to simplify streaming application development by managing streaming queries and allowing continuous queries to be started, stopped, and modified more gracefully.
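A tiny Structured Streaming sketch using the built-in rate source and the console sink, with a checkpoint directory for fault tolerance; paths and trigger values are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object StructuredStreamingHello {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("structured-hello").master("local[*]").getOrCreate()

    // The 'rate' source generates (timestamp, value) rows for testing
    val stream = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

    val query = stream
      .selectExpr("value", "value % 2 AS parity")
      .writeStream
      .format("console")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .option("checkpointLocation", "/tmp/checkpoints/hello")
      .start()

    query.awaitTermination()
  }
}
```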
Introduction to Structured Data Processing with Spark SQL - datamantra
An introduction to structured data processing using the Data Source and DataFrame APIs of Spark. Presented at Bangalore Apache Spark Meetup by Madhukara Phatak on 31/05/2015.
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python - Miklos Christine
Apache Spark is the next big data processing tool for data scientists. As seen in a recent StackOverflow analysis, it's the hottest big data technology on their site! In this talk, I'll use the PySpark interface to leverage the speed and performance of Apache Spark. I'll focus on the end-to-end workflow for getting data into a distributed platform, and leverage Spark to process the data for advanced analytics. I'll discuss the popular Spark APIs used for data preparation, SQL analysis, and ML algorithms. I'll explain the performance differences between Scala and Python, and how Spark has bridged the gap in performance. I'll focus on PySpark as the interface to the platform, and walk through a demo to showcase the APIs.
Talk Overview:
Spark's architecture: what's out now and what's in Spark 2.0
Spark APIs: the most common APIs used by Spark
Common misconceptions and proper techniques for using Spark
Demo:
Walk through ETL of the Reddit dataset
Spark SQL analytics and visualizations of the dataset using Matplotlib
Sentiment analysis on Reddit comments
Apache Calcite: One Frontend to Rule Them All - Michael Mior
Apache Calcite is an open source framework that allows for a unified query interface over heterogeneous data sources. It provides an ANSI-compliant SQL parser, a logical query optimizer, and acts as a middleware layer that can integrate data from multiple sources. Calcite uses a relational algebra approach and has pluggable adapters that allow it to connect to different backends like MySQL, MongoDB, and streaming data sources. It supports features like SQL queries, views, optimization rules, and works across both batch and streaming data. The project aims to continue adding new capabilities like geospatial queries and improved cost modeling.
This Knolx session was about Spark Structured Streaming. The focus was on the differences between the three APIs (RDD, DataFrame, and Dataset) and on key concepts of Structured Streaming such as schema, output modes, and operations like selection, projection, aggregation, and windowing.
Fast federated SQL with Apache Calcite - Chris Baynes
This document discusses Apache Calcite, an open source framework for federated SQL queries. It provides an introduction to Calcite and its components. It then evaluates Calcite's performance on single data sources through benchmarks. Lastly, it proposes a hybrid approach to enable efficient federated queries using Calcite and Spark.
Catalyst optimizer optimizes queries written in Spark SQL and DataFrame API to run faster. It uses both rule-based and cost-based optimization. Rule-based optimization applies rules to determine query execution, while cost-based generates multiple plans and selects the most efficient. Catalyst optimizer transforms logical plans through four phases - analysis, logical optimization, physical planning, and code generation. It represents queries as trees that can be manipulated using pattern matching rules to optimize queries.
This document discusses best practices for migrating Spark applications from version 1.x to 2.0. It covers new features in Spark 2.0 like the Dataset API, catalog API, subqueries and checkpointing for iterative algorithms. The document recommends changes to existing best practices around choice of serializer, cache format, use of broadcast variables and choice of cluster manager. It also discusses how Spark 2.0's improved SQL support impacts use of HiveContext.
Laskar: High-Velocity GraphQL & Lambda-based Software Development Model - Garindra Prahandono
Sale Stock Engineering, represented by Garindra Prahandono, presents "High-Velocity GraphQL & Lambda-based Software Development Model" at the BandungJS event on May 14th, 2018.
Spark SQL allows users to perform relational operations on Spark's RDDs using a DataFrame API. It addresses challenges in existing systems like limited optimization and data sources by providing a DataFrame API that can query both external data and RDDs. Spark SQL leverages a highly extensible optimizer called Catalyst to optimize logical query plans into efficient physical query plans using features of Scala. It has been part of the Spark core distribution since version 1.0 in 2014.
Introduction to Spark Datasets - Functional and relational together at last - Holden Karau
Spark Datasets are an evolution of Spark DataFrames which allow us to work with both functional and relational transformations on big data with the speed of Spark.
This document introduces Spark 2.0 and its key features, including the Dataset abstraction, Spark Session API, moving from RDDs to Datasets, Dataset and DataFrame APIs, handling time windows, and adding custom optimizations. The major focus of Spark 2.0 is standardizing on the Dataset abstraction and improving performance by an order of magnitude. Datasets provide a strongly typed API that combines the best of RDDs and DataFrames.
Recent Developments In SparkR For Advanced Analytics - Databricks
Since its introduction in Spark 1.4, SparkR has received contributions from both the Spark community and the R community. In this talk, we will summarize recent community efforts on extending SparkR for scalable advanced analytics. We start with the computation of summary statistics on distributed datasets, including single-pass approximate algorithms. Then we demonstrate MLlib machine learning algorithms that have been ported to SparkR and compare them with existing solutions on R, e.g., generalized linear models, classification and clustering algorithms. We also show how to integrate existing R packages with SparkR to accelerate existing R workflows.
DataMass Summit - Machine Learning for Big Data in SQL Server - Łukasz Grala
A session showcasing Machine Learning Server (machine learning algorithms in R and Python), as well as the ability to work with JSON data in SQL Server and to connect to data residing in HDFS, Hadoop, or Spark through PolyBase in SQL Server, so that this data can be used for analysis and prediction with models written in R or Python.
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK - zmhassan
As spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized spark environment. In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Apache Spark has been gaining steam, with rapidity, both in the headlines and in real-world adoption. Spark was developed in 2009, and open sourced in 2010. Since then, it has grown to become one of the largest open source communities in big data with over 200 contributors from more than 50 organizations. This open source analytics engine stands out for its ability to process large volumes of data significantly faster than contemporaries such as MapReduce, primarily owing to in-memory storage of data on its own processing framework. That being said, one of the top real-world industry use cases for Apache Spark is its ability to process ‘streaming data‘.
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... - Databricks
As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized spark environment.
In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Multi Source Data Analysis using Spark and Tellius - datamantra
Multi Source Data Analysis Using Apache Spark and Tellius
This document discusses analyzing data from multiple sources using Apache Spark and the Tellius platform. It covers loading data from different sources like databases and files into Spark DataFrames, defining a data model by joining the sources, and performing analysis like calculating revenues by department across sources. It also discusses challenges like double counting values when directly querying the joined data. The Tellius platform addresses this by implementing a custom query layer on top of Spark SQL to enable accurate multi-source analysis.
State management in Structured Streaming - datamantra
This document discusses state management in Apache Spark Structured Streaming. It begins by introducing Structured Streaming and differentiating between stateless and stateful stream processing. It then explains the need for state stores to manage intermediate data in stateful processing. It describes how state was managed inefficiently in old Spark Streaming using RDDs and snapshots, and how Structured Streaming improved on this with its decoupled, asynchronous, and incremental state persistence approach. The document outlines Apache Spark's implementation of storing state to HDFS and the involved code entities. It closes by discussing potential issues with this approach and how embedded stores like RocksDB may help address them in production stream processing systems.
Spark can run on Kubernetes containers in two ways - as a static cluster or with native integration. As a static cluster, Spark pods are manually deployed without autoscaling. Native integration treats Kubernetes as a resource manager, allowing Spark to dynamically acquire and release containers like in YARN. It uses Kubernetes custom controllers to create driver pods that then launch worker pods. This provides autoscaling of resources based on job demands.
Understanding transactional writes in datasource v2 - datamantra
This document discusses the new Transactional Writes in Datasource V2 API introduced in Spark 2.3. It outlines the shortcomings of the previous V1 write API, specifically the lack of transaction support. It then describes the anatomy of the new V2 write API, including interfaces like DataSourceWriter, DataWriterFactory, and DataWriter that provide transactional capabilities at the partition and job level. It also covers how the V2 API addresses partition awareness through preferred location hints to improve performance.
This document discusses exploratory data analysis (EDA) techniques that can be performed on large datasets using Spark and notebooks. It covers generating a five number summary, detecting outliers, creating histograms, and visualizing EDA results. EDA is an interactive process for understanding data distributions and relationships before modeling. Spark enables interactive EDA on large datasets using notebooks for visualizations and Pandas for local analysis.
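A short sketch of the summary and quantile pieces using built-in DataFrame functions; the data and column name are made up.

```scala
import org.apache.spark.sql.SparkSession

object EdaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("eda").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(3.0, 7.0, 8.0, 5.0, 12.0, 14.0, 21.0, 13.0, 18.0).toDF("amount")

    // Count / mean / stddev / min / max
    df.describe("amount").show()

    // Approximate five number summary (min, Q1, median, Q3, max)
    val fiveNum = df.stat.approxQuantile("amount", Array(0.0, 0.25, 0.5, 0.75, 1.0), 0.01)
    println(fiveNum.mkString(", "))

    // Simple IQR-based outlier check
    val Array(_, q1, _, q3, _) = fiveNum
    val iqr = q3 - q1
    df.filter($"amount" > q3 + 1.5 * iqr || $"amount" < q1 - 1.5 * iqr).show()

    spark.stop()
  }
}
```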
The DAGScheduler is responsible for computing the DAG of stages for a Spark job and submitting them to the TaskScheduler. The TaskScheduler then submits individual tasks from each stage for execution and works with the DAGScheduler to handle failures through task and stage retries. Together, the DAGScheduler and TaskScheduler coordinate the execution of jobs by breaking them into independent stages of parallel tasks across executor nodes.
This document discusses optimizing Spark write-heavy workloads to S3 object storage. It describes problems with eventual consistency, renames, and failures when writing to S3. It then presents several solutions implemented at Qubole to improve the performance of Spark writes to Hive tables and directly writing to the Hive warehouse location. These optimizations include parallelizing renames, writing directly to the warehouse, and making recover partitions faster by using more efficient S3 listing. Performance improvements of up to 7x were achieved.
This document provides an overview of structured streaming with Kafka in Spark. It discusses data collection vs ingestion and why they are key. It also covers Kafka architecture and terminology. It describes how Spark integrates with Kafka for streaming data sources. It explains checkpointing in structured streaming and using Kafka as a sink. The document discusses delivery semantics and how Spark supports exactly-once semantics with certain output stores. Finally, it outlines new features in Kafka for exactly-once guarantees and the future of structured streaming.
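A hedged sketch of Kafka as both source and sink in Structured Streaming. It assumes the spark-sql-kafka-0-10 connector is on the classpath; broker addresses, topic names and the checkpoint path are placeholders, and the Kafka sink itself provides at-least-once delivery.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.upper

object KafkaStructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-sss").master("local[*]").getOrCreate()
    import spark.implicits._

    // Source: subscribe to an input topic
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events-in")
      .load()

    // Kafka rows expose key/value as binary; cast to strings to work with them
    val transformed = input
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .withColumn("value", upper($"value"))

    // Sink: write back to another topic; the checkpoint tracks offsets for recovery
    val query = transformed.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "events-out")
      .option("checkpointLocation", "/tmp/checkpoints/kafka-out")
      .start()

    query.awaitTermination()
  }
}
```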
Understanding time in structured streaming - datamantra
This document discusses time abstractions in structured streaming. It introduces process time, event time, and ingestion time. It explains how to use the window API to apply windows over these different time abstractions. It also discusses handling late events using watermarks and implementing non-time based windows using custom state management and sessionization.
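A minimal event-time windowing sketch with a watermark for late events; the source, grouping column and durations are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object EventTimeWindowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("event-time").master("local[*]").getOrCreate()
    import spark.implicits._

    // The 'rate' source provides a timestamp column that stands in for event time
    val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    val counts = events
      .withWatermark("timestamp", "1 minute")                    // how late an event may arrive
      .groupBy(window($"timestamp", "30 seconds"), $"value" % 2) // 30-second event-time windows
      .count()

    val query = counts.writeStream
      .format("console")
      .outputMode("update") // emit updated counts as windows change
      .option("checkpointLocation", "/tmp/checkpoints/windows")
      .start()

    query.awaitTermination()
  }
}
```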
Spark stack for Model life-cycle management - datamantra
This document summarizes a presentation given by Samik Raychaudhuri on [24]7's use of Apache Spark for automating the lifecycle of prediction models. [24]7 builds machine learning models on large customer interaction data to predict customer intent and provide personalized experiences. Previously, models were managed using Vertica, but Spark provides faster, more scalable distributed processing. The new platform uses Spark for regular model building from HDFS data, and the trained models can be deployed on [24]7's production systems. Future work includes using Spark to train more complex models like deep learning for chatbots.
This document discusses best practices for productionalizing machine learning models built with Spark ML. It covers key stages like data preparation, model training, and operationalization. For data preparation, it recommends handling null values, missing data, and data types as custom Spark ML stages within a pipeline. For training, it suggests sampling data for testing and caching only required columns to improve efficiency. For operationalization, it discusses persisting models, validating prediction schemas, and extracting feature names from pipelines. The goal is to build robust, scalable and efficient ML workflows with Spark ML.
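A compact sketch of a Spark ML pipeline of the kind such custom preparation stages would slot into; the feature and label columns, the data and the model path are all made up.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object MlPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ml-pipeline").master("local[*]").getOrCreate()
    import spark.implicits._

    val training = Seq(
      (1.0, 0.5, 1.2, 0.0),
      (2.0, 1.5, 0.2, 1.0),
      (0.5, 0.1, 2.2, 0.0),
      (3.0, 2.5, 0.1, 1.0)
    ).toDF("f1", "f2", "f3", "label")

    // Assemble raw columns into the feature vector expected by the estimator
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))
      .setOutputCol("features")

    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(assembler, lr))
    val model = pipeline.fit(training)

    // Persist the fitted pipeline as a single unit for later serving
    model.write.overwrite().save("/tmp/models/lr_pipeline")

    model.transform(training).select("label", "prediction").show()
    spark.stop()
  }
}
```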
This document provides an introduction to Structured Streaming in Apache Spark. It discusses the evolution of stream processing, drawbacks of the DStream API, and advantages of Structured Streaming. Key points include: Structured Streaming models streams as infinite tables/datasets, allowing stream transformations to be expressed using SQL and Dataset APIs; it supports features like event time processing, state management, and checkpointing for fault tolerance; and it allows stream processing to be combined more easily with batch processing using the common Dataset abstraction. The document also provides examples of reading from and writing to different streaming sources and sinks using Structured Streaming.
Building real time Data Pipeline using Spark Streaming - datamantra
This document summarizes the key challenges and solutions in building a real-time data pipeline that ingests data from a database, transforms it using Spark Streaming, and publishes the output to Salesforce. The pipeline aims to have a latency of 1 minute with zero data loss and ordering guarantees. Some challenges discussed include handling out of sequence and late arrival events, schema evolution, bootstrap loading, data loss/corruption, and diagnosing issues. Solutions proposed use Kafka, checkpointing, replay capabilities, and careful broker/connect setups to help meet the reliability requirements for the pipeline.
Unit testing in Scala can be done using the Scalatest framework. Scalatest provides different styles like FunSuite, FlatSpec, FunSpec etc. to write unit tests. It allows sharing of fixtures between tests to reduce duplication. Asynchronous testing and mocking frameworks are also supported. When testing Spark applications, the test suite should initialize the Spark context and clean it up. Spark batch and streaming operations can be tested by asserting on DataFrames and controlling the processing time respectively.
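A minimal ScalaTest sketch along those lines: create a SparkSession, exercise a batch transformation, and tear the session down afterwards. It assumes a recent ScalaTest (AnyFunSuite; older releases use FunSuite) and a hypothetical transformation under test.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

class AdultFilterSpec extends AnyFunSuite with BeforeAndAfterAll {

  private lazy val spark =
    SparkSession.builder().appName("test").master("local[2]").getOrCreate()

  // Hypothetical transformation under test
  private def adultsOnly(df: DataFrame): DataFrame = df.filter(col("age") >= 18)

  test("adultsOnly keeps only rows with age >= 18") {
    import spark.implicits._
    val input = Seq(("a", 17), ("b", 18), ("c", 30)).toDF("name", "age")

    val result = adultsOnly(input).select("name").as[String].collect().toSet

    assert(result == Set("b", "c"))
  }

  override def afterAll(): Unit = spark.stop()
}
```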
Implicit parameters and implicit conversions are Scala language features that allow omitting explicit calls to methods or variables. Implicits enable concise and elegant code through features like dependency injection, context passing, and ad hoc polymorphism. Implicits resolve types at compile-time rather than runtime. While powerful, implicits can cause conflicts and slow compilation if overused. Frameworks like Scala collections, Spark, and Spray JSON extensively use implicits to provide type classes and conversions between Scala and Java types.
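A self-contained sketch of the two features mentioned: an implicit parameter used for context passing and an implicit class providing an ad hoc extension method (all names are made up).

```scala
object ImplicitsSketch {

  // Context that callers pass implicitly instead of threading it through every call
  case class RetryConfig(maxRetries: Int)

  def fetch(url: String)(implicit cfg: RetryConfig): String =
    s"fetching $url with up to ${cfg.maxRetries} retries"

  // Implicit class: ad hoc extension method on String
  implicit class RichString(s: String) {
    def shout: String = s.toUpperCase + "!"
  }

  def main(args: Array[String]): Unit = {
    implicit val cfg: RetryConfig = RetryConfig(maxRetries = 3)

    println(fetch("http://example.com")) // the compiler supplies cfg
    println("hello".shout)                // the compiler inserts the conversion
  }
}
```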
Spark 2.0 introduces several major changes including using Dataset as the main abstraction, replacing RDDs for optimized performance. The migration involves updating to Scala 2.11, replacing contexts with SparkSession, using built-in CSV connector, updating RDD-based code to use Dataset APIs, adding checks for cross joins, and updating custom ML transformers. Migrating leverages many of the improvements in Spark 2.0 while addressing breaking changes.
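A small before/after sketch of the SparkSession and built-in CSV pieces of such a migration; the path and options are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object MigrationSketch {
  def main(args: Array[String]): Unit = {
    // Spark 1.x style, for reference:
    //   val sc = new SparkContext(conf)
    //   val sqlContext = new SQLContext(sc)
    //   val df = sqlContext.read.format("com.databricks.spark.csv").load(path)

    // Spark 2.x: one entry point and a built-in CSV connector
    val spark = SparkSession.builder().appName("migration").master("local[*]").getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/tmp/input.csv")

    df.printSchema()

    // SparkContext is still reachable when older APIs are needed
    println(spark.sparkContext.version)

    spark.stop()
  }
}
```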
Scalable Spark deployment using Kubernetes - datamantra
The document discusses deploying Spark clusters on Kubernetes. It introduces Kubernetes as a container orchestration platform for deploying containerized applications at scale across cloud and on-prem environments. It describes building a custom Spark 2.1 Docker image and using it to deploy a Spark cluster on Kubernetes with master and worker pods, exposing the Spark UI through a service.
Introduction to concurrent programming with akka actors - datamantra
This document provides an introduction to concurrent programming with Akka Actors. It discusses concurrency and parallelism, how the end of Moore's Law necessitated a shift to concurrent programming, and introduces key concepts of actors including message passing concurrency, actor systems, actor operations like sending messages, and more advanced topics like routing, supervision, testing, configuration and remote actors.
Interactive Data Analysis in Spark Streaming - datamantra
This document discusses strategies for building interactive streaming applications in Spark Streaming. It describes using Zookeeper as a dynamic configuration source to allow modifying a Spark Streaming application's behavior at runtime. The key points are:
- Zookeeper can be used to track configuration changes and trigger Spark Streaming context restarts through its watch mechanism and Curator library.
- This allows building interactive applications that can adapt to configuration updates without needing to restart the whole streaming job.
- Examples are provided of using Curator caches like node and path caches to monitor Zookeeper for changes and restart Spark Streaming contexts in response.
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
1. Anatomy of Data Frame API
A deep dive into the Spark Data Frame API
https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
2. ● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
3. Agenda
● Spark SQL library
● Dataframe abstraction
● Pig/Hive pipeline vs Spark SQL
● Logical plan
● Optimizer
● Different steps in Query analysis
4. Spark SQL library
● Data source API
Universal API for loading/saving structured data (see the sketch below)
● DataFrame API
Higher-level representation for structured data
● SQL interpreter and optimizer
Express data transformations in SQL
● SQL service
Hive thrift server
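As a quick illustration of the data source API, here is a minimal sketch assuming Spark 1.4+ running locally; the file names people.json and people.parquet are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal sketch: load and save structured data through the data source API.
val sc = new SparkContext(new SparkConf().setAppName("datasource-example").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)

// Loading returns a DataFrame, whatever the underlying format is
val people = sqlContext.read.format("json").load("people.json")

// Saving goes through the same API; here the data is written back out as Parquet
people.write.format("parquet").save("people.parquet")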
5. Architecture of Spark SQL
[Diagram: CSV, JSON and JDBC sources feed the Data Source API, which feeds the Data Frame API; on top sit Spark SQL / HQL and the Dataframe DSL]
6. DataFrame API
● Single abstraction for representing structured data in Spark
● DataFrame = RDD + Schema (aka SchemaRDD)
● All data source APIs return DataFrame
● Introduced in 1.3
● Inspired by R and Python Pandas
● .rdd converts to the RDD representation, giving an RDD[Row] (see the sketch below)
● Support for DataFrame DSL in Spark
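Continuing the sketch above, a small illustration of the DataFrame = RDD + Schema idea and of dropping back to the RDD world:

// The schema travels with the data and is inferred from the JSON documents
people.printSchema()

// .rdd gives back the plain RDD representation as an RDD[Row]
val rows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = people.rdd
rows.take(5).foreach(println)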
7. Need for a new abstraction
● Single abstraction for structured data
○ Ability to combine data from multiple sources
○ Uniform access from all the different language APIs
○ Ability to support multiple DSLs
● Familiar interface for data scientists
○ Same API as R/Pandas
○ Easy to convert from an R local data frame to Spark
○ The new SparkR in 1.4 is built around it
8. Data structure of the structured world
● Data Frame is a data structure for representing structured data, whereas RDD is a data structure for unstructured data
● Having a single data structure allows building multiple DSLs targeting different developers
● All DSLs use the same optimizer and code generator underneath
● Compare with Hadoop Pig and Hive
9. Pig and Hive pipeline
[Diagram: two parallel pipelines]
Hive: Hive queries (HiveQL) → Hive parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Executor → Physical Plan
Pig: Pig Latin script → Pig parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Executor → Physical Plan
10. Issues with the Pig and Hive flow
● Pig and Hive share many similar steps but are independent of each other
● Each project implements its own optimizer and executor, which prevents them from benefiting from each other's work
● There is no common data structure on which both the Pig and Hive dialects can be built
● The optimizer is not flexible enough to accommodate multiple DSLs
● Lots of duplicated effort and poor interoperability
12. Spark SQL flow
● Multiple DSLs share the same optimizer and executor
● All DSLs ultimately generate DataFrames
● Catalyst is a new optimizer, built from the ground up for Spark as a rule-based framework
● Catalyst allows developers to plug in custom rules specific to their DSL
● You can plug in your own DSL too!!
13. What is a data frame?
● A data frame is a container for a logical plan
● A logical plan is a tree which represents the data and its schema
● Every transformation is represented as a tree manipulation
● These trees are manipulated and optimized by Catalyst rules
● The logical plan is converted to a physical plan for execution
14. Explain command
● The explain command on a dataframe allows us to look at these plans
● There are three types of logical plans
○ Parsed logical plan
○ Analysed logical plan
○ Optimized logical plan
● Explain also shows the physical plan
● Example: DataFrameExample.scala (see the sketch below)
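For reference, a minimal sketch of what such an example can look like (the actual DataFrameExample.scala in the repo may differ); explain(true) prints the parsed, analysed, optimized and physical plans:

// Reusing the hypothetical people dataframe from the earlier sketch
people.explain(true)   // extended = true shows the logical plans as well as the physical plan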
15. Filter example
● In the last example, all the plans looked the same as there were no dataframe operations
● In this example, we apply two filters on the data frame
● Observe the generated optimized plan
● Example: FilterExampleTree.scala (see the sketch below)
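A sketch of the two-filter example; data.json and the columns c1 and c2 are assumptions chosen to match the tree shown two slides below, and the actual FilterExampleTree.scala may differ:

// Apply two filters one after the other; the optimizer merges them into a
// single Filter node whose conditions are combined with &&.
val df = sqlContext.read.json("data.json")
val filtered = df.filter("c1 != 0").filter("c2 != 0")
filtered.explain(true)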
16. Optimized plan
● The optimized plan is where Spark plugs in its set of optimization rules
● In our example, when multiple filters are added, Spark combines them with && for better performance
● Developers can also plug their own rules into the optimizer
17. Accessing plan trees
● Every dataframe has a queryExecution object attached, which allows us to access these plans individually
● We can access the plans as follows (see the sketch below)
○ Parsed plan - queryExecution.logical
○ Analysed plan - queryExecution.analyzed
○ Optimized plan - queryExecution.optimizedPlan
● numberedTreeString on a plan lets us see the tree hierarchy
● Example: FilterExampleTree.scala
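Accessing the individual plans directly, continuing with the filtered dataframe from the previous sketch:

// Each plan hangs off the queryExecution object attached to the dataframe
val qe = filtered.queryExecution

println(qe.logical.numberedTreeString)        // parsed plan
println(qe.analyzed.numberedTreeString)       // analysed plan
println(qe.optimizedPlan.numberedTreeString)  // optimized plan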
18. Filter tree representation
Plan with the two separate filters:
00 Filter NOT (CAST(c2#1, DoubleType) = CAST(0, DoubleType))
01 Filter NOT (CAST(c1#0, DoubleType) = CAST(0, DoubleType))
02 LogicalRDD [c1#0,c2#1,c3#2,c4#3]
Optimized plan with the filters merged:
00 Filter (NOT (CAST(c1#0, DoubleType) = 0.0) && NOT (CAST(c2#1, DoubleType) = 0.0))
01 LogicalRDD [c1#0,c2#1,c3#2,c4#3]
19. Manipulating trees
● Every optimization in Spark SQL is implemented as a transformation on the logical plan tree
● A series of such transformations makes for a modular optimizer
● All tree manipulations are done using Scala case classes
● As developers, we can write these manipulations too
● Let's create an OR filter rather than an AND (see the sketch below)
● Example: OrFilter.scala
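A minimal sketch of such a tree manipulation, assuming the Spark 1.x internal Catalyst packages; the actual OrFilter.scala in the repo may differ:

import org.apache.spark.sql.catalyst.expressions.{And, Or}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Rewrite any Filter whose condition is an AND of two predicates
// into a Filter that ORs the same predicates instead.
object OrFilterRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(And(left, right), child) => Filter(Or(left, right), child)
  }
}

// Usage sketch: apply the rule to the optimized plan of the filtered dataframe
val orPlan = OrFilterRule(filtered.queryExecution.optimizedPlan)
println(orPlan.numberedTreeString)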
20. Understanding the steps in a plan
● The logical plan goes through a series of rules that resolve and optimize it
● Each step is a tree manipulation like the ones we have seen before
● We can apply the rules one at a time to see how a given plan evolves (see the sketch below)
● This understanding helps us tweak a given query for better performance
● Example: StepsInQueryPlanning.scala
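A hedged sketch of applying individual rules by hand to watch a plan evolve; the rule objects EliminateSubQueries and ConstantFolding and their packages are taken from the Spark 1.x Catalyst source and may differ across versions, and the actual StepsInQueryPlanning.scala may take a different route:

import org.apache.spark.sql.catalyst.analysis.EliminateSubQueries
import org.apache.spark.sql.catalyst.optimizer.ConstantFolding

// Start from the analysed plan and apply rules one at a time,
// printing the tree after each step to see how it changes.
val analysed = filtered.queryExecution.analyzed

val step1 = EliminateSubQueries(analysed)
println(step1.numberedTreeString)

val step2 = ConstantFolding(step1)
println(step2.numberedTreeString)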
21. Query
select a.customerId from (
  select customerId, amountPaid as amount
  from sales
  where 1 = '1') a
where amount = 500.0
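A sketch of setting this query up, assuming a hypothetical sales.json file with customerId and amountPaid columns; the plans of the resulting dataframe are what the following slides walk through:

// Register a table named "sales" and run the query above against it
val salesDf = sqlContext.read.json("sales.json")
salesDf.registerTempTable("sales")

val query = sqlContext.sql(
  """select a.customerId from (
    |  select customerId, amountPaid as amount
    |  from sales where 1 = '1') a
    |where amount = 500.0""".stripMargin)

// The step-by-step rule application shown earlier can now be pointed at
// query.queryExecution to follow the slides below
query.explain(true)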
22. Parsed plan
● This is the plan generated after parsing the DSL
● Normally these plans are generated by specific parsers such as the HiveQL parser, the Dataframe DSL parser, etc.
● They usually just recognize the different transformations and represent them as tree nodes
● It's a straightforward translation without much tweaking
● This is fed to the analyser to generate the analysed plan
24. Analyzed plan
● We use sqlContext.analyzer to access the rules that generate the analyzed plan
● These rules have to be run in sequence to resolve the different entities in the logical plan
● The different entities to be resolved are
○ Relations (aka tables)
○ References, e.g. subqueries, aliases, etc.
○ Data type casting
25. ResolveRelations rule
● This rule resolves all the relations (tables) specified in the plan
● Whenever it finds an unresolved relation, it consults the catalog, i.e. the registerTempTable list
● Once it finds the relation, it replaces the unresolved relation with the actual one
27. ResolveReferences
● This rule resolves all the references in the plan
● All aliases and column names get a unique number, which allows subsequent rules to locate them irrespective of their position
● This unique numbering allows subqueries to be removed for better optimization
29. PromoteString
● This rule allows the analyser to promote strings to the right data types
● In our query, the filter 1 = '1' compares a number with a string
● This rule inserts a cast from string to double so the comparison has the right semantics
32. Eliminate Subqueries
● This rule allows the analyser to eliminate superfluous subqueries
● This is possible because we have a unique identifier for each of the references
● Removing subqueries allows more advanced optimizations in the subsequent steps
34. Constant Folding
● Simplifies expressions which result in constant values
● In our plan, Filter(1=1) always results in true
● So constant folding replaces it with true
36. Simplify Filters
● This rule simplifies filters by
○ Removing always-true filters
○ Removing the entire plan subtree if the filter is always false
● In our query, the true filter will be removed
● By simplifying filters, we can avoid multiple iterations over the data
38. PushPredicateThroughFilter
● It's always good to have filters close to the data source for better optimization
● This rule pushes the filters down, close to the JsonRelation
● When we rearrange the tree nodes, we need to make sure the rewritten predicates match the aliases
● In our example, the filter is rewritten to use amountPaid rather than the alias amount
40. Project Collapsing
● Removes unnecessary projections from the plan
● In our plan, we don't need the second projection (customerId, amountPaid) as we only require a single projection, customerId
● So we can get rid of the second projection
● This gives us the most optimized plan
42. Generating the physical plan
● Catalyst can take a logical plan and turn it into a physical plan, or Spark plan
● On queryExecution, we have a plan called executedPlan which gives us the physical plan
● On the physical plan, we can call executeCollect or executeTake to start evaluating the plan (see the sketch below)
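A short closing sketch of reaching the physical plan and evaluating it; note that executedPlan, executeCollect and executeTake are internal APIs of the Spark 1.x era and their exact return types may vary across versions:

// The physical (Spark) plan generated for the query
val physicalPlan = query.queryExecution.executedPlan
println(physicalPlan)

// Kick off evaluation directly on the physical plan
val allRows  = physicalPlan.executeCollect()   // collects every result row
val someRows = physicalPlan.executeTake(10)    // evaluates only enough to return 10 rows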