Anatomy of Data Frame API
A deep dive into the Spark Data Frame API
https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
Agenda
● Spark SQL library
● Dataframe abstraction
● Pig/Hive pipeline vs SparkSQL
● Logical plan
● Optimizer
● Different steps in Query analysis
Spark SQL library
● Data source API
Universal API for loading/saving structured data
● DataFrame API
Higher level representation for structured data
● SQL interpreter and optimizer
Express data transformation in SQL
● SQL service
Hive thrift server
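As a quick illustration of these four pieces together, here is a minimal sketch, assuming a Spark 1.4-era SQLContext and a hypothetical sales.json file with customerId and amountPaid columns:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("spark-sql-library").setMaster("local[*]"))
  val sqlContext = new SQLContext(sc)

  // Data source API: load structured data through a uniform interface
  val salesDF = sqlContext.read.json("sales.json")

  // DataFrame API: higher level representation with a schema
  salesDF.printSchema()

  // SQL interpreter and optimizer: express the same transformation in SQL
  salesDF.registerTempTable("sales")
  sqlContext.sql("select customerId from sales where amountPaid = 500.0").show()

  // Data source API again: save in a different format
  salesDF.write.parquet("sales_parquet")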
Architecture of Spark SQL
[Layered stack, bottom to top: data sources (CSV, JSON, JDBC) → Data Source API → Data Frame API → Dataframe DSL and Spark SQL/HQL]
DataFrame API
● Single abstraction for representing structured data in Spark
● DataFrame = RDD + Schema (aka SchemaRDD)
● All data source APIs return DataFrame
● Introduced in 1.3
● Inspired by R and Python pandas
● .rdd converts to the RDD representation, resulting in RDD[Row]
● Support for DataFrame DSL in Spark
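A minimal sketch of the .rdd escape hatch, reusing the hypothetical salesDF from the sketch above:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.Row

  // DataFrame = RDD + Schema: dropping down gives back an RDD[Row]
  val rowRDD: RDD[Row] = salesDF.rdd
  rowRDD.take(2).foreach(println)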
Need for new abstraction
● Single abstraction for structured data
○ Ability to combine data from multiple sources
○ Uniform access from all the different language APIs
○ Ability to support multiple DSLs
● Familiar interface for data scientists
○ Same API as R/pandas
○ Easy to convert from an R local data frame to Spark
○ The new 1.4 SparkR is built around it
Data Structure of structured world
● Data Frame is a data structure to represent structured data, whereas RDD is a data structure for unstructured data
● Having a single data structure allows us to build multiple DSLs targeting different developers
● All DSLs use the same optimizer and code generator underneath
● Compare with Hadoop Pig and Hive
Pig and Hive pipeline
Hive: Hive queries (HiveQL) → Hive parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Executor → Physical Plan

Pig: Pig latin script → Pig parser → Logical Plan → Optimizer → Optimized Logical Plan (M/R plan) → Executor → Physical Plan
Issue with Pig and Hive flow
● Pig and Hive share many similar steps but are independent of each other
● Each project implements its own optimizer and executor, which prevents benefiting from each other's work
● There is no common data structure on which we can build both Pig and Hive dialects
● The optimizer is not flexible enough to accommodate multiple DSLs
● Lots of duplicate effort and poor interoperability
Spark SQL pipeline
Hive queries (HiveQL) → Hive parser → DataFrame
Spark SQL queries (SparkQL) → SparkSQL Parser → DataFrame
Dataframe DSL → DataFrame
DataFrame → Catalyst → Spark RDD code
Spark SQL flow
● Multiple DSLs share the same optimizer and executor
● All DSLs ultimately generate DataFrames
● Catalyst is a new optimizer built from the ground up for Spark as a rule-based framework
● Catalyst allows developers to plug in custom rules specific to their DSL
● You can plug in your own DSL too!!
What is a data frame?
● A data frame is a container for a Logical Plan
● A Logical Plan is a tree which represents data and schema
● Every transformation is represented as a tree manipulation
● These trees are manipulated and optimized by Catalyst rules
● The logical plan is converted to a physical plan for execution
Explain Command
● The explain command on a dataframe allows us to look at these plans
● There are three types of logical plans
○ Parsed logical plan
○ Analysed logical plan
○ Optimized logical plan
● Explain also shows the physical plan
● DataFrameExample.scala
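A minimal sketch of what such an example might look like, reusing the hypothetical salesDF; explain with extended = true prints all the plans:

  // explain() alone prints only the physical plan;
  // explain(true) prints parsed, analysed, optimized and physical plans
  salesDF.explain(true)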
Filter example
● In the last example, all plans looked the same as there were no dataframe operations
● In this example, we are going to apply two filters on the data frame
● Observe the generated optimized plan
● Example : FilterExampleTree.scala
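A sketch in the spirit of this example, assuming a hypothetical DataFrame df with columns c1..c4 as in the tree representation shown later:

  // Two separate filters; the optimizer will later combine them into one
  val filtered = df.filter("c1 != 0").filter("c2 != 0")
  filtered.explain(true)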
Optimized Plan
● The optimized plan is where Spark normally plugs in its set of optimization rules
● In our example, when multiple filters are added, Spark combines them with && (logical AND) for better performance
● Developers can even plug their own rules into the optimizer
Accessing Plan trees
● Every dataframe has an attached queryExecution object which allows us to access these plans individually
● We can access the plans as follows
○ Parsed plan - queryExecution.logical
○ Analysed plan - queryExecution.analyzed
○ Optimized plan - queryExecution.optimizedPlan
● numberedTreeString on a plan allows us to see the hierarchy
● Example : FilterExampleTree.scala
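A sketch of accessing these plans on the two-filter example above:

  val qe = filtered.queryExecution
  println(qe.logical.numberedTreeString)        // parsed plan
  println(qe.analyzed.numberedTreeString)       // analysed plan
  println(qe.optimizedPlan.numberedTreeString)  // optimized plan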
Filter tree representation
Before optimization (two separate filters, numberedTreeString, root first):
00 Filter NOT (CAST(c2#0, DoubleType) = CAST(0, DoubleType))
01 Filter NOT (CAST(c1#0, DoubleType) = CAST(0, DoubleType))
02 LogicalRDD [c1#0,c2#1,c3#2,c4#3]

After optimization (filters combined with &&):
Filter (NOT (CAST(c1#0, DoubleType) = 0.0) && NOT (CAST(c2#1, DoubleType) = 0.0))
LogicalRDD [c1#0,c2#1,c3#2,c4#3]
Manipulating Trees
● Every optimization in spark-sql is implemented as a tree transformation
● A series of these transformations allows for a modular optimizer
● All tree manipulations are done using Scala case classes
● As developers we can write these manipulations too
● Let's create an OR filter rather than an AND
● OrFilter.scala
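A minimal sketch in the spirit of this example, assuming Spark 1.x Catalyst internals; Filter, And and Or are case classes, so a custom rule is just a pattern match over the plan tree:

  import org.apache.spark.sql.catalyst.expressions.{And, Or}
  import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
  import org.apache.spark.sql.catalyst.rules.Rule

  // Rewrites Filter(a && b) into Filter(a || b) by manipulating the tree
  object OrFilterRule extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
      case Filter(And(left, right), child) => Filter(Or(left, right), child)
    }
  }

  // Apply the rule to the optimized plan of the two-filter example
  println(OrFilterRule(filtered.queryExecution.optimizedPlan).numberedTreeString)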
Understanding steps in plan
● The logical plan goes through a series of rules to resolve and optimize the plan
● Each step is a tree manipulation as we have seen before
● We can apply the rules one by one to see how a given plan evolves over time
● This understanding allows us to see how to tweak a given query for better performance
● Ex : StepsInQueryPlanning.scala
Query
select a.customerId from (
  select customerId, amountPaid as amount
  from sales where 1 = '1') a
where amount = 500.0
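A sketch of running this query, assuming the sales table registered from the hypothetical sales.json earlier:

  val query = sqlContext.sql(
    """select a.customerId from (
      |select customerId, amountPaid as amount
      |from sales where 1 = '1') a
      |where amount = 500.0""".stripMargin)

  // The plan snapshots on the following slides come from stepping
  // through query.queryExecution for this statement
  query.explain(true)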
Parsed Plan
● This is the plan generated after parsing the DSL
● Normally these plans are generated by the specific parsers, like the HiveQL parser, Dataframe DSL parser etc.
● Usually they recognize the different transformations and represent them as tree nodes
● It's a straightforward translation without much tweaking
● This is fed to the analyser to generate the analysed plan
Parsed Logical Plan
'Project a.customerId
  'Filter (amount = 500)
    'SubQuery a
      'Projection 'customerId,'amountPaid
        'Filter (1 = 1)
          UnResolvedRelation Sales
Analyzed plan
● We use sqlContext.analyzer to access the rules that generate the analyzed plan
● These rules have to be run in sequence to resolve the different entities in the logical plan
● The different entities to be resolved are
○ Relations (aka tables)
○ References, e.g. subqueries, aliases etc.
○ Data type casting
ResolveRelations Rule
● This rule resolves all the relations (tables) specified in the plan
● Whenever it finds a new unresolved relation, it consults the catalog (aka the registerTempTable list)
● Once it finds the relation, it replaces the unresolved relation with the actual one
Resolved Relation Logical Plan
Before (parsed plan):
'Project a.customerId
  'Filter (amount = 500)
    'SubQuery a
      'Projection 'customerId,'amountPaid
        'Filter (1 = 1)
          UnResolvedRelation Sales

After (relation resolved):
'Project a.customerId
  'Filter (amount = 500)
    'SubQuery a
      'Projection 'customerId,'amountPaid
        Filter (1 = 1)
          SubQuery - sales
            JsonRelation Sales[amountPaid..]
ResolveReferences
● This rule resolves all the references in the plan
● All aliases and column names get a unique number, which allows the analyser to locate them irrespective of their position
● This unique numbering allows subqueries to be removed for better optimization
Resolved References Plan
Before (relations resolved):
'Project a.customerId
  'Filter (amount = 500)
    'SubQuery a
      'Projection 'customerId,'amountPaid
        'Filter (1 = 1)
          SubQuery - sales
            JsonRelation Sales[amountPaid..]

After (references resolved):
Project customerId#1L
  Filter (amount#4 = 500)
    SubQuery a
      Projection customerId#1L,amountPaid#0
        'Filter (1 = 1)
          SubQuery - sales
            JsonRelation Sales[amountPaid#0..]
PromoteString
● This rule allows the analyser to promote strings to the right data types
● In our query's filter, 1 = '1', we are comparing a double with a string
● This rule puts in a cast from string to double to get the right semantics
Promote String Plan
Before:
Project customerId#1L
  Filter (amount#4 = 500)
    SubQuery a
      Projection customerId#1L,amountPaid#0
        'Filter (1 = 1)
          SubQuery - sales
            JsonRelation Sales[amountPaid#0..]

After (cast inserted):
Project customerId#1L
  Filter (amount#4 = 500)
    SubQuery a
      Projection customerId#1L,amountPaid#0
        'Filter (1 = CAST(1, DoubleType))
          SubQuery - sales
            JsonRelation Sales[amountPaid#0..]
Optimize
Eliminate Subqueries
● This rule allows the analyser to eliminate superfluous subqueries
● This is possible as we have a unique identifier for each of the references
● Removal of subqueries allows us to do advanced optimization in subsequent steps
Eliminate subqueries
Before:
Project customerId#1L
  Filter (amount#4 = 500)
    SubQuery a
      Projection customerId#1L,amountPaid#0
        'Filter (1 = CAST(1, DoubleType))
          SubQuery - sales
            JsonRelation Sales[amountPaid#0..]

After (subqueries removed):
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L,amountPaid#0
      'Filter (1 = CAST(1, DoubleType))
        JsonRelation Sales[amountPaid#0..]
Constant Folding
● Simplifies expressions which result in constant values
● In our plan, the filter (1 = 1) always results in true
● So constant folding replaces it with true
Constant Folding Plan
Before:
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L,amountPaid#0
      'Filter (1 = CAST(1, DoubleType))
        JsonRelation Sales[amountPaid#0..]

After (constant folded):
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L,amountPaid#0
      'Filter True
        JsonRelation Sales[amountPaid#0..]
Simplify Filters
● This rule simplifies filters by
○ Removing always-true filters
○ Removing the entire plan subtree if the filter is always false
● In our query, the true filter will be removed
● By simplifying filters, we can avoid multiple iterations over the data
Simplify Filter Plan
Before:
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L,amountPaid#0
      'Filter True
        JsonRelation Sales[amountPaid#0..]

After (true filter removed):
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L,amountPaid#0
      JsonRelation Sales[amountPaid#0..]
PushPredicateThroughFilter
● It's always good to have filters near the data source for better optimization
● This rule pushes the filters nearer to the JsonRelation
● When we rearrange the tree nodes, we need to make sure we rewrite the filter to match the aliases
● In our example, the filter is rewritten to use the alias amountPaid rather than amount
PushPredicateThroughFilter Plan
Before:
Project customerId#1L
  Filter (amount#4 = 500)
    Projection customerId#1L,amountPaid#0
      JsonRelation Sales[amountPaid#0..]

After (predicate pushed down):
Project customerId#1L
  Projection customerId#1L,amountPaid#0
    Filter (amountPaid#0 = 500)
      JsonRelation Sales[amountPaid#0..]
Project Collapsing
● Removes unnecessary projections from the plan
● In our plan, we don't need the second projection (customerId, amountPaid) as we only require one projection, i.e. customerId
● So we can get rid of the second projection
● This gives us the most optimized plan
Project Collapsing Plan
Before:
Project customerId#1L
  Projection customerId#1L,amountPaid#0
    Filter (amountPaid#0 = 500)
      JsonRelation Sales[amountPaid#0..]

After (projections collapsed):
Project customerId#1L
  Filter (amountPaid#0 = 500)
    JsonRelation Sales[amountPaid#0..]
Generating Physical Plan
● Catalyst can take a logical plan and turn it into a physical plan, or Spark plan
● On queryExecution, we have a plan called executedPlan which gives us the physical plan
● On the physical plan, we can call executeCollect or executeTake to start evaluating the plan
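A sketch of evaluating the physical plan directly, under the same Spark 1.x assumptions:

  val physicalPlan = query.queryExecution.executedPlan
  println(physicalPlan.numberedTreeString)

  // Bypass the usual collect()/take() and evaluate the plan directly
  val allRows = physicalPlan.executeCollect()   // Array[Row] in Spark 1.x
  val someRows = physicalPlan.executeTake(5)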
References
● https://www.youtube.com/watch?v=GQSNJAzxOr8
● https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
● http://spark.apache.org/sql/