SlideShare a Scribd company logo
Tomer Shiran
Co-Founder
@tshiran
Analytics on modern
data is incredibly hard
Unprecedented complexity
The demands for data
are growing rapidly
Increasing demands
Reporting
New products
Forecasting
Threat detection
BI
Machine
Learning
Segmenting
Fraud prevention
Your analysts are hungry for data
SQL
But your data is everywhere
And it’s not in the shape they need
Today you engineer data flows and reshaping
Data Staging
• Custon ETL
• Fragile transforms
• Slow moving
SQL
Today you engineer data flows and reshaping
Data Staging
Data Warehouse
• $$$
• High overhead
• Proprietary lock in
• Custon ETL
• Fragile transforms
• Slow moving
SQL
Today you engineer data flows and reshaping
Data Staging
Data Warehouse
Cubes, BI Extracts &
Aggregation Tables • Data sprawl
• Governance issues
• Slow to update
• $$$
• High overhead
• Proprietary lock in
• Custon ETL
• Fragile transforms
• Slow moving
SQL
+
+
+
+
+
+
+
+
+
Lots of Copies…
How can we Tackle this Age-old
Problem?
Direct access to data In-memory, GPU,
…
Columnar Distributed
Apache Arrow: Process & Move Data
Fast
• Top-level Apache project as of Feb 2016
• Collaboration among many open source projects around shared needs
• Three components:
• Language-independent columnar data structures
• Implementations available for C++, Java, Python
• Metadata for describing schemas/record batches
• Protocol for moving data between between processes without
serialization overhead
High-Performance Data Interchange
Today With Arrow
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and
deserialization
• Similar functionality implemented in multiple projects
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (eg, Parquet-
to-Arrow reader)
Data is Organized in Record Batches
Schema
Record Batch
Record Batch
Record Batch
Record Batch
Record Batch
Record Batch
Record Batch
Record Batch
Record Batch
Schema & File
Layout
Streaming Format File Format
Each Record Batch is Columnar
Intel CPU
SELECT * FROM clickstream WHERE
session_id = 1331246351
Traditional
Memory Buffer
Arrow
Memory Buffer
Arrow leverages the data parallelism
(SIMD) in modern Intel CPUs:
Example: Spark to
Pandas via Apache
Arrow
Fast Import of Arrow in Pandas & R
Credit: Wes McKinney, Two Sigma
Fast Export of Arrow in Spark
• Legacy export from Spark to Pandas (toPandas) was extremely
slow
• Row-by-row conversion from Spark driver to Python memory
• SPARK-13534 introduced an Arrow based implementation
• Wes McKinney (Two Sigma), Bryan Cutler (IBM), Li Jin (Two Sigma), and
Yin Xusen (IBM)
• Set spark.sql.execution.arrow.enable = True
Clock Time 12.5s 1.89s (6.6x)
Deserialization 88% of the time 1% of the time
Peak memory usage 8x dataset size 2x dataset size
Designing a Virtual Data
Lake Powered by Apache
Arrow
Arrow-based Distributed Execution
Persistent Columnar Cache (Parquet)
In-Memory Columnar Cache (Arrow)
Pandas
R
BI
Data Sources
(NoSQL, RDBMS, Hadoop, S3)
Arrow-based Execution and Integration
Demo
Thank You
• Apache Arrow community
• Strata organizers
• Get involved
• Subscribe to the Arrow ASF lists
• Contribute to the Arrow project
• Want to learn more about Dremio?
• tshiran@dremio.com

More Related Content

What's hot (20)

Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
Dremio Corporation
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
Dremio Corporation
 
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
Altinity Ltd
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Dremio Corporation
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache Arrow
Databricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxData
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
Databricks
 
Introduction to Dremio
Introduction to DremioIntroduction to Dremio
Introduction to Dremio
Dremio Corporation
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Vadim Y. Bichutskiy
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
lucenerevolution
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
Databricks
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Apache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In PracticeApache Arrow: In Theory, In Practice
Apache Arrow: In Theory, In Practice
Dremio Corporation
 
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
OSA Con 2022 - Apache Iceberg_ An Architectural Look Under the Covers - Alex ...
Altinity Ltd
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Dremio Corporation
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Data Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache ArrowData Science Across Data Sources with Apache Arrow
Data Science Across Data Sources with Apache Arrow
Databricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark Summit
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxData
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
Databricks
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
lucenerevolution
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
Databricks
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 

Viewers also liked (8)

Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
Julian Hyde
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
Wes McKinney
 
The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too much
Julian Hyde
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current Market
Dremio Corporation
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
Dremio Corporation
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
Julian Hyde
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
Julian Hyde
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
Julian Hyde
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
Julian Hyde
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
Wes McKinney
 
The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too much
Julian Hyde
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current Market
Dremio Corporation
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
Dremio Corporation
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
Julian Hyde
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
Julian Hyde
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
Julian Hyde
 

Similar to Building a Virtual Data Lake with Apache Arrow (20)

Apache Arrow
Apache ArrowApache Arrow
Apache Arrow
Mike Frampton
 
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
Uwe Korn
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
Julien Le Dem
 
PyCon Ireland 2022 - PyArrow full stack.pdf
PyCon Ireland 2022 - PyArrow full stack.pdfPyCon Ireland 2022 - PyArrow full stack.pdf
PyCon Ireland 2022 - PyArrow full stack.pdf
Alessandro Molina
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
 
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copyFulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Uwe Korn
 
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Uwe Korn
 
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
Databricks
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
DataWorks Summit
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
 
Speeding up PySpark with Arrow
Speeding up PySpark with ArrowSpeeding up PySpark with Arrow
Speeding up PySpark with Arrow
Rubén Berenguel
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Databricks
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
Julien Le Dem
 
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landsca...
Uwe Korn
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
Julien Le Dem
 
PyCon Ireland 2022 - PyArrow full stack.pdf
PyCon Ireland 2022 - PyArrow full stack.pdfPyCon Ireland 2022 - PyArrow full stack.pdf
PyCon Ireland 2022 - PyArrow full stack.pdf
Alessandro Molina
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
 
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copyFulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy
Uwe Korn
 
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...
Uwe Korn
 
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
Databricks
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
DataWorks Summit
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
 
Speeding up PySpark with Arrow
Speeding up PySpark with ArrowSpeeding up PySpark with Arrow
Speeding up PySpark with Arrow
Rubén Berenguel
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
 
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F...
Databricks
 
Strata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmapStrata NY 2017 Parquet Arrow roadmap
Strata NY 2017 Parquet Arrow roadmap
Julien Le Dem
 

Recently uploaded (20)

Top 5 Odoo Modules for the EPC Industry.pdf
Top 5 Odoo Modules for the EPC Industry.pdfTop 5 Odoo Modules for the EPC Industry.pdf
Top 5 Odoo Modules for the EPC Industry.pdf
SatishKumar2651
 
BoxLang-Dynamic-AWS-Lambda by Luis Majano.pdf
BoxLang-Dynamic-AWS-Lambda by Luis Majano.pdfBoxLang-Dynamic-AWS-Lambda by Luis Majano.pdf
BoxLang-Dynamic-AWS-Lambda by Luis Majano.pdf
Ortus Solutions, Corp
 
Climate-Smart Agriculture Development Solution.pptx
Climate-Smart Agriculture Development Solution.pptxClimate-Smart Agriculture Development Solution.pptx
Climate-Smart Agriculture Development Solution.pptx
julia smits
 
Salesforce Experience Cloud Consulting.pdf
Salesforce Experience Cloud Consulting.pdfSalesforce Experience Cloud Consulting.pdf
Salesforce Experience Cloud Consulting.pdf
VALiNTRY360
 
TUG Brazil - VizQL Data Service - Nik Dutra.pdf
TUG Brazil - VizQL Data Service - Nik Dutra.pdfTUG Brazil - VizQL Data Service - Nik Dutra.pdf
TUG Brazil - VizQL Data Service - Nik Dutra.pdf
Ligia Galvão
 
Scaling up your Snapshot tests, without the friction
Scaling up your Snapshot tests, without the frictionScaling up your Snapshot tests, without the friction
Scaling up your Snapshot tests, without the friction
arnold844201
 
Top 10 Mobile Banking Apps in the USA.pdf
Top 10 Mobile Banking Apps in the USA.pdfTop 10 Mobile Banking Apps in the USA.pdf
Top 10 Mobile Banking Apps in the USA.pdf
LL Technolab
 
Chapter_02.pdf Software process Models.pdf
Chapter_02.pdf Software process Models.pdfChapter_02.pdf Software process Models.pdf
Chapter_02.pdf Software process Models.pdf
MaheenVohra
 
20200823-Intro-to-FIRRTLllllllllllllllllll
20200823-Intro-to-FIRRTLllllllllllllllllll20200823-Intro-to-FIRRTLllllllllllllllllll
20200823-Intro-to-FIRRTLllllllllllllllllll
JonathanSong28
 
Introduction to QM, QA, QC, Bug's priority and severity
Introduction to QM, QA, QC, Bug's priority and severityIntroduction to QM, QA, QC, Bug's priority and severity
Introduction to QM, QA, QC, Bug's priority and severity
Arshad QA
 
Portland Marketo User Group: MOPs & AI - Jeff Canada - May 2025
Portland Marketo User Group: MOPs & AI - Jeff Canada - May 2025Portland Marketo User Group: MOPs & AI - Jeff Canada - May 2025
Portland Marketo User Group: MOPs & AI - Jeff Canada - May 2025
BradBedford3
 
ICDL FULL STANDARD 2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdf
ICDL FULL STANDARD  2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdfICDL FULL STANDARD  2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdf
ICDL FULL STANDARD 2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdf
M. Luisetto Pharm.D.Spec. Pharmacology
 
AI Ethics: Integrating Transparency, Fairness, and Privacy in AI Development
AI Ethics: Integrating Transparency, Fairness, and Privacy in AI DevelopmentAI Ethics: Integrating Transparency, Fairness, and Privacy in AI Development
AI Ethics: Integrating Transparency, Fairness, and Privacy in AI Development
Petar Radanliev
 
Improving Engagement with CRM Software for Instance Agents
Improving Engagement with CRM Software for Instance AgentsImproving Engagement with CRM Software for Instance Agents
Improving Engagement with CRM Software for Instance Agents
Insurance Tech Services
 
List Unfolding - 'unfold' as the Computational Dual of 'fold', and how 'unfol...
List Unfolding - 'unfold' as the Computational Dual of 'fold', and how 'unfol...List Unfolding - 'unfold' as the Computational Dual of 'fold', and how 'unfol...
List Unfolding - 'unfold' as the Computational Dual of 'fold', and how 'unfol...
Philip Schwarz
 
Facility Management Solution - TeroTAM CMMS Software
Facility Management Solution - TeroTAM CMMS SoftwareFacility Management Solution - TeroTAM CMMS Software
Facility Management Solution - TeroTAM CMMS Software
TeroTAM
 
How to Create a White Label Crypto Exchange.pdf
How to Create a White Label Crypto Exchange.pdfHow to Create a White Label Crypto Exchange.pdf
How to Create a White Label Crypto Exchange.pdf
zak jasper
 
German Marketo User Group - May 2025 survey results
German Marketo User Group - May 2025 survey resultsGerman Marketo User Group - May 2025 survey results
German Marketo User Group - May 2025 survey results
BradBedford3
 
Menu in Android (Define,Create,Inflate and Click Handler)
Menu in Android (Define,Create,Inflate and Click Handler)Menu in Android (Define,Create,Inflate and Click Handler)
Menu in Android (Define,Create,Inflate and Click Handler)
Nabin Dhakal
 
Why-Choose-an-Authorised-Microsoft-Reseller.pptx
Why-Choose-an-Authorised-Microsoft-Reseller.pptxWhy-Choose-an-Authorised-Microsoft-Reseller.pptx
Why-Choose-an-Authorised-Microsoft-Reseller.pptx
Michael cole
 
Top 5 Odoo Modules for the EPC Industry.pdf
Top 5 Odoo Modules for the EPC Industry.pdfTop 5 Odoo Modules for the EPC Industry.pdf
Top 5 Odoo Modules for the EPC Industry.pdf
SatishKumar2651
 
BoxLang-Dynamic-AWS-Lambda by Luis Majano.pdf
BoxLang-Dynamic-AWS-Lambda by Luis Majano.pdfBoxLang-Dynamic-AWS-Lambda by Luis Majano.pdf
BoxLang-Dynamic-AWS-Lambda by Luis Majano.pdf
Ortus Solutions, Corp
 
Climate-Smart Agriculture Development Solution.pptx
Climate-Smart Agriculture Development Solution.pptxClimate-Smart Agriculture Development Solution.pptx
Climate-Smart Agriculture Development Solution.pptx
julia smits
 
Salesforce Experience Cloud Consulting.pdf
Salesforce Experience Cloud Consulting.pdfSalesforce Experience Cloud Consulting.pdf
Salesforce Experience Cloud Consulting.pdf
VALiNTRY360
 
TUG Brazil - VizQL Data Service - Nik Dutra.pdf
TUG Brazil - VizQL Data Service - Nik Dutra.pdfTUG Brazil - VizQL Data Service - Nik Dutra.pdf
TUG Brazil - VizQL Data Service - Nik Dutra.pdf
Ligia Galvão
 
Scaling up your Snapshot tests, without the friction
Scaling up your Snapshot tests, without the frictionScaling up your Snapshot tests, without the friction
Scaling up your Snapshot tests, without the friction
arnold844201
 
Top 10 Mobile Banking Apps in the USA.pdf
Top 10 Mobile Banking Apps in the USA.pdfTop 10 Mobile Banking Apps in the USA.pdf
Top 10 Mobile Banking Apps in the USA.pdf
LL Technolab
 
Chapter_02.pdf Software process Models.pdf
Chapter_02.pdf Software process Models.pdfChapter_02.pdf Software process Models.pdf
Chapter_02.pdf Software process Models.pdf
MaheenVohra
 
20200823-Intro-to-FIRRTLllllllllllllllllll
20200823-Intro-to-FIRRTLllllllllllllllllll20200823-Intro-to-FIRRTLllllllllllllllllll
20200823-Intro-to-FIRRTLllllllllllllllllll
JonathanSong28
 
Introduction to QM, QA, QC, Bug's priority and severity
Introduction to QM, QA, QC, Bug's priority and severityIntroduction to QM, QA, QC, Bug's priority and severity
Introduction to QM, QA, QC, Bug's priority and severity
Arshad QA
 
Portland Marketo User Group: MOPs & AI - Jeff Canada - May 2025
Portland Marketo User Group: MOPs & AI - Jeff Canada - May 2025Portland Marketo User Group: MOPs & AI - Jeff Canada - May 2025
Portland Marketo User Group: MOPs & AI - Jeff Canada - May 2025
BradBedford3
 
ICDL FULL STANDARD 2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdf
ICDL FULL STANDARD  2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdfICDL FULL STANDARD  2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdf
ICDL FULL STANDARD 2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdf
M. Luisetto Pharm.D.Spec. Pharmacology
 
AI Ethics: Integrating Transparency, Fairness, and Privacy in AI Development
AI Ethics: Integrating Transparency, Fairness, and Privacy in AI DevelopmentAI Ethics: Integrating Transparency, Fairness, and Privacy in AI Development
AI Ethics: Integrating Transparency, Fairness, and Privacy in AI Development
Petar Radanliev
 
Improving Engagement with CRM Software for Instance Agents
Improving Engagement with CRM Software for Instance AgentsImproving Engagement with CRM Software for Instance Agents
Improving Engagement with CRM Software for Instance Agents
Insurance Tech Services
 
List Unfolding - 'unfold' as the Computational Dual of 'fold', and how 'unfol...
List Unfolding - 'unfold' as the Computational Dual of 'fold', and how 'unfol...List Unfolding - 'unfold' as the Computational Dual of 'fold', and how 'unfol...
List Unfolding - 'unfold' as the Computational Dual of 'fold', and how 'unfol...
Philip Schwarz
 
Facility Management Solution - TeroTAM CMMS Software
Facility Management Solution - TeroTAM CMMS SoftwareFacility Management Solution - TeroTAM CMMS Software
Facility Management Solution - TeroTAM CMMS Software
TeroTAM
 
How to Create a White Label Crypto Exchange.pdf
How to Create a White Label Crypto Exchange.pdfHow to Create a White Label Crypto Exchange.pdf
How to Create a White Label Crypto Exchange.pdf
zak jasper
 
German Marketo User Group - May 2025 survey results
German Marketo User Group - May 2025 survey resultsGerman Marketo User Group - May 2025 survey results
German Marketo User Group - May 2025 survey results
BradBedford3
 
Menu in Android (Define,Create,Inflate and Click Handler)
Menu in Android (Define,Create,Inflate and Click Handler)Menu in Android (Define,Create,Inflate and Click Handler)
Menu in Android (Define,Create,Inflate and Click Handler)
Nabin Dhakal
 
Why-Choose-an-Authorised-Microsoft-Reseller.pptx
Why-Choose-an-Authorised-Microsoft-Reseller.pptxWhy-Choose-an-Authorised-Microsoft-Reseller.pptx
Why-Choose-an-Authorised-Microsoft-Reseller.pptx
Michael cole
 

Building a Virtual Data Lake with Apache Arrow

  • 2. Analytics on modern data is incredibly hard Unprecedented complexity
  • 3. The demands for data are growing rapidly Increasing demands Reporting New products Forecasting Threat detection BI Machine Learning Segmenting Fraud prevention
  • 4. Your analysts are hungry for data SQL But your data is everywhere And it’s not in the shape they need
  • 5. Today you engineer data flows and reshaping Data Staging • Custon ETL • Fragile transforms • Slow moving SQL
  • 6. Today you engineer data flows and reshaping Data Staging Data Warehouse • $$$ • High overhead • Proprietary lock in • Custon ETL • Fragile transforms • Slow moving SQL
  • 7. Today you engineer data flows and reshaping Data Staging Data Warehouse Cubes, BI Extracts & Aggregation Tables • Data sprawl • Governance issues • Slow to update • $$$ • High overhead • Proprietary lock in • Custon ETL • Fragile transforms • Slow moving SQL + + + + + + + + +
  • 9. How can we Tackle this Age-old Problem? Direct access to data In-memory, GPU, … Columnar Distributed
  • 10. Apache Arrow: Process & Move Data Fast • Top-level Apache project as of Feb 2016 • Collaboration among many open source projects around shared needs • Three components: • Language-independent columnar data structures • Implementations available for C++, Java, Python • Metadata for describing schemas/record batches • Protocol for moving data between between processes without serialization overhead
  • 11. High-Performance Data Interchange Today With Arrow • Each system has its own internal memory format • 70-80% CPU wasted on serialization and deserialization • Similar functionality implemented in multiple projects • All systems utilize the same memory format • No overhead for cross-system communication • Projects can share functionality (eg, Parquet- to-Arrow reader)
  • 12. Data is Organized in Record Batches Schema Record Batch Record Batch Record Batch Record Batch Record Batch Record Batch Record Batch Record Batch Record Batch Schema & File Layout Streaming Format File Format
  • 13. Each Record Batch is Columnar Intel CPU SELECT * FROM clickstream WHERE session_id = 1331246351 Traditional Memory Buffer Arrow Memory Buffer Arrow leverages the data parallelism (SIMD) in modern Intel CPUs:
  • 14. Example: Spark to Pandas via Apache Arrow
  • 15. Fast Import of Arrow in Pandas & R Credit: Wes McKinney, Two Sigma
  • 16. Fast Export of Arrow in Spark • Legacy export from Spark to Pandas (toPandas) was extremely slow • Row-by-row conversion from Spark driver to Python memory • SPARK-13534 introduced an Arrow based implementation • Wes McKinney (Two Sigma), Bryan Cutler (IBM), Li Jin (Two Sigma), and Yin Xusen (IBM) • Set spark.sql.execution.arrow.enable = True Clock Time 12.5s 1.89s (6.6x) Deserialization 88% of the time 1% of the time Peak memory usage 8x dataset size 2x dataset size
  • 17. Designing a Virtual Data Lake Powered by Apache Arrow
  • 18. Arrow-based Distributed Execution Persistent Columnar Cache (Parquet) In-Memory Columnar Cache (Arrow) Pandas R BI Data Sources (NoSQL, RDBMS, Hadoop, S3) Arrow-based Execution and Integration
  • 19. Demo
  • 20. Thank You • Apache Arrow community • Strata organizers • Get involved • Subscribe to the Arrow ASF lists • Contribute to the Arrow project • Want to learn more about Dremio? • [email protected]

Editor's Notes

  • #3: BI assumes single relational database, but… Data in non-relational technologies Data fragmented across many systems Massive scale and velocity
  • #4: Data is the business, and… Era of impatient smartphone natives Rise of self-service BI Accelerating time to market Because of the complexity of modern data and increasing demands for data, IT gets crushed in the middle: Slow or non-responsive IT “Shadow Analytics” Data governance risk Illusive data engineers Immature software Competing strategic initiatives
  • #5: Here’s the problem everyone is trying to solve today. You have consumers of data with their favorite tools. BI products like Tableau, PowerBI, Qlik, as well as data science tools like Python, R, Spark, and SQL. Then you have all your data, in a mix of relational, NoSQL, Hadoop, and cloud like S3. So how are you going to get the data to the people asking for it?
  • #6: Here’s how everyone tries to solve it: First you move the data out of the operational systems into a staging area, that might be Hadoop, or one of the cloud file systems like S3 or Azure Blob Store. You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they’re fragile – when the sources change, the scripts have to change too.
  • #7: Here’s how everyone tries to solve it: First you move the data out of the operational systems into a staging area, that might be Hadoop, or one of the cloud file systems like S3 or Azure Blob Store. You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they’re fragile – when the sources change, the scripts have to change too. Then you move the data into a data warehouse. This could be Redshift, Teradata, Vertica, or other products. These are all proprietary, and they take DBA experts to make them work. And to move the data here you write another set of scripts. But what we see with many customers is that the performance here isn’t sufficient for their needs, and so …
  • #8: Here’s how everyone tries to solve it: First you move the data out of the operational systems into a staging area, that might be Hadoop, or one of the cloud file systems like S3 or Azure Blob Store. You write a bunch of ETL scripts to move the data. These are expensive to write and maintain, and they’re fragile – when the sources change, the scripts have to change too. Then you move the data into a data warehouse. This could be Redshift, Teradata, Vertica, or other products. These are all proprietary, and they take DBA experts to make them work. And to move the data here you write another set of scripts. But what we see with many customers is that the performance here isn’t sufficient for their needs, and so … You build cubes and aggregation tables to get the performance your users are asking for. And to do this you build another set of scripts. In the end you’re left with something like this picture. You may have more layers, the technologies may be different, but you’re probably living with something like this. And nobody likes this – it’s expensive, the data movement is slow, it’s hard to change. But worst of all, you’re left with a dynamic where every time a consumer of the data wants a new piece of data: They open a ticket with IT IT begins an engineering project to build another set of pipelines, over several weeks or months