25127
The Data Lake Engine
Spark + AI Summit 2020
Data Science Across Data Sources with Apache Arrow
Dremio is the Data Lake Engine Company
Tomer Shiran
Co-Founder & CPO, Dremio
tomer@dremio.com
Powering the cloud data lakes of the world’s leading companies across all industries
Creators of
Over $100M raised
Background
Your Data Lake is Exploding, Yet Your Data Remains Inaccessible
Data Lakes are becoming the primary place that data lands
>100% YoY S3 Data Growth (1)
>50% of Data Will Live on Cloud Data Lake Storage by 2025 (2)
But… consuming the data is too slow & too difficult
[Diagram: SQL / Data Consumers blocked (X) from S3 or ADLS]
1) Estimate based on historical growth https://aws.amazon.com/blogs/aws/amazon-s3-growth-for-2011-now-762-billion-objects/
2) Estimate based on trends around cloud migration plus growth in semi-structured and unstructured data
Data Movement is the Typical Workaround for Data Lake Storage
Data Lake Storage (ADLS, S3)
1) Brittle & complex ETL/ELT
2) Proprietary & expensive DW/Data Marts
3) Proliferating Cubes, BI Extracts, & Aggregation Tables
BI Users, SQL, Data Scientists
Decreasing Data Scope & Flexibility
Query data lake storage directly with 4-100X performance
Powered by .
What is Apache Arrow?
Columnar In-Memory Representation
Many Language Bindings
Broad Industry Adoption
Row-based Column-based
10+ Downloads per Month
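The row-based versus column-based contrast can be illustrated with a minimal pure-Python sketch. It does not use Arrow itself, and the table contents and the names `rows`, `columns`, and `to_columnar` are made up for illustration. The point is only that a column-oriented layout keeps all values of one field together, so an aggregate over one field touches one sequence instead of every record.

```python
# Hypothetical three-row table; the values are illustrative only.
rows = [
    {"id": 1, "city": "Oslo", "amount": 10.0},
    {"id": 2, "city": "Lima", "amount": 12.5},
    {"id": 3, "city": "Pune", "amount": 7.5},
]


def to_columnar(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


columns = to_columnar(rows)

# Row-based: every record is visited even though only one field is needed.
total_from_rows = sum(record["amount"] for record in rows)

# Column-based: the values of one field already sit together in one sequence.
total_from_columns = sum(columns["amount"])

print(columns["amount"])
print(total_from_rows == total_from_columns)
```

Arrow's actual in-memory format uses typed, contiguous buffers rather than Python lists; the dict of lists here only stands in for that idea.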
Apache Arrow Gandiva Improves CPU Efficiency
✓ A standalone C++ library for efficient evaluation of arbitrary SQL expressions on Arrow vectors using runtime code-generation in LLVM
✓ Expressions are compiled to LLVM bytecode (IR), optimized & translated to machine code
✓ Gandiva enables vectorized execution with Intel SIMD instructions
[Diagram: SQL expression -> Gandiva compiler (IR Builder, Optimize, pre-compiled functions (.bc)) -> vectorized execution kernel; input Arrow buffer -> kernel -> output Arrow buffer]
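The compile-once, run-over-whole-columns shape described above can be sketched in plain Python. This is not Gandiva and involves no LLVM or SIMD; it only mirrors the flow of turning an expression into generated code once and then applying the resulting kernel to input column vectors. The names `build_kernel` and `kernel`, and the sample expression, are hypothetical.

```python
def build_kernel(expression, column_names):
    """Generate a function for `expression` once, then reuse it per batch."""
    # Generate source code for the expression (the analogue of building IR).
    source = "lambda {}: {}".format(", ".join(column_names), expression)
    # Compile the generated source a single time, with no builtins exposed.
    row_fn = eval(compile(source, "<generated>", "eval"), {"__builtins__": {}})

    def kernel(*input_columns):
        # Apply the compiled function across entire input columns.
        return [row_fn(*values) for values in zip(*input_columns)]

    return kernel


# Compile "a + b * 2" once, then run it over two input columns.
kernel = build_kernel("a + b * 2", ["a", "b"])
output = kernel([1, 2, 3], [10, 20, 30])
print(output)
```

The saving in this shape comes from paying the parse and compile cost once per expression rather than once per row; Gandiva additionally emits optimized machine code, which this sketch does not attempt.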
4.5x-90x Faster than Java-based Code Generation
Test                  Project time (secs)   Project time (secs)   Improvement
                      with Java JIT         with Gandiva LLVM
Sum                   3.805                 0.558                 6.8x
Project 5 columns     8.681                 1.689                 5.13x
Project 10 columns    24.923                3.476                 7.74x
CASE-10               4.308                 0.925                 4.66x
CASE-100              1361                  15.187                89.6x
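As a quick check on the figures, the improvement column can be recomputed under the assumption that it is the Java JIT time divided by the Gandiva LLVM time (the slide does not state the formula). The timings below are copied from the table; the names `timings` and `speedups` are illustrative.

```python
# (Java JIT seconds, Gandiva LLVM seconds), copied from the table.
timings = {
    "Sum": (3.805, 0.558),
    "Project 5 columns": (8.681, 1.689),
    "Project 10 columns": (24.923, 3.476),
    "CASE-10": (4.308, 0.925),
    "CASE-100": (1361.0, 15.187),
}

# Assumed definition: improvement = Java JIT time / Gandiva LLVM time.
speedups = {test: java / gandiva for test, (java, gandiva) in timings.items()}

for test, ratio in speedups.items():
    print(f"{test}: {ratio:.2f}x")
```

Under that assumption four of the five rows reproduce the listed improvement to within rounding. The Project 10 columns row works out to roughly 7.17x from the listed times rather than 7.74x, so one of the three figures in that row may differ from the original measurement.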
Dremio’s Arrow-based Columnar Cloud Cache (C3) Accelerates I/O
✓ Columnar cloud cache (C3) automatically provides NVMe-level I/O performance when reading from S3/ADLS
✓ Arrow persistence enables granular caching as Arrow buffers in local engine NVMe
✓ Bypass data deserialization and decompression
✓ Enables high-concurrency, low-latency BI workloads on cloud data lake storage
[Diagram: AWS S3 feeding an XL engine and an M engine; each engine is a set of executors with local NVMe running C3 with Apache Arrow persistence]
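The general pattern behind the bullets above is a read-through cache with a fine-grained key. The following toy sketch assumes a cache keyed by object path and column, with a Python dict standing in for local NVMe and a fake function standing in for S3/ADLS. It is not Dremio's implementation, and every name in it (`ReadThroughCache`, `fake_object_store`, the sample path) is hypothetical.

```python
class ReadThroughCache:
    """Toy read-through cache keyed by (object path, column name)."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # callable(path, column) -> bytes
        self._local = {}                   # stands in for local NVMe storage
        self.remote_reads = 0              # counts trips to the object store

    def read(self, path, column):
        key = (path, column)
        if key not in self._local:
            # Cache miss: fetch from remote storage once and keep the result.
            self.remote_reads += 1
            self._local[key] = self._fetch_remote(path, column)
        # Cache hit (or freshly filled entry): serve the local copy.
        return self._local[key]


def fake_object_store(path, column):
    """Hypothetical stand-in for a slow S3/ADLS read."""
    return f"{path}:{column}".encode()


cache = ReadThroughCache(fake_object_store)
first = cache.read("sales.parquet", "amount")   # miss, goes remote
second = cache.read("sales.parquet", "amount")  # hit, served locally
other = cache.read("sales.parquet", "city")     # different column, miss
print(cache.remote_reads)
```

Keying by column rather than by whole file is what makes the caching granular in this sketch: a query that touches one column does not pull the rest of the object into the cache.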
The Open Data Platform
Client: Interactive SQL & BI | Data science & batch | Occasional SQL
Compute: Athena | EMR | Batch processing
Data: File formats: Text | JSON | Parquet | ORC
Data: Table formats: Glue | Hive Metastore | Delta Lake | Iceberg
Storage: AWS S3 | ADLS | HDFS
We Need Fast, Industry-Standard Data Exchange
[Diagram: the Open Data Platform stack (Client, Compute, Data, Storage) with numbered callouts 1-4]
Arrow Flight is an Arrow-based RPC Interface
✓ High-performance wire protocol
✓ Parallel streams of Arrow buffers are transferred
✓ Delivers on the interoperability promise of Apache Arrow
✓ Client-cluster and cluster-cluster communication
[Diagram: Arrow Flight streams delivered to a client dataframe]
Arrow Flight Python Client
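import pyarrow.flight as flt

# Placeholder query; the slide does not show how sql is defined.
sql = "SELECT 1"
# Connect to the Flight server at the slide's host and port (plaintext gRPC assumed).
c = flt.connect("grpc://localhost:47470")
# Describe the query as a command and ask the server where to fetch the results.
fd = flt.FlightDescriptor.for_command(sql)
fi = c.get_flight_info(fd)
# Read the first endpoint's stream into a pyarrow Table.
ticket = fi.endpoints[0].ticket
df = c.do_get(ticket).read_all()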
Client-Cluster Communication
Cluster-Cluster Communication
Demo
Q&A
The Data Lake Engine
Dremio is the Data Lake Engine
A Next-Generation Data Lake Query Engine for Live, Interactive Analytics Directly on Data Lake Storage
[Diagram: Data Users (BI Users, SQL, Data Scientists) -> Data Lake Engine -> Data Lake Storage (ADLS or S3) and Optional External Sources]
Accelerate Business
100X BI query speed
4X Ad-hoc query speed
0 cubes, extracts, or aggregation tables
Reduce Cost & Risk
10x lower AWS EC2 / Azure VM spend for same performance
0 lock-in, loss of control, and duplication of data
Powered by
