Apache Drill
 Architecture Outlines




     Jason Frantz
whoami
• Jason


• Software Architect at MapR


• Background in Databases, File Systems,
  Distributed Systems


• jfrantz@maprtech.com, jason@apache.org
Data Flow
Big Picture




• Same basic pieces as most databases:
  –   Driver: manage interaction with client
  –   Parser: process textual query language
  –   Compiler / Optimizer: convert logical query into physical plan
  –   Execution Engine: run physical plan across cluster
  –   Storage Handlers: feed user data in/out of execution
Parser
• Converts text-based query language into internal DAG representation
  – Grammar, syntax, basic query validation
  – Generally straightforward to implement


• Initial goal is to support a SQL-like query language for nested data (DrQL)
   – Compatible with Google BigQuery/Dremel
   – Designed to support data sources that have a well-defined schema (e.g.
     protocol buffers) as well as those that don't (e.g. JSON)

• Other potential input styles:
  – MongoDB's query language
  – Hive
  – Pig
Traditional Query Optimizers
• 30+ years of history in relational query optimization
   – We have to follow the same general path


• Converts a logical query plan into a physical one
  – Example: convert logical “JOIN” operator into specific hash join operator
  – Attempts to choose the “best” overall execution plan


• Magic black box of statistics!
  – Optimizers do great with queries that can be easily modeled with
    available statistics
  – Difficulties: lack of statistics, complex schemas, complex queries
  – Database users often work around the optimizer using query hints
    ● e.g. “force index”
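
A minimal sketch of the logical-to-physical step described above, in Python. The class names, the `row_counts` statistics dictionary, and the `hint` parameter are all hypothetical and only illustrate the idea: pick a hash join, build the hash table on the smaller input when statistics are available, and let a user hint override the choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalJoin:
    left: str
    right: str
    key: str


@dataclass(frozen=True)
class PhysicalJoin:
    algorithm: str
    build_side: str
    probe_side: str
    key: str


def plan_join(join, row_counts, hint=None):
    """Turn a logical JOIN into a physical join operator (illustrative only)."""
    # A query hint bypasses the optimizer's own choice of algorithm.
    algorithm = hint if hint is not None else "hash_join"
    # Missing statistics are treated as "very large", so a table with
    # known size is preferred as the build side.
    left_rows = row_counts.get(join.left, float("inf"))
    right_rows = row_counts.get(join.right, float("inf"))
    if left_rows <= right_rows:
        build, probe = join.left, join.right
    else:
        build, probe = join.right, join.left
    return PhysicalJoin(algorithm, build, probe, join.key)


logical = LogicalJoin("orders", "users", "user_id")
plan = plan_join(logical, {"orders": 1_000_000, "users": 10_000})
hinted = plan_join(logical, {}, hint="merge_join")
```

With the statistics shown, the smaller `users` table becomes the build side. With no statistics at all the sketch can only fall back on input order, which is the "lack of statistics" difficulty in miniature.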
Intermediate Representation
• Intermediate Representation (IR) is the common internal API
   – Output from Parser
   – Input/Output from Optimizer
   – Input to Execution Engine


• Textual Representation:
   – Flexibility
     ● Different users can enter at different levels of the IR
     ● Advanced users can skip optimizer entirely
   – Easier to test various pieces
   – Easy to cache
     ● Query optimization can be computationally expensive, so traditional databases go to
       great lengths to reuse execution plans

• Ideally the IR would be the format used between optimization passes
   – Inspiration: LLVM, SQL Server showplan
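
One way to picture a textual IR that users can enter at different levels and that is easy to cache, sketched in Python. JSON as the text format, the `level` field, the stand-in `optimize` function, and the `PlanCache` class are assumptions made for illustration. The slides do not specify the IR syntax.

```python
import json


def optimize(logical_plan):
    """Hypothetical stand-in optimizer: tags each operator with an implementation."""
    ops = [dict(op, impl="default") for op in logical_plan["ops"]]
    return {"level": "physical", "ops": ops}


class PlanCache:
    """Caches physical plans keyed by the IR text that was submitted."""

    def __init__(self):
        self._plans = {}
        self.optimizer_runs = 0

    def physical_plan(self, ir_text):
        if ir_text in self._plans:
            return self._plans[ir_text]
        plan = json.loads(ir_text)
        # Plans submitted at the physical level skip the optimizer entirely.
        if plan["level"] == "logical":
            self.optimizer_runs += 1
            plan = optimize(plan)
        physical_text = json.dumps(plan, sort_keys=True)
        self._plans[ir_text] = physical_text
        return physical_text


cache = PlanCache()
logical_text = json.dumps(
    {"level": "logical", "ops": [{"op": "scan", "source": "logs.json"}]}
)
first = cache.physical_plan(logical_text)
second = cache.physical_plan(logical_text)  # served from the cache
```

Because every stage reads and writes the same text, each piece can be tested by feeding it a hand-written plan.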
Execution Engine
• Execution layer
   – Query is a DAG of operators

• Operator layer
   – Implementation of individual operators and data format
     serialization
Execution Layer
• Query structured as a Directed Acyclic Graph (DAG) representing the data flow
  – Each node is an abstract “operator”
  – Communication between nodes is “blobs” of data
  – Data model described well in Microsoft's Dryad paper (Isard '07)


• Responsible for handling:
  – Operator dependencies
  – Task scheduling
  – Inter-node communication


• Notable features:
  – Speculative execution
  – Pipelining with spill-to-disk as fallback
  – Back pressure
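
A small Python sketch of the operator-dependency part of this layer: operators run in an order that respects the DAG, and each passes its output "blob" downstream. The `run_dag` function and the operator names are hypothetical. The sketch runs on a single node and leaves out speculative execution, pipelining, and back pressure.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(operators, dependencies):
    """Run each operator after all of its upstream operators have finished.

    operators:    name -> function taking a list of input blobs
    dependencies: name -> list of upstream operator names
    """
    outputs = {}
    for name in TopologicalSorter(dependencies).static_order():
        inputs = [outputs[upstream] for upstream in dependencies.get(name, [])]
        outputs[name] = operators[name](inputs)
    return outputs


operators = {
    "scan": lambda inputs: [1, 2, 3, 4, 5, 6],
    "filter": lambda inputs: [x for x in inputs[0] if x % 2 == 0],
    "sum": lambda inputs: sum(inputs[0]),
}
dependencies = {"scan": [], "filter": ["scan"], "sum": ["filter"]}
result = run_dag(operators, dependencies)
```

A real scheduler would dispatch ready operators to different nodes and stream blobs between them instead of materializing every output.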
Operator Execution
• Implementation of individual operators
   – Example built-in operators: hash aggregate, filter, json-scan
   – Extensible so new operators are easy to plug in


• Serialization-aware:
   – Each “blob” is a batch of rows in a particular format:
     ● Row-wise, no schema: MessagePack
     ● Row-wise, schema: Protocol Buffers
     ● Columnar, schema: Dremel-style format


   – Different operator implementations for different serializations
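
A sketch of "different operator implementations for different serializations", in Python. Plain lists of dicts and dicts of lists stand in for the row-wise and columnar batch formats. The registry and function names are hypothetical, and no real MessagePack or Protocol Buffers encoding is involved.

```python
from collections import defaultdict


def hash_aggregate_rows(batch, key, value):
    """Sum `value` per `key` over a row-wise batch (a list of dicts)."""
    totals = defaultdict(int)
    for row in batch:
        totals[row[key]] += row[value]
    return dict(totals)


def hash_aggregate_columns(batch, key, value):
    """Sum `value` per `key` over a columnar batch (a dict of lists)."""
    totals = defaultdict(int)
    for k, v in zip(batch[key], batch[value]):
        totals[k] += v
    return dict(totals)


# One logical operator, one implementation per batch format.
IMPLEMENTATIONS = {
    ("hash_aggregate", "row"): hash_aggregate_rows,
    ("hash_aggregate", "columnar"): hash_aggregate_columns,
}


def run_operator(name, batch_format, batch, **kwargs):
    return IMPLEMENTATIONS[(name, batch_format)](batch, **kwargs)


row_batch = [{"k": "a", "n": 1}, {"k": "b", "n": 2}, {"k": "a", "n": 3}]
col_batch = {"k": ["a", "b", "a"], "n": [1, 2, 3]}
by_row = run_operator("hash_aggregate", "row", row_batch, key="k", value="n")
by_col = run_operator("hash_aggregate", "columnar", col_batch, key="k", value="n")
```

Plugging in a new operator or a new format means adding an entry to the registry, not changing the execution layer.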
Storage Interfaces
• Scanner operators
   – Common APIs to convert user data into formats understood by
     execution operators
   – Example conversions:
     ● JSON → MessagePack
     ● CSV → MessagePack
     ● Dremel: columnar serialization → Protocol Buffers




• Data sources:
  – HDFS
  – NFS
  – HBase / Cassandra
  – MySQL / PostgreSQL / etc
Storage Interfaces
• Scanner Flexibility:
   – Allow in-place filtering (predicate pushdown)
   – Scanners can manage their own caching policies for their
     data

• In-place processing
   – Having a separate “ETL” step is painful
     ● Easiest to process data on demand
   – Query workload gives feedback on scanner access patterns
     ● Database Cracking: adaptively convert storage layout into
       more efficient forms
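
Predicate pushdown in a scanner can be sketched as follows, in Python. The `scan_csv` function and its `predicate` parameter are hypothetical. The point is only that the filter runs inside the scanner, on the data in place, so rejected rows never reach the execution operators.

```python
import csv
import io


def scan_csv(text, predicate=None):
    """Yield CSV rows as dicts, applying the pushed-down predicate while scanning."""
    for row in csv.DictReader(io.StringIO(text)):
        if predicate is None or predicate(row):
            yield row


data = "name,age\nann,31\nbob,17\ncy,45\n"
adults = list(scan_csv(data, predicate=lambda row: int(row["age"]) >= 18))
everyone = list(scan_csv(data))
```

A scanner backed by an indexed store could go further and turn the predicate into an index lookup, which this sketch does not attempt.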
Design Principles
• Flexible
  – Pluggable query languages
  – Extensible execution engine
  – Pluggable data formats
  – Column-based and row-based
  – Schema and schema-less
  – Pluggable data sources

• Easy
  – Unzip and run
  – Zero configuration
  – Reverse DNS not needed
  – IP addresses can change
  – Clear and concise log messages

• Dependable
  – No SPOF
  – Instant recovery from crashes

• Fast
  – C/C++ core with Java support
  – Google C++ style guide
  – Min latency and max throughput (limited only by hardware)
The Evolution of Meme Coins A New Era for Digital Currency ppt.pdf
Abi john
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 

Drill architecture 20120913

  • 1. Apache Drill Architecture Outlines Jason Frantz
  • 2. whoami
    • Jason
    • Software Architect at MapR
    • Background in Databases, File Systems, Distributed Systems
    • jfrantz@maprtech.com, jason@apache.org
  • 4. Big Picture
    • Same basic pieces as most databases:
      – Driver: manage interaction with client
      – Parser: process textual query language
      – Compiler / Optimizer: convert logical query into physical plan
      – Execution Engine: run physical plan across cluster
      – Storage Handlers: feed user data in/out of execution
  • 5. Parser
    • Converts text-based query language into internal DAG representation
      – Grammar, syntax, basic query validation
      – Generally straightforward to implement
    • Initial goal is to support a SQL-like query language for nested data (DrQL)
      – Compatible with Google BigQuery/Dremel
      – Designed to support data sources that have a well-defined schema (e.g. protocol buffers) as well as those that don't (e.g. JSON)
    • Other potential input styles:
      – MongoDB's query language
      – Hive
      – Pig
  • 6. Traditional Query Optimizers
    • 30+ year history of relational query optimization
      – We have to follow the same general path
    • Converts a logical query plan into a physical one
      – Example: convert logical “JOIN” operator into specific hash join operator
      – Attempts to choose the “best” overall execution plan
    • Magic black box of statistics!
      – Optimizers do great with queries that can be easily modeled with available statistics
      – Difficulties: lack of statistics, complex schemas, complex queries
      – Database users often work around the optimizer using query hints
        ● “force index”
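The logical-to-physical step on this slide can be illustrated with a toy rule. This is a minimal sketch, not Drill's optimizer: the class names (`LogicalJoin`, `HashJoin`, and so on), the `stats` dictionary, and the "build on the smaller input" rule are all assumptions made for illustration. It also shows the fallback a planner needs when statistics are missing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalScan:
    table: str


@dataclass(frozen=True)
class LogicalJoin:
    left: object
    right: object
    key: str


@dataclass(frozen=True)
class PhysicalScan:
    table: str


@dataclass(frozen=True)
class HashJoin:
    build: object  # input loaded into the in-memory hash table
    probe: object  # input streamed against that hash table
    key: str


def estimated_rows(node, stats):
    """Return a row-count estimate, or None when no statistics exist."""
    if isinstance(node, LogicalScan):
        return stats.get(node.table)
    return None


def to_physical(node, stats):
    """Convert a (hypothetical) logical plan into a physical plan."""
    if isinstance(node, LogicalScan):
        return PhysicalScan(node.table)
    if isinstance(node, LogicalJoin):
        left_rows = estimated_rows(node.left, stats)
        right_rows = estimated_rows(node.right, stats)
        left = to_physical(node.left, stats)
        right = to_physical(node.right, stats)
        # Build the hash table on the smaller input when both estimates
        # are known; otherwise fall back to the order the query was written in.
        if left_rows is not None and right_rows is not None and left_rows < right_rows:
            return HashJoin(build=left, probe=right, key=node.key)
        return HashJoin(build=right, probe=left, key=node.key)
    raise TypeError(f"unsupported logical node: {node!r}")


query = LogicalJoin(LogicalScan("orders"), LogicalScan("users"), key="user_id")
plan = to_physical(query, {"orders": 1_000_000, "users": 1_000})
```

With the statistics above, `plan.build` is the scan of `users`. With an empty `stats` dictionary the rule cannot compare sizes and keeps the written order, which is the situation where users tend to reach for hints.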
  • 7. Intermediate Representation
    • Intermediate Representation (IR) is common internal API
      – Output from Parser
      – Input/Output from Optimizer
      – Input to Execution Engine
    • Textual Representation:
      – Flexibility
        ● Different users can enter at different levels of the IR
        ● Advanced users can skip optimizer entirely
      – Easier to test various pieces
      – Easy to cache
        ● Query optimization can be computationally expensive, so traditional databases go to great lengths to reuse execution plans
    • Ideally IR would be format used between optimization passes
      – Inspiration: LLVM, SQL Server showplan
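A textual IR makes plans easy to store, diff, and cache. The sketch below assumes JSON as the text format purely for illustration (the slides do not name a format), and the plan fields (`op`, `expr`, `input`, `source`) are hypothetical. It round-trips a plan through text and derives a cache key that ignores whitespace and key order.

```python
import hashlib
import json


def plan_to_text(plan: dict) -> str:
    """Serialize a plan to human-readable text (JSON is an assumption here)."""
    return json.dumps(plan, sort_keys=True, indent=2)


def text_to_plan(text: str) -> dict:
    """Parse the textual form back into the in-memory plan."""
    return json.loads(text)


def cache_key(text: str) -> str:
    """Hash a canonical form so formatting differences do not defeat the cache."""
    canonical = json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical logical plan: filter the rows produced by a JSON scan.
logical_plan = {
    "op": "filter",
    "expr": "age > 30",
    "input": {"op": "scan", "source": "users.json"},
}

text = plan_to_text(logical_plan)
plan_cache = {cache_key(text): "physical plan placeholder"}
```

Because the key is computed from the canonical form, a user who submits the same plan with different indentation still hits the cached entry.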
  • 8. Execution Engine
    • Execution layer
      – Query is a DAG of operators
    • Operator layer
      – Implementation of individual operators and data format serialization
  • 9. Execution Layer
    • Query structured as a Directed Acyclic Graph (DAG) representing the data flow
      – Each node is an abstract “operator”
      – Communication between nodes is “blobs” of data
      – Data model described well in Microsoft's Dryad paper (Isard '07)
    • Responsible for handling:
      – Operator dependencies
      – Task scheduling
      – Inter-node communication
    • Notable features:
      – Speculative execution
      – Pipelining with spill-to-disk as fallback
      – Back pressure
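Dependency handling in a DAG of operators can be sketched in a few lines with the standard-library `graphlib` module (Python 3.9+). This is a single-process illustration only: the operator names and the `run_dag` helper are hypothetical, each "blob" is a plain Python list, and none of the listed features (speculative execution, spilling, back pressure) are modeled.

```python
from graphlib import TopologicalSorter


def run_dag(deps, operators):
    """Run operators in dependency order.

    deps maps each operator name to the names whose output it consumes;
    operators maps each name to a function taking a list of input blobs.
    Returns the outputs and the "waves" of operators that were ready together.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    outputs = {}
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        waves.append(ready)
        for name in ready:
            inputs = [outputs[d] for d in deps.get(name, [])]
            outputs[name] = operators[name](inputs)
            sorter.done(name)
    return outputs, waves


# Hypothetical query: intersect two scans, then count the result.
deps = {
    "scan_a": [],
    "scan_b": [],
    "join": ["scan_a", "scan_b"],
    "count": ["join"],
}
operators = {
    "scan_a": lambda _: [1, 2, 3],
    "scan_b": lambda _: [2, 3, 4],
    "join": lambda ins: [x for x in ins[0] if x in ins[1]],
    "count": lambda ins: len(ins[0]),
}

outputs, waves = run_dag(deps, operators)
```

Operators in the same wave have no dependencies on each other, so a real scheduler could hand them to different workers; here they simply run one after another.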
  • 10. Operator Execution
    • Implementation of individual operators
      – Example built-in operators: hash aggregate, filter, json-scan
      – Extensible so new operators are easy to plug in
    • Serialization-aware:
      – Each “blob” is a batch of rows in a particular format:
        ● Row-wise, no schema: MessagePack
        ● Row-wise, schema: Protocol Buffers
        ● Columnar, schema: Dremel-style format
      – Different operator implementations for different serializations
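To make the operator model concrete, here is a minimal sketch of a filter and a hash aggregate working on row-wise, schema-less batches. It assumes each batch is a list of Python dicts standing in for a decoded blob; the function names and the sample data are hypothetical, and no real serialization format is involved.

```python
from collections import defaultdict


def filter_op(batches, predicate):
    """Streaming filter: emit each batch with only the rows that match."""
    for batch in batches:
        kept = [row for row in batch if predicate(row)]
        if kept:
            yield kept


def hash_aggregate_sum(batches, group_key, value_key):
    """Blocking hash aggregate: sum value_key per group_key across all batches."""
    totals = defaultdict(int)
    for batch in batches:
        for row in batch:
            totals[row[group_key]] += row[value_key]
    # Emit a single output batch once every input batch has been consumed.
    return [{group_key: key, "sum": total} for key, total in totals.items()]


batches = [
    [{"city": "SF", "n": 1}, {"city": "NY", "n": 2}],
    [{"city": "SF", "n": 3}, {"city": "NY", "n": -5}],
]

positive = filter_op(batches, lambda row: row["n"] > 0)
result = hash_aggregate_sum(positive, group_key="city", value_key="n")
```

The filter is a generator, so batches flow through it one at a time, while the aggregate must see all of its input before it can emit anything. A columnar batch format would call for a different implementation of the same two operators.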
  • 11. Storage Interfaces
    • Scanner operators
      – Common APIs to convert user data into formats understood by execution operators
      – Example conversions:
        ● JSON → MessagePack
        ● CSV → MessagePack
        ● Dremel: columnar serialization → Protocol Buffers
    • Data sources:
      – HDFS
      – NFS
      – HBase / Cassandra
      – MySQL / PostgreSQL / etc
  • 12. Storage Interfaces
    • Scanner Flexibility:
      – Allow in-place filtering (predicate pushdown)
      – Scanners can manage their own caching policies for their data
    • In-place processing
      – Having a separate “ETL” step is painful
        ● Easiest to process data on demand
      – Query workload gives feedback on scanner access patterns
        ● Database Cracking: adaptively convert storage layout into more efficient forms
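Predicate pushdown can be sketched as a scanner that accepts the filter and applies it while reading, so rejected rows never reach downstream operators. The example below is an assumption-laden illustration: `csv_scan` is a hypothetical name, the input is an in-memory CSV string rather than a real data source, and batches are lists of dicts instead of a serialized format such as MessagePack.

```python
import csv
import io


def csv_scan(text, predicate=None, batch_size=2):
    """Read CSV text on demand and yield batches of rows.

    When a predicate is supplied it is evaluated inside the scanner
    (pushdown), so non-matching rows are dropped before batching.
    """
    reader = csv.DictReader(io.StringIO(text))
    batch = []
    for row in reader:
        if predicate is not None and not predicate(row):
            continue
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch


data = "name,age\nann,34\nbob,19\ncy,52\ndee,41\n"

over_30 = list(csv_scan(data, predicate=lambda row: int(row["age"]) > 30))
```

The scanner works directly on the raw data with no separate load step, which matches the in-place processing idea on this slide. Note that `csv.DictReader` returns every field as a string, so the predicate converts `age` itself.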
  • 13. Design Principles
    • Flexible
      – Pluggable query languages
      – Extensible execution engine
      – Pluggable data formats
      – Column-based and row-based
      – Schema and schema-less
      – Pluggable data sources
    • Easy
      – Unzip and run
      – Zero configuration
      – Reverse DNS not needed
      – IP addresses can change
      – Clear and concise log messages
    • Dependable
      – No SPOF
      – Instant recovery from crashes
    • Fast
      – C/C++ core with Java support
      – Google C++ style guide
      – Min latency and max throughput (limited only by hardware)