Real Time Analytics at Uber: Bring SQL into Everything
Zhenxiao Luo
NYC
Uber’s mission is to ignite opportunity by setting the world in motion.
15M Trips/Day
600+ Cities
75M Monthly Riders
Data informs every decision at the company
Overview of Uber’s Data Platform
[Diagram. Labels shown: Data Sources, Raw Data, Modeled Tables, Mining Business Insights, Consuming Business Insights, Experimentation, Data Science, Machine Learning, Custom Data Sets, Dashboarding, Alerting, Monitoring, Data Exploration, Knowledge Bases, Storage, Infrastructure, ETL Frameworks, Data Integrity, Query Engines]
Uber Data Infrastructure
[Architecture diagram. Labels shown: Kafka, Schemaless, MySQL, Postgres, Vertica, Streamio, Sqoop, Raw Data, Raw Tables, Modeled Tables, Hadoop, Hive, Presto, Spark, Samza, Pinot, Flink, AresDB, Streaming, Warehouse, Real-time, Reports, Notebook, Ad Hoc Queries, Real Time Applications, Machine Learning Jobs, Business Intelligence Jobs, Cluster Management, All-Active, Observability, Security]
Presto @ Uber-scale
5K Weekly Active Users
160K Queries/day
3 Data Centers
2K Nodes
700M HDFS files read/day
10PB HDFS files processed/day
Presto use cases at Uber
Growth Marketing
Data Science
Marketplace
Pricing
Community
Operations
Data Quality
Ad-hoc Querying
The people who rely on us
Inventor Ivan
Roles: Data Scientists, Software Engineers, ML/AI Researchers
Technical skills: Advanced SQL, Advanced Statistics, Scala/Spark, Python/R, Data Modeling
Reliant Rebecca
Roles: Marketing Managers, Entry-level Analysts, General Managers, Product Managers
Technical skills: Limited SQL, Spreadsheets
Monitoring Matt
Roles: City Operations, Regional Managers
Technical skills: Intermediate SQL, Spreadsheets, Dashboarding
Analyst Anna
Roles: Operations Managers, Data Analysts, Product Analysts
Technical skills: Advanced SQL, Spreadsheets, Limited Statistics, Limited Python/R
Exploratory ML & model-training
Data Scientists, ML Researchers, Engineers
Using ML to ensure data security and compliance
Advanced data science & complex analytics
Data Scientists, Ops Analysts, Support Agents
Surfacing hidden insights to empower restaurants
Business process automation
S&P Analysts, Ops Managers, Contractors
Using technology to make transportation safer
What is Presto: Interactive SQL Engine for Big Data
Interactive query speeds
Horizontally scalable
ANSI SQL
Battle-tested by Facebook, Uber, LinkedIn, Twitter, Netflix, Airbnb, etc.
Completely open source
Access to petabytes of data in Hadoop, Elasticsearch, Pinot, etc.
How Presto Works
Why Presto is Fast
● Data in memory during execution
● Pipelining and streaming
● Columnar storage & execution
● Bytecode generation
Resource Management
● Presto has its own resource manager
○ Not on YARN
○ Not on Mesos
● CPU Management
○ Priority queues
○ Short running queries higher priority
● Memory Management
○ Max memory per query per node
○ If query exceeds max memory limit, query fails
○ No OutOfMemory in Presto process
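The memory rule above (a per-query, per-node cap, with the query failing rather than the process running out of memory) can be sketched as follows. This is an illustrative Python model only, not Presto's implementation: `NodeMemoryPool`, `QueryMemoryLimitExceeded`, and the byte counts are hypothetical names and values chosen for the example.

```python
from typing import Dict


class QueryMemoryLimitExceeded(Exception):
    """Raised when one query asks for more than its per-node budget."""


class NodeMemoryPool:
    """Hypothetical tracker of bytes reserved per query on a single node."""

    def __init__(self, max_bytes_per_query: int) -> None:
        self.max_bytes_per_query = max_bytes_per_query
        self.reserved: Dict[str, int] = {}

    def reserve(self, query_id: str, num_bytes: int) -> None:
        current = self.reserved.get(query_id, 0)
        if current + num_bytes > self.max_bytes_per_query:
            # Fail only the offending query and free what it held;
            # other queries on this node keep their reservations.
            self.reserved.pop(query_id, None)
            raise QueryMemoryLimitExceeded(
                f"{query_id} exceeded {self.max_bytes_per_query} bytes"
            )
        self.reserved[query_id] = current + num_bytes

    def release(self, query_id: str) -> None:
        # Called when a query finishes normally.
        self.reserved.pop(query_id, None)


pool = NodeMemoryPool(max_bytes_per_query=100)
pool.reserve("q1", 60)
pool.reserve("q2", 80)
try:
    pool.reserve("q1", 50)  # 60 + 50 > 100, so q1 fails
    q1_failed = False
except QueryMemoryLimitExceeded:
    q1_failed = True
```

The point of the design is that the limit is checked at reservation time, so an oversized query is rejected before the allocation happens and the node stays healthy for everyone else.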
Presto Connectors:
No Need to Copy Data
Uber Contributions
Contributions
New Features
● Geospatial indexing and operations - 10x or more speedup
● Pinot connector enhancements (in-house)
Optimizations
● Elasticsearch connector
● New Parquet reader - 4x speedup
● Nested column pushdowns (project, predicate) - 10x speedup
Security
● Metastore authentication support for Kerberos deployments
● Dispatch Proxy using HTTP redirect for multi-cluster operation
Presto Connector Interface
● ConnectorMetadata
○ Schema, Table, Column
● ConnectorSplitManager
○ Divide data into splits
● ConnectorSplit
○ Split data range
○ Predicate/JsonFunction/Limit pushdown
● ConnectorRecordCursor
○ Transform underlying storage data into Presto internal page/block
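To make the four roles above concrete, here is a toy in-memory analog written in Python. The real connector interface is a Java SPI; the class `InMemoryConnector`, the `Split` dataclass, and the method names below are hypothetical stand-ins that only mirror the responsibilities listed on the slide (metadata, split management, splits, and a cursor that produces columnar pages).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Tuple[int, str]


@dataclass(frozen=True)
class Split:
    """ConnectorSplit analog: a half-open row range [start, end)."""
    start: int
    end: int


class InMemoryConnector:
    """Toy connector over a list of (city_id, status) rows."""

    COLUMNS = ["city_id", "status"]

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows

    def columns(self) -> List[str]:
        # ConnectorMetadata analog: expose the table's columns.
        return list(self.COLUMNS)

    def splits(self, split_size: int) -> List[Split]:
        # ConnectorSplitManager analog: divide the data into splits.
        return [
            Split(start, min(start + split_size, len(self.rows)))
            for start in range(0, len(self.rows), split_size)
        ]

    def read_page(self, split: Split) -> Dict[str, list]:
        # ConnectorRecordCursor analog: turn stored rows into a
        # column-oriented page (one list of values per column).
        chunk = self.rows[split.start:split.end]
        return {
            name: [row[i] for row in chunk]
            for i, name in enumerate(self.COLUMNS)
        }


connector = InMemoryConnector(
    [(1, "completed"), (2, "canceled"), (1, "completed")]
)
all_splits = connector.splits(split_size=2)
first_page = connector.read_page(all_splits[0])
```

Each split can be handed to a different worker, which is what lets the engine read one source in parallel.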
Presto Elasticsearch Connector
Data Model
● Each Elasticsearch index is a table partition
● Each field of an index is a column
● All Elasticsearch indexes sharing the same prefix form a logical table
○ es-vehicles-sjc1, es-vehicles-dca1, es-vehicles
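A minimal sketch of the prefix rule, assuming the logical table name is itself the shared prefix and that suffixes are separated by a hyphen. The slide does not say how the connector derives the prefix, so the matching rule and the function name `partitions_for_table` are assumptions made for illustration.

```python
from typing import Iterable, List


def partitions_for_table(table_prefix: str, index_names: Iterable[str]) -> List[str]:
    """Return the indexes that would act as partitions of one logical table.

    Assumed rule: an index belongs to the table if its name equals the
    prefix or starts with the prefix followed by a hyphen.
    """
    return sorted(
        name
        for name in index_names
        if name == table_prefix or name.startswith(table_prefix + "-")
    )


indexes = ["es-vehicles-sjc1", "es-vehicles-dca1", "es-vehicles", "es-trips-sjc1"]
vehicle_partitions = partitions_for_table("es-vehicles", indexes)
```

Requiring the hyphen after the prefix keeps an unrelated index such as a hypothetical `es-vehicles2` from being pulled into the table by accident.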
Describe Table
Query
Optimizations
● Parallel Reads
○ Get all indices and search nodes
○ For each search node, send request for one specific index
● Cap Max Hits
● Predicate Pushdown
● Json Function Pushdown
● Limit Pushdown
● Nested Fields
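Predicate pushdown, limit pushdown, and the hit cap listed above can be pictured as building a narrower Elasticsearch search request instead of filtering in the engine. The sketch below only constructs a request body in the Elasticsearch query DSL (`bool`/`filter`/`term` and `size`); the function name `build_search_body`, its parameters, and the restriction to equality predicates are assumptions for the example, not the connector's actual code.

```python
from typing import Any, Dict, Optional


def build_search_body(
    equality_predicates: Dict[str, Any],
    limit: Optional[int],
    max_hits: int,
) -> Dict[str, Any]:
    """Build a search body that pushes filters and the row limit to Elasticsearch."""
    # Predicate pushdown: each equality predicate becomes a term filter.
    # Nested fields are addressed by their dotted path, e.g. "base.status".
    filters = [{"term": {field: value}} for field, value in equality_predicates.items()]
    # Limit pushdown, bounded by the configured cap on hits per request.
    size = max_hits if limit is None else min(limit, max_hits)
    return {"query": {"bool": {"filter": filters}}, "size": size}


body = build_search_body({"base.status": "completed"}, limit=10, max_hits=1000)
uncapped = build_search_body({"base.city_id": 12}, limit=None, max_hits=500)
```

With the filter and size applied at the source, only matching documents, and no more than the query needs, cross the network.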
How many Uber trip requests did we serve in Chicago yesterday?
Fetch daily trip count in seconds
SELECT T.base.city_id AS cid,
       COUNT(CASE WHEN T.base.status = 'completed' THEN 1 END) AS completed_trips,
       COUNT(CASE WHEN T.base.status = 'canceled' THEN 1 END) AS rider_canceled_trips
FROM trips AS T
WHERE T.datestr = '2019-03-11'
GROUP BY 1
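The query relies on conditional aggregation: `COUNT` ignores NULLs, and a `CASE` without `ELSE` yields NULL for non-matching rows, so each `COUNT(CASE ...)` counts only one status. The sketch below reproduces the pattern on SQLite from Python so it can be run locally. The table is flattened (plain `city_id` and `status` columns instead of the nested `base` struct), the second alias is shortened, and the rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city_id INTEGER, status TEXT, datestr TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        (1, "completed", "2019-03-11"),
        (1, "completed", "2019-03-11"),
        (1, "canceled", "2019-03-11"),
        (2, "completed", "2019-03-11"),
        (2, "completed", "2019-03-10"),  # filtered out by the date predicate
    ],
)

rows = conn.execute(
    """
    SELECT city_id AS cid,
           COUNT(CASE WHEN status = 'completed' THEN 1 END) AS completed_trips,
           COUNT(CASE WHEN status = 'canceled' THEN 1 END) AS canceled_trips
    FROM trips
    WHERE datestr = '2019-03-11'
    GROUP BY 1
    ORDER BY 1
    """
).fetchall()
conn.close()
```

One pass over the filtered rows produces both counts per city, which is why the query touches only the city, status, and date fields.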
Default Apache Parquet Reader
[Figure: a Parquet file with two row groups, each holding column chunks for base.client_uuid, base.driver_uuid, base.status, base.vehicle_id, and base.city_id, plus the Parquet footer (file metadata, row group metadata), feeding the Presto columnar engine.]
Step 1: Read all Parquet nested fields from disk
Step 2: Transform Parquet rows into Presto columnar blocks
Step 3: Evaluate predicates on columnar blocks
Apache Parquet Reader Optimization
[Figure: the same Parquet layout (two row groups of column chunks plus the footer with file metadata and row group metadata) feeding the Presto columnar engine; the columns shown as read are base.driver_uuid and base.city_id.]
Step 1: Read ONLY required nested fields from disk
● Evaluate predicates on the fly
● Skip reading a row group when its dictionary rules out the predicate
○ predicate: base.city_id = 12
○ dictionary: base.city_id: {3, 5, 9, 14, 21}
● Build columnar blocks only for predicate matches
Step 2: Build columnar blocks on the fly
Step 3: Evaluate predicates on columnar blocks
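The two ideas in the optimized reader, reading only the required nested columns and skipping a row group whose dictionary cannot satisfy the predicate, can be modeled in a few lines. This is a simplified in-memory illustration, not the Parquet reader itself: `RowGroup`, `scan`, and the sample values are hypothetical, and real readers work on encoded column chunks rather than Python lists.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class RowGroup:
    columns: Dict[str, List[object]]      # column chunk name -> values
    dictionaries: Dict[str, Set[object]]  # column chunk name -> distinct values


def scan(
    row_groups: List[RowGroup],
    required_columns: List[str],
    predicate_column: str,
    predicate_value: object,
) -> List[Dict[str, List[object]]]:
    """Return one columnar block per row group that has matching rows."""
    blocks = []
    for group in row_groups:
        dictionary = group.dictionaries.get(predicate_column)
        if dictionary is not None and predicate_value not in dictionary:
            continue  # the dictionary rules out any match: skip the row group
        # Evaluate the predicate while reading, keeping matching positions only.
        matches = [
            i
            for i, value in enumerate(group.columns[predicate_column])
            if value == predicate_value
        ]
        if not matches:
            continue
        # Build blocks only for required columns and only for matching rows.
        blocks.append(
            {name: [group.columns[name][i] for i in matches] for name in required_columns}
        )
    return blocks


group_a = RowGroup(
    columns={"base.city_id": [3, 5, 9], "base.driver_uuid": ["a", "b", "c"]},
    dictionaries={"base.city_id": {3, 5, 9, 14, 21}},
)
group_b = RowGroup(
    columns={"base.city_id": [12, 3, 12], "base.driver_uuid": ["d", "e", "f"]},
    dictionaries={"base.city_id": {3, 12}},
)
result = scan([group_a, group_b], ["base.driver_uuid"], "base.city_id", 12)
```

Here `group_a` is never scanned because 12 is absent from its dictionary, which mirrors the slide's example of the predicate `base.city_id = 12` against the dictionary {3, 5, 9, 14, 21}.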
Results
Looking forward
Federated SQL Layer
Vision
[Diagram. Labels shown: HDFS, Vertica, Elasticsearch, Apache Pinot, MySQL, Machines, Reports, Users, Presto, RealTime Presto, Proxy layer, Management, Universal Metadata Service]
Focus areas
Connectors
● Apache Hive, Apache Pinot, Elasticsearch, Apache Cassandra, Vertica, MySQL, etc
● Aggregation / Join pushdown
● Cross-connector optimizations (hybrid connectors)
Real-time
● Real-time mode with low-latency pass-through
● Query plan / result / data cache
● Time-series joins and stitching
Universal Metadata Service (UMS)
● Logical definitions / physical schemas
● Column stitching and joins
● Table and partition caching
Thank you
Proprietary © 2018 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying, recording, or by any
information storage or retrieval systems, without permission in writing from
Uber. This document is intended only for the use of the individual or entity to
whom it is addressed. All recipients of this document are notified that the
information contained herein includes proprietary information of Uber, and
recipient may not make use of, disseminate, or in any way disclose this
document or any of the enclosed information to any person other than
employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.
Slide deck title: Real time analytics at Uber @ Strata Data 2019

Understanding the Tor Network and Exploring the Deep WebUnderstanding the Tor Network and Exploring the Deep Web
Understanding the Tor Network and Exploring the Deep Web
nabilajabin35
 
Best web hosting Vancouver 2025 for you business
Best web hosting Vancouver 2025 for you businessBest web hosting Vancouver 2025 for you business
Best web hosting Vancouver 2025 for you business
steve198109
 
5-Proses-proses Akuisisi Citra Digital.pptx
5-Proses-proses Akuisisi Citra Digital.pptx5-Proses-proses Akuisisi Citra Digital.pptx
5-Proses-proses Akuisisi Citra Digital.pptx
andani26
 
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHostingTop Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
Top Vancouver Green Business Ideas for 2025 Powered by 4GoodHosting
steve198109
 
Smart Mobile App Pitch DeckäžšAI Travel App Presentation Template
Smart Mobile App Pitch DeckäžšAI Travel App Presentation TemplateSmart Mobile App Pitch DeckäžšAI Travel App Presentation Template
Smart Mobile App Pitch DeckäžšAI Travel App Presentation Template
yojeari421237
 
project_based_laaaaaaaaaaearning,kelompok 10.pptx
project_based_laaaaaaaaaaearning,kelompok 10.pptxproject_based_laaaaaaaaaaearning,kelompok 10.pptx
project_based_laaaaaaaaaaearning,kelompok 10.pptx
redzuriel13
 
IT Services Workflow From Request to Resolution
IT Services Workflow From Request to ResolutionIT Services Workflow From Request to Resolution
IT Services Workflow From Request to Resolution
mzmziiskd
 

Real Time Analytics at Uber @ Strata Data 2019

  • 1. Real Time Analytics at Uber: Bring SQL into Everything Zhenxiao Luo
  • 2. NYC. Uber’s mission is to ignite opportunity by setting the world in motion. 15M Trips/Day, 600+ Cities, 75M Monthly Riders
  • 3. Data informs every decision at the company
  • 4. Overview of Uber’s Data Platform: DATA SOURCES RAW DATA MODELED TABLES MINING BUSINESS INSIGHTS CONSUMING BUSINESS INSIGHTS EXPERIMENTATION DATA SCIENCE MACHINE LEARNING CUSTOM DATA SETS Dashboarding Alerting Monitoring Data Exploration Knowledge Bases Storage Infrastructure ETL Frameworks Data Integrity Query Engines
  • 5. Kafka Uber Data Infrastructure Schemaless MySQL, Postgres Vertica Streamio Raw Data Raw Tables Sqoop Reports Hadoop Hive Presto Spark Notebook Ad Hoc Queries Real Time Applications Machine Learning Jobs Business Intelligence Jobs Cluster Management All-Active Observability Security Vertica Samza Pinot Flink AresDB Modeled Tables Streaming Warehouse Real-time
  • 6. Presto @ Uber-scale: 5K weekly active users, 160K queries/day, 3 data centers, 2K nodes, 700M HDFS files read/day, 10PB HDFS files processed/day
  • 7. Presto use cases at Uber Growth Marketing Data Science Marketplace Pricing Community Operations Data Quality Ad-hoc Querying
  • 8. The people who rely on us, by technical skills. Inventor Ivan (Data Scientists, Software Engineers, ML/AI Researchers): Advanced SQL, Advanced Statistics, Scala/Spark, Python/R, Data Modeling. Reliant Rebecca (Marketing Managers, Entry-level Analysts, General Managers, Product Managers): Limited SQL, Spreadsheets. Monitoring Matt (City Operations, Regional Managers): Intermediate SQL, Spreadsheets, Dashboarding. Analyst Anna (Operations Managers, Data Analysts, Product Analysts): Advanced SQL, Spreadsheets, Limited Statistics, Limited Python/R.
  • 9. Exploratory ML & model-training. Data Scientists, ML Researchers, Engineers. Using ML to ensure data security and compliance
  • 10. Advanced data science & complex analytics. Data Scientists, Ops Analysts, Support Agents. Surfacing hidden insights to empower restaurants
  • 11. Business process automation. S&P Analysts, Ops Managers, Contractors. Using technology to make transportation safer
  • 12. What is Presto: Interactive SQL Engine for Big Data. Interactive query speeds. Horizontally scalable. ANSI SQL. Battle-tested by Facebook, Uber, LinkedIn, Twitter, Netflix, Airbnb, etc. Completely open source. Access to petabytes of data in Hadoop, Elasticsearch, Pinot, etc.
  • 14. Why Presto is Fast ● Data in memory during execution ● Pipelining and streaming ● Columnar storage & execution ● Bytecode generation
  • 15. Resource Management ● Presto has its own resource manager ○ Not on YARN ○ Not on Mesos ● CPU Management ○ Priority queues ○ Short running queries higher priority ● Memory Management ○ Max memory per query per node ○ If query exceeds max memory limit, query fails ○ No OutOfMemory in Presto process
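The memory and CPU rules on this slide can be illustrated with a minimal Python sketch. It is not Presto's implementation: the class names `QueryMemoryContext` and `ShortQueryFirstScheduler`, the byte counts, and the use of accumulated CPU seconds as the priority key are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass


class ExceededMemoryLimit(Exception):
    """Raised when one query asks for more than its per-node budget."""


@dataclass
class QueryMemoryContext:
    """Hypothetical per-query, per-node memory accounting."""
    query_id: str
    max_bytes_per_node: int
    reserved_bytes: int = 0

    def reserve(self, nbytes):
        # Only the offending query fails; the worker process keeps running,
        # which is the point of "No OutOfMemory in Presto process".
        if self.reserved_bytes + nbytes > self.max_bytes_per_node:
            raise ExceededMemoryLimit(
                f"{self.query_id}: {self.reserved_bytes + nbytes} bytes "
                f"exceeds limit of {self.max_bytes_per_node}"
            )
        self.reserved_bytes += nbytes


class ShortQueryFirstScheduler:
    """Toy priority queue: less CPU time used so far means higher priority."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal priorities stay in FIFO order

    def submit(self, query_id, cpu_seconds_used):
        heapq.heappush(self._heap, (cpu_seconds_used, self._counter, query_id))
        self._counter += 1

    def next_query(self):
        return heapq.heappop(self._heap)[2]
```

Failing a single query at reservation time, rather than letting the process run out of heap, keeps the other queries on the same node alive.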
  • 18. Contributions New Features ● Geospatial indexing and operations - 10x or more speedup ● Pinot connector enhancements (in-house) Optimizations ● Elasticsearch connector ● New Parquet reader - 4x speedup ● Nested column pushdowns (project, predicate) - 10x speedup Security ● Metastore authentication support for Kerberos deployments ● Dispatch Proxy using HTTP redirect for multi-cluster operation
  • 19. Presto Connector Interface ● ConnectorMetadata ○ Schema, Table, Column ● ConnectorSplitManager ○ Divide data into splits ● ConnectorSplit ○ Split data range ○ Predicate/JsonFunction/Limit pushdown ● ConnectorRecordCursor ○ Transform underlying storage data into Presto internal page/block
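How the four pieces named on this slide might fit together is sketched below in Python. The real interfaces are Java and their signatures differ. The class names are reused from the slide only as labels, while the method names, the in-memory `TABLE`, and the column-dictionary output are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[int, str]

# Toy "underlying storage": rows of (city_id, status).
TABLE: List[Row] = [
    (1, "completed"), (2, "canceled"), (1, "completed"), (3, "completed"),
]


class ConnectorMetadata:
    """Describes schema, table and columns to the engine."""

    def columns(self) -> List[str]:
        return ["city_id", "status"]


@dataclass(frozen=True)
class ConnectorSplit:
    """A data range plus whatever the engine pushed down to the connector."""
    start: int
    end: int
    predicate: Optional[Callable[[Row], bool]] = None
    limit: Optional[int] = None


class ConnectorSplitManager:
    """Divides the data into splits that can be read in parallel."""

    def get_splits(self, split_size, predicate=None, limit=None):
        return [
            ConnectorSplit(i, min(i + split_size, len(TABLE)), predicate, limit)
            for i in range(0, len(TABLE), split_size)
        ]


class ConnectorRecordCursor:
    """Turns storage rows of one split into a columnar block."""

    def __init__(self, split: ConnectorSplit):
        self.split = split

    def read_columns(self) -> Dict[str, list]:
        rows = TABLE[self.split.start:self.split.end]
        if self.split.predicate is not None:
            rows = [r for r in rows if self.split.predicate(r)]
        if self.split.limit is not None:
            rows = rows[: self.split.limit]
        return {"city_id": [r[0] for r in rows], "status": [r[1] for r in rows]}
```

Carrying the predicate and limit on the split lets each reader filter at the source instead of shipping every row to the engine.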
  • 21. Data Model ● each Elasticsearch index is a table partition ● each field of an index is a column ● all Elasticsearch indexes sharing the same prefix form a logical table ○ es-vehicles-sjc1, es-vehicles-dca1, es-vehicles
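The prefix rule can be sketched in a few lines of Python. The matching rule used here (an index belongs to a table if its name equals the table name or starts with the table name followed by a hyphen) is an assumption; the slide gives only the example names. The function names and the `es-trips-sjc1` index are hypothetical.

```python
def partitions_for_table(table, indices):
    """Return the indices that act as partitions of the logical table."""
    return sorted(
        name for name in indices
        if name == table or name.startswith(table + "-")
    )


def columns_for_table(field_names_by_index, partitions):
    """Union of the fields of all partitions; each field becomes a column."""
    columns = set()
    for index in partitions:
        columns.update(field_names_by_index.get(index, []))
    return sorted(columns)


indices = ["es-vehicles-sjc1", "es-vehicles-dca1", "es-vehicles", "es-trips-sjc1"]
parts = partitions_for_table("es-vehicles", indices)
```

With the slide's example names, `parts` holds the three `es-vehicles` indices and leaves `es-trips-sjc1` out.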
  • 23. Query
  • 24. Optimizations ● Parallel Reads ○ Get all indices and search nodes ○ For each search node, send request for one specific index ● Cap Max Hits ● Predicate Pushdown ● Json Function Pushdown ● Limit Pushdown ● Nested Fields
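As a rough illustration of predicate pushdown, limit pushdown and the max-hits cap, the Python sketch below builds an Elasticsearch search body from equality predicates. The `bool`/`filter`/`term` clauses and the `size` field are standard Elasticsearch Query DSL. The function name `build_search_body`, the default cap of 10000, and the restriction to equality predicates are assumptions, not the connector's actual behavior.

```python
def build_search_body(equals, limit=None, max_hits=10000):
    """Translate pushed-down predicates and a limit into a search body.

    equals   -- dict of field -> value equality predicates (hypothetical input)
    limit    -- LIMIT from the SQL query, if any
    max_hits -- cap on hits fetched per request (assumed default)
    """
    filters = [{"term": {field: value}} for field, value in sorted(equals.items())]
    query = {"bool": {"filter": filters}} if filters else {"match_all": {}}
    # Limit pushdown: never ask for more rows than the query needs,
    # and never more than the configured cap.
    size = max_hits if limit is None else min(limit, max_hits)
    return {"query": query, "size": size}


body = build_search_body({"status": "completed"}, limit=5)
```

Sending the filter and size to Elasticsearch means only matching documents, and no more than needed, travel back to the engine.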
  • 25. How many Uber trip requests did we serve in Chicago yesterday?
  • 26. Fetch daily trip count in seconds SELECT T.base.city_id AS cid, Count(CASE WHEN T.base.status = 'completed' THEN 1 END) AS completed_trips, Count(CASE WHEN T.base.status = 'canceled' THEN 1 END) AS rider_canceled_trips FROM trips AS T WHERE T.datestr = '2019-03-11' GROUP BY 1
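The shape of this query can be tried locally with Python's built-in `sqlite3` module. This is only a sketch: SQLite has no nested struct columns, so `T.base.city_id` and `T.base.status` are flattened to `city_id` and `status`, and the sample rows and the function name `daily_trip_counts` are invented for illustration.

```python
import sqlite3


def daily_trip_counts(conn, datestr):
    """Per-city completed and canceled trip counts for one day."""
    return conn.execute(
        """
        SELECT city_id AS cid,
               COUNT(CASE WHEN status = 'completed' THEN 1 END) AS completed_trips,
               COUNT(CASE WHEN status = 'canceled' THEN 1 END) AS rider_canceled_trips
        FROM trips
        WHERE datestr = ?
        GROUP BY 1
        ORDER BY 1
        """,
        (datestr,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city_id INTEGER, status TEXT, datestr TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        (1, "completed", "2019-03-11"),
        (1, "completed", "2019-03-11"),
        (1, "canceled", "2019-03-11"),
        (2, "completed", "2019-03-11"),
        (2, "completed", "2019-03-10"),  # different day, filtered out
    ],
)
rows = daily_trip_counts(conn, "2019-03-11")
```

`COUNT` ignores NULLs, so a `CASE` with no `ELSE` branch counts only the rows that match each status.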
  • 27. Default Apache Parquet Reader. Parquet file layout: row groups, each holding column chunks for base.client_uuid, base.driver_uuid, base.status, base.vehicle_id and base.city_id, followed by the Parquet footer (file metadata, row group metadata). Step 1: Read all Parquet nested fields from disk. Step 2: Transform Parquet rows into Presto columnar blocks. Step 3: Evaluate predicates on columnar blocks in the Presto columnar engine.
  • 28. Apache Parquet Reader Optimization. Parquet file layout: row groups, each holding column chunks for base.client_uuid, base.driver_uuid, base.status, base.vehicle_id and base.city_id, followed by the Parquet footer (file metadata, row group metadata). Step 1: Read ONLY required nested fields from disk (here base.driver_uuid and base.city_id). Evaluate predicates on the fly: skip reading a row group when, for example, the predicate is base.city_id = 12 and the dictionary for base.city_id is {3, 5, 9, 14, 21}. Step 2: Build columnar blocks on the fly, only for predicate matches. Step 3: Evaluate predicates on columnar blocks in the Presto columnar engine.
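The dictionary-based row group skipping on this slide can be sketched in Python. This is a toy in-memory model, not the Parquet reader: the `RowGroup` class, the `scan` function and the sample values are assumptions, and only the predicate `base.city_id = 12` and the dictionary `{3, 5, 9, 14, 21}` come from the slide.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class RowGroup:
    """Toy row group: per-column dictionaries plus the column values."""
    dictionaries: Dict[str, Set[int]]
    columns: Dict[str, List]


def can_skip(row_group, column, value):
    """True if the dictionary proves that no row can equal `value`."""
    dictionary = row_group.dictionaries.get(column)
    return dictionary is not None and value not in dictionary


def scan(row_groups, predicate_column, value, projected):
    """Read only the projected columns, only for rows matching the predicate."""
    blocks = {name: [] for name in projected}
    groups_read = 0
    for group in row_groups:
        if can_skip(group, predicate_column, value):
            continue  # row group is never decoded
        groups_read += 1
        matches = [i for i, v in enumerate(group.columns[predicate_column]) if v == value]
        for name in projected:
            column = group.columns[name]
            blocks[name].extend(column[i] for i in matches)
    return blocks, groups_read


groups = [
    RowGroup({"base.city_id": {3, 5, 9, 14, 21}},
             {"base.city_id": [3, 5, 9, 14, 21],
              "base.driver_uuid": ["a", "b", "c", "d", "e"]}),
    RowGroup({"base.city_id": {12, 14}},
             {"base.city_id": [12, 14, 12],
              "base.driver_uuid": ["f", "g", "h"]}),
]
blocks, groups_read = scan(groups, "base.city_id", 12, ["base.driver_uuid"])
```

In this sample the first row group is skipped because 12 is absent from its dictionary, so only the second group contributes rows.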
  • 31. Federated SQL Layer Vision: HDFS, Vertica, Elasticsearch, Apache Pinot, MySQL. Machines, Reports, Users. Presto, RealTime Presto. Proxy layer, Management, Universal Metadata Service
  • 32. Focus areas Connectors ● Apache Hive, Apache Pinot, Elasticsearch, Apache Cassandra, Vertica, MySQL, etc ● Aggregation / Join pushdown ● Cross-connector optimizations (hybrid connectors) Real-time ● Real-time mode with low latency pass through ● Query plan / result / data cache ● Time-series joins and stitching Universal Metadata Service (UMS) ● Logical definitions / physical schemas ● Column stitching and joins ● Table and partition caching
  • 33. Thank you Proprietary © 2018 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed. All recipients of this document are notified that the information contained herein includes proprietary information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.