Copyright © 2013 Cloudera Inc. All rights reserved.
Hadoop Beyond Batch: Real-time Workloads, SQL-on-Hadoop, and the Virtual EDW
Marcel Kornacker | marcel@cloudera.com 
April 2014
Analytic Workloads on Hadoop: Where Do We Stand?
“DeWitt Clause” prohibits
using DBMS vendor name
Hadoop for Analytic Workloads
•Hadoop has traditionally been used for offline batch processing:
ETL and ELT
•Next step: Hadoop for traditional business intelligence (BI)/data
warehouse (EDW) workloads:
•interactive
•concurrent users
•Topic of this talk: a Hadoop-based open-source stack for EDW
workloads:
•HDFS: a high-performance storage system
•Parquet: a state-of-the-art columnar storage format
•Impala: a modern, open-source SQL engine for Hadoop
Hadoop for Analytic Workloads
•Thesis of this talk:
•techniques and functionality of established commercial
solutions are either already available or are rapidly being
implemented in the Hadoop stack
•the Hadoop stack is an effective solution for certain EDW workloads
•Hadoop-based EDW solution maintains Hadoop’s strengths:
flexibility, ease of scaling, cost effectiveness
HDFS: A Storage System for Analytic Workloads
•Available in HDFS today:
•high-efficiency data scans at or near hardware speed, both
from disk and memory
•On the immediate roadmap:
•co-partitioned tables for even faster distributed joins
•temp-FS: write temp table data straight to memory,
bypassing disk

HDFS: The Details
•High efficiency data transfers
•short-circuit reads: bypass DataNode protocol when reading
from local disk

-> read at 100+MB/s per disk
•HDFS caching: access explicitly cached data w/o copy or
checksumming

-> access memory-resident data at memory bus speed

-> enable in-memory processing
HDFS: The Details
•Coming attractions:
•affinity groups: collocate blocks from different files

-> create co-partitioned tables for improved join
performance
•temp-fs: write temp table data straight to memory,
bypassing disk

-> ideal for iterative interactive data analysis
Parquet: Columnar Storage for Hadoop
•What it is:
•state-of-the-art, open-source columnar file format that’s
available for (most) Hadoop processing frameworks:

Impala, Hive, Pig, MapReduce, Cascading, …
•offers both high compression and high scan efficiency
•co-developed by Twitter and Cloudera; hosted on GitHub and
soon to be an Apache Incubator project
•with contributors from Criteo, Stripe, Berkeley AMPLab,
LinkedIn
•used in production at Twitter and Criteo
Parquet: The Details
•columnar storage: column-major instead of the traditional
row-major layout; used by all high-end analytic DBMSs
•optimized storage of nested data structures: patterned
after Dremel’s ColumnIO format
•extensible set of column encodings:
•run-length and dictionary encodings in current version (1.2)
•delta and optimized string encodings in 2.0
•embedded statistics: version 2.0 stores inlined column
statistics for further optimization of scan efficiency
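To make the column-major layout and the two encodings named above more concrete, here is a minimal Python sketch. It is illustrative only: the function names and the sample rows are hypothetical, and it does not reproduce Parquet's actual on-disk format. It pivots rows into columns, dictionary-encodes one column, then run-length encodes the resulting ids.

```python
from itertools import groupby


def rows_to_columns(rows):
    """Pivot row-major records into a column-major layout (one list per field)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def dictionary_encode(values):
    """Replace each value with a small integer id; return (dictionary, ids)."""
    dictionary, index, ids = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        ids.append(index[value])
    return dictionary, ids


def run_length_encode(ids):
    """Collapse consecutive repeats into (id, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(ids)]


def decode(dictionary, runs):
    """Invert both encodings to recover the original column values."""
    return [dictionary[i] for i, count in runs for _ in range(count)]


# Hypothetical sample data, chosen so the "country" column has long runs.
rows = [
    {"country": "CH", "clicks": 3},
    {"country": "CH", "clicks": 5},
    {"country": "CH", "clicks": 1},
    {"country": "US", "clicks": 7},
    {"country": "US", "clicks": 2},
]
columns = rows_to_columns(rows)
dictionary, ids = dictionary_encode(columns["country"])
runs = run_length_encode(ids)
print(dictionary, runs)
```

Because values of one column sit next to each other, repeated values form runs that compress well, and a scan that needs only `country` never touches `clicks`.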
Parquet: Storage Efficiency
Parquet: Scan Efficiency
Impala: A Modern, Open-Source SQL Engine
•implementation of an MPP SQL query engine for the Hadoop
environment
•highest-performance SQL engine for the Hadoop ecosystem;

already outperforms some of its commercial competitors
•effective for EDW-style workloads
•maintains Hadoop flexibility by utilizing standard Hadoop
components (HDFS, HBase, Metastore, YARN)
•plays well with traditional BI tools:

exposes/interacts with industry-standard interfaces (ODBC/JDBC,
Kerberos and LDAP, ANSI SQL)
Impala: A Modern, Open-Source SQL Engine
•history:
•developed by Cloudera and fully open-source; hosted on GitHub
•released as beta in 10/2012
•1.0 version available in 05/2013
•current version is 1.2.3, available for CDH4 and CDH5 beta
Impala from The User’s Perspective
•create tables as virtual views over data stored in HDFS
or HBase;

schema metadata is stored in Metastore (shared with
Hive, Pig, etc.; basis of HCatalog)
•connect via ODBC/JDBC; authenticate via Kerberos or
LDAP
•run standard SQL:
•current version: ANSI SQL-92 (limited to SELECT and bulk
insert) minus correlated subqueries; includes UDFs and UDAs
Impala from The User’s Perspective
•2014 roadmap:
•1.3: admission control, ORDER BY without LIMIT,
DECIMAL(<precision>, <scale>)
•1.4: analytic window functions
•2.0: support for nested types (structs, arrays, maps), UDTFs,
disk-based joins and aggregation
Impala Architecture
•distributed service:
•daemon process (impalad) runs on every node with data
•easily deployed with Cloudera Manager
•each node can handle user requests; load balancer
configuration for multi-user environments recommended
•query execution phases:
•client request arrives via ODBC/JDBC
•planner turns request into collection of plan fragments
•coordinator initiates execution on remote impalads
Impala Query Execution
• Request arrives via ODBC/JDBC
Impala Query Execution
• Planner turns request into collection of plan fragments
• Coordinator initiates execution on remote impalad nodes
Impala Query Execution
• Intermediate results are streamed between impalads
• Query results are streamed back to client
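The flow on these slides (fragments run on the nodes, intermediate results stream between them, final results stream back to the client) can be mimicked with Python generators. This is a single-process toy under stated assumptions, not Impala code: `scan_fragment`, `coordinator`, and the sample partitions are hypothetical names and data, and the "nodes" run one after another rather than in parallel.

```python
from collections import Counter


def scan_fragment(partition):
    """Stands in for one node: scan local rows and stream out a partial aggregate."""
    partial = Counter()
    for key, value in partition:
        partial[key] += value
    # Yield rows one at a time so the consumer can start merging immediately.
    yield from partial.items()


def coordinator(partitions):
    """Merge the partial aggregates from every fragment and stream final rows out."""
    merged = Counter()
    for partition in partitions:
        for key, subtotal in scan_fragment(partition):
            merged[key] += subtotal
    yield from sorted(merged.items())


# Each inner list plays the role of the data local to one node (hypothetical).
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
]
results = list(coordinator(partitions))
print(results)
```

Only the small partial aggregates cross the "network" here, never the raw rows. That is the point of pushing work into the fragments that sit next to the data.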
Impala Architecture: Query Planning
•2-phase process:
•single-node plan: left-deep tree of query operators
•partitioning into plan fragments for distributed parallel
execution:

maximize scan locality/minimize data movement, parallelize
all query operators
•cost-based join order optimization
•cost-based join distribution optimization
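As a rough illustration of what a cost-based join distribution decision can look like, the sketch below compares two strategies by the number of bytes they would move over the network. The cost model is a simplifying assumption made for this example, and `choose_join_distribution` and its parameters are hypothetical names rather than part of Impala's planner.

```python
def choose_join_distribution(left_bytes, right_bytes, num_nodes):
    """Pick the join strategy that moves fewer bytes over the network.

    broadcast:   send the whole right-hand side to every node
    partitioned: hash-partition both inputs on the join key
    """
    broadcast_cost = right_bytes * num_nodes
    partitioned_cost = left_bytes + right_bytes
    if broadcast_cost <= partitioned_cost:
        return "broadcast", broadcast_cost
    return "partitioned", partitioned_cost


GB = 10**9

# Small dimension table joined to a large fact table: broadcasting is cheap.
small_rhs = choose_join_distribution(1000 * GB, 1 * GB, num_nodes=20)

# Two large tables: copying the right side to every node would cost far more.
large_rhs = choose_join_distribution(1000 * GB, 500 * GB, num_nodes=20)

print(small_rhs, large_rhs)
```

A real optimizer would also weigh memory limits and table statistics; the sketch only captures the data-movement trade-off mentioned on the slide.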
Impala Architecture: Query Execution
•execution engine designed for efficiency, written from scratch
in C++; no reuse of decades-old open-source code
•circumvents MapReduce completely
•in-memory execution:
•aggregation results and right-hand side inputs of joins are
cached in memory
•example: join with 1TB table, reference 2 of 200 cols, 10% of
rows 

-> need to cache 1GB across all nodes in cluster

-> not a limitation for most workloads
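The arithmetic behind the 1GB figure can be checked in a few lines. The helper below is a back-of-the-envelope sketch that assumes all columns are equally wide and that the row filter is uniform; `join_cache_bytes` is a hypothetical name used only for this example.

```python
def join_cache_bytes(table_bytes, cols_referenced, cols_total, row_selectivity):
    """Estimate the bytes held in memory for the right-hand side of a join.

    Assumes equal-width columns and a uniform row filter.
    """
    column_fraction = cols_referenced / cols_total
    return table_bytes * column_fraction * row_selectivity


TB = 10**12
GB = 10**9

estimate = join_cache_bytes(
    1 * TB, cols_referenced=2, cols_total=200, row_selectivity=0.10
)
print(f"{estimate / GB:.1f} GB across the cluster")
```

In other words, 1TB x (2/200) x 10% is about 1GB in total, and that total is spread over every node in the cluster.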
Impala Architecture: Query Execution
•runtime code generation:
•uses LLVM to JIT-compile the runtime-intensive parts of a
query
•effect the same as custom-coding a query:
•remove branches
•propagate constants, offsets, pointers, etc.
•inline function calls
•optimized execution for modern CPUs (instruction pipelines)
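The effect of runtime code generation can be illustrated in plain Python. This is only an analogy: the slide describes JIT compilation with LLVM, whereas the sketch below builds Python source text and compiles it with the built-in `compile` and `exec`. The predicate format and all function names are hypothetical.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def interpret_filter(rows, predicates):
    """Generic path: every row re-walks the predicate list and looks up each operator."""
    return [
        row
        for row in rows
        if all(OPS[op](row[col], const) for col, op, const in predicates)
    ]


def generate_filter(predicates):
    """Specialized path: bake column offsets, operators, and constants into the code."""
    terms = []
    for col, op, const in predicates:
        if op not in OPS:
            raise ValueError(f"unsupported operator: {op}")
        terms.append(f"row[{col}] {op} {const!r}")
    condition = " and ".join(terms) or "True"
    source = (
        "def generated(rows):\n"
        f"    return [row for row in rows if {condition}]\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["generated"], source


# Hypothetical rows (id, amount) and a query-specific predicate list.
rows = [(1, 10.0), (2, 25.0), (3, 40.0), (4, 5.0)]
predicates = [(0, ">", 1), (1, "<", 30.0)]

fast_filter, source = generate_filter(predicates)
print(source)
print(fast_filter(rows))
```

The generated function has no operator lookup and no loop over predicates; the constants and column offsets are part of the code itself, which mirrors the "remove branches" and "propagate constants" bullets above.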
Impala Architecture: Query Execution
Impala vs MR for Analytic Workloads
•Impala vs. SQL-on-MR
•Impala 1.1.1/Hive 0.12 (“Stinger Phases 1 and 2”)
•file formats: Parquet/ORCFile
•TPC-DS, 3TB data set running on 5-node cluster
Impala vs MR for Analytic Workloads
• Impala speedup:
• interactive: 8-69x
• report: 6-68x
• deep analytics: 10-58x
Impala vs non-MR for Analytic Workloads
•Impala 1.2.3/Presto 0.6/Shark
•file formats: RCFile (+ Parquet)
•TPC-DS, 15TB data set running on 21-node cluster
Impala vs non-MR for Analytic Workloads
Impala vs non-MR for Analytic Workloads
• Multi-user benchmark:
• 10 users concurrently
• same dataset, same
hardware
• workload: queries from
“interactive” group
Impala vs non-MR for Analytic Workloads
Scalability in Hadoop
•Hadoop’s promise of linear scalability: add more
nodes to cluster, gain a proportional increase in
capabilities

-> adapt to any kind of workload changes simply by
adding more nodes to cluster
•scaling dimensions for EDW workloads:
•response time
•concurrency/query throughput
•data size
Scalability in Hadoop
•Scalability results for Impala:
•tests show linear scaling along all 3 dimensions
•setup:
•2 clusters: 18 and 36 nodes
•15TB TPC-DS data set
•6 “interactive” TPC-DS queries
Impala Scalability: Latency
Impala Scalability: Concurrency
• Comparison: 10 vs. 20 concurrent users
Impala Scalability: Data Size
• Comparison: 15TB vs. 30TB data set
Summary: Hadoop for Analytic Workloads
•Thesis of this talk:
•techniques and functionality of established commercial
solutions are either already available or are rapidly being
implemented in the Hadoop stack
•Impala/Parquet/HDFS is an effective solution for certain EDW
workloads
•Hadoop-based EDW solution maintains Hadoop’s strengths:
flexibility, ease of scaling, cost effectiveness
Summary: Hadoop for Analytic Workloads
•latest technological innovations add capabilities that
originated in high-end proprietary systems:
•high-performance disk scans and memory caching in HDFS
•Parquet: columnar storage for analytic workloads
•Impala: high-performance parallel SQL execution
Summary: Hadoop for Analytic Workloads
•Impala/Parquet/HDFS for EDW workloads:
•integrates into BI environment via standard connectivity and
security
•comparable or better performance than commercial
competitors
•some SQL limitations remain for now
•but those are rapidly diminishing
Summary: Hadoop for Analytic Workloads
•Impala/Parquet/HDFS maintains traditional Hadoop
strengths:
•flexibility: Parquet is understood across the platform, natively
processed by most popular frameworks
•demonstrated scalability and cost effectiveness
The End
Summary: Hadoop for Analytic Workloads
•what the future holds:
•further performance gains
•more complete SQL capabilities
•improved resource management and ability to handle multiple
concurrent workloads in a single cluster
!40
Ad

More Related Content

What's hot (20)

Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera Impala
Data Science London
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
markgrover
 
Node labels in YARN
Node labels in YARNNode labels in YARN
Node labels in YARN
Wangda Tan
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
Todd Lipcon
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
markgrover
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
Ofir Manor
 
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera, Inc.
 
SQL On Hadoop
SQL On HadoopSQL On Hadoop
SQL On Hadoop
Muhammad Ali
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
Mike Percy
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0
Scott Leberknight
 
Empower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and HadoopEmpower Data-Driven Organizations with HPE and Hadoop
Empower Data-Driven Organizations with HPE and Hadoop
DataWorks Summit/Hadoop Summit
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
OReillyStrata
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
DataWorks Summit
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera, Inc.
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
DataWorks Summit
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Data Con LA
 
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013
Cloudera, Inc.
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks
Hortonworks
 
Incredible Impala
Incredible Impala Incredible Impala
Incredible Impala
Gwen (Chen) Shapira
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
DataWorks Summit
 
Real-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera ImpalaReal-Time Queries in Hadoop w/ Cloudera Impala
Real-Time Queries in Hadoop w/ Cloudera Impala
Data Science London
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
markgrover
 
Node labels in YARN
Node labels in YARNNode labels in YARN
Node labels in YARN
Wangda Tan
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
Todd Lipcon
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
markgrover
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
Ofir Manor
 
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for HadoopCloudera Impala: A Modern SQL Engine for Hadoop
Cloudera Impala: A Modern SQL Engine for Hadoop
Cloudera, Inc.
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
Mike Percy
 
Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0Cloudera Impala, updated for v1.0
Cloudera Impala, updated for v1.0
Scott Leberknight
 
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL DatabasesSQL on Hadoop: Defining the New Generation of Analytic SQL Databases
SQL on Hadoop: Defining the New Generation of Analytic SQL Databases
OReillyStrata
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
DataWorks Summit
 
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for HadoopCloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera Impala: A modern SQL Query Engine for Hadoop
Cloudera, Inc.
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
DataWorks Summit
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Data Con LA
 
Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013Presentations from the Cloudera Impala meetup on Aug 20 2013
Presentations from the Cloudera Impala meetup on Aug 20 2013
Cloudera, Inc.
 
Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks Non-Stop Hadoop for Hortonworks
Non-Stop Hadoop for Hortonworks
Hortonworks
 

Viewers also liked (14)

Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Cloudera, Inc.
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
Cloudera, Inc.
 
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about..."Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
Kai Wähner
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
DataWorks Summit
 
[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten Ichiba
[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten Ichiba[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten Ichiba
[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten Ichiba
Rakuten Group, Inc.
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search Training
Cloudera, Inc.
 
Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle Database
Gwen (Chen) Shapira
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
DataWorks Summit
 
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Lucidworks
 
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data HubsWhat Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
Cloudera, Inc.
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Caserta
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data Warehouse
Caserta
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data Search
Cloudera, Inc.
 
Design in Tech Report 2017
Design in Tech Report 2017Design in Tech Report 2017
Design in Tech Report 2017
John Maeda
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Cloudera, Inc.
 
From Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETLFrom Raw Data to Analytics with No ETL
From Raw Data to Analytics with No ETL
Cloudera, Inc.
 
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about..."Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
Kai Wähner
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
DataWorks Summit
 
[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten Ichiba
[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten Ichiba[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten Ichiba
[RakutenTechConf2013] [B-3_2] DWH/Hadoop in Rakuten Ichiba
Rakuten Group, Inc.
 
Introduction to Cloudera Search Training
Introduction to Cloudera Search TrainingIntroduction to Cloudera Search Training
Introduction to Cloudera Search Training
Cloudera, Inc.
 
Integrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle DatabaseIntegrated Data Warehouse with Hadoop and Oracle Database
Integrated Data Warehouse with Hadoop and Oracle Database
Gwen (Chen) Shapira
 
Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which Hadoop and the Data Warehouse: When to Use Which
Hadoop and the Data Warehouse: When to Use Which
DataWorks Summit
 
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, ClouderaSolr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Solr on HDFS - Past, Present, and Future: Presented by Mark Miller, Cloudera
Lucidworks
 
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data HubsWhat Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
What Comes After The Star Schema? Dimensional Modeling For Enterprise Data Hubs
Cloudera, Inc.
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Caserta
 
Hadoop and Your Data Warehouse
Hadoop and Your Data WarehouseHadoop and Your Data Warehouse
Hadoop and Your Data Warehouse
Caserta
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data Search
Cloudera, Inc.
 
Design in Tech Report 2017
Design in Tech Report 2017Design in Tech Report 2017
Design in Tech Report 2017
John Maeda
 
Ad

Similar to Building a Hadoop Data Warehouse with Impala (20)

Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Data Con LA
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
Felicia Haggarty
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Mladen Kovacevic
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
cdmaxime
 
Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)Bay Area Impala User Group Meetup (Sept 16 2014)
Bay Area Impala User Group Meetup (Sept 16 2014)
Cloudera, Inc.
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
StampedeCon
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
Cloudera, Inc.
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
Jeff Holoman
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
Caserta
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor Landscape
Nicolas Morales
 
Introducing Kudu
Introducing KuduIntroducing Kudu
Introducing Kudu
Jeremy Beard
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Cloudera, Inc.
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
Gwen (Chen) Shapira
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Uri Laserson
 
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives
Cloudera, Inc.
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
Wes McKinney
 
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)Cloudera Operational DB (Apache HBase & Apache Phoenix)
Cloudera Operational DB (Apache HBase & Apache Phoenix)
Timothy Spann
 
Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
Joey Echeverria
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
cdmaxime
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Hakka Labs
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Data Con LA
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
Felicia Haggarty
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Mladen Kovacevic
 

Building a Hadoop Data Warehouse with Impala

  • 1. Copyright © 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on-Hadoop, and the Virtual EDW
    Marcel Kornacker | marcel@cloudera.com
    April 2014
  • 2. Analytic Workloads on Hadoop: Where Do We Stand?
    “DeWitt Clause” prohibits using DBMS vendor name
  • 3. Hadoop for Analytic Workloads
    • Hadoop has traditionally been used for offline batch processing: ETL and ELT
    • Next step: Hadoop for traditional business intelligence (BI)/enterprise data warehouse (EDW) workloads:
      • interactive
      • concurrent users
    • Topic of this talk: a Hadoop-based open-source stack for EDW workloads:
      • HDFS: a high-performance storage system
      • Parquet: a state-of-the-art columnar storage format
      • Impala: a modern, open-source SQL engine for Hadoop
  • 4. Hadoop for Analytic Workloads
    • Thesis of this talk:
      • techniques and functionality of established commercial solutions are either already available or are rapidly being implemented in the Hadoop stack
      • the Hadoop stack is an effective solution for certain EDW workloads
      • a Hadoop-based EDW solution maintains Hadoop’s strengths: flexibility, ease of scaling, cost effectiveness
  • 5. HDFS: A Storage System for Analytic Workloads
    • Available in HDFS today:
      • high-efficiency data scans at or near hardware speed, both from disk and memory
    • On the immediate roadmap:
      • co-partitioned tables for even faster distributed joins
      • temp-FS: write temp table data straight to memory, bypassing disk
  • 6. HDFS: The Details
    • High-efficiency data transfers
      • short-circuit reads: bypass DataNode protocol when reading from local disk
        -> read at 100+ MB/s per disk
      • HDFS caching: access explicitly cached data without copy or checksumming
        -> access memory-resident data at memory bus speed
        -> enable in-memory processing
  • 7. HDFS: The Details
    • Coming attractions:
      • affinity groups: collocate blocks from different files
        -> create co-partitioned tables for improved join performance
      • temp-FS: write temp table data straight to memory, bypassing disk
        -> ideal for iterative interactive data analysis
  • 8. Parquet: Columnar Storage for Hadoop
    • What it is:
      • state-of-the-art, open-source columnar file format that’s available for (most) Hadoop processing frameworks: Impala, Hive, Pig, MapReduce, Cascading, …
      • offers both high compression and high scan efficiency
      • co-developed by Twitter and Cloudera; hosted on GitHub and soon to be an Apache Incubator project
      • with contributors from Criteo, Stripe, Berkeley AMPLab, LinkedIn
      • used in production at Twitter and Criteo
  • 9. Parquet: The Details
    • columnar storage: column-major instead of the traditional row-major layout; used by all high-end analytic DBMSs
    • optimized storage of nested data structures: patterned after Dremel’s ColumnIO format
    • extensible set of column encodings:
      • run-length and dictionary encodings in current version (1.2)
      • delta and optimized string encodings in 2.0
    • embedded statistics: version 2.0 stores inlined column statistics for further optimization of scan efficiency
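The two encodings named on slide 9 can be illustrated with a small sketch. This is not Parquet's on-disk format or its API; it only shows, on a hypothetical `rows` list, why a column-major layout makes dictionary and run-length encoding effective: values of one column sit next to each other, so repeats form long runs.

```python
from itertools import groupby

def dictionary_encode(values):
    """Replace each value with a small integer id (ids assigned in first-seen order)."""
    dictionary = {}
    ids = []
    for value in values:
        ids.append(dictionary.setdefault(value, len(dictionary)))
    return list(dictionary), ids

def run_length_encode(ids):
    """Collapse consecutive repeats into (id, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(ids)]

def decode(dictionary, runs):
    """Invert both encodings to recover the original column."""
    return [dictionary[value] for value, count in runs for _ in range(count)]

# Hypothetical row-major data: (country, order_id).
rows = [("CH", 1), ("CH", 2), ("CH", 3), ("US", 4), ("US", 5), ("CH", 6)]

# Column-major view: all values of one column stored together.
country_column = [row[0] for row in rows]

dictionary, ids = dictionary_encode(country_column)
runs = run_length_encode(ids)
```

Here six strings shrink to a two-entry dictionary plus three (id, length) pairs, and `decode(dictionary, runs)` returns the original column.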
  • 10. Parquet: Storage Efficiency
  • 11. Parquet: Scan Efficiency
  • 12. Impala: A Modern, Open-Source SQL Engine
    • implementation of an MPP SQL query engine for the Hadoop environment
    • highest-performance SQL engine for the Hadoop ecosystem; already outperforms some of its commercial competitors
    • effective for EDW-style workloads
    • maintains Hadoop flexibility by utilizing standard Hadoop components (HDFS, HBase, Metastore, YARN)
    • plays well with traditional BI tools: exposes/interacts with industry-standard interfaces (ODBC/JDBC, Kerberos and LDAP, ANSI SQL)
  • 13. Impala: A Modern, Open-Source SQL Engine
    • history:
      • developed by Cloudera and fully open-source; hosted on GitHub
      • released as beta in 10/2012
      • 1.0 version available in 05/2013
      • current version is 1.2.3, available for CDH4 and CDH5 beta
  • 14. Impala from the User’s Perspective
    • create tables as virtual views over data stored in HDFS or HBase; schema metadata is stored in Metastore (shared with Hive, Pig, etc.; basis of HCatalog)
    • connect via ODBC/JDBC; authenticate via Kerberos or LDAP
    • run standard SQL:
      • current version: ANSI SQL-92 (limited to SELECT and bulk insert) minus correlated subqueries; has UDFs and UDAs
  • 15. Impala from the User’s Perspective
    • 2014 roadmap:
      • 1.3: admission control, ORDER BY without LIMIT, DECIMAL(<precision>, <scale>)
      • 1.4: analytic window functions
      • 2.0: support for nested types (structs, arrays, maps), UDTFs, disk-based joins and aggregation
  • 16. Impala Architecture
    • distributed service:
      • daemon process (impalad) runs on every node with data
      • easily deployed with Cloudera Manager
      • each node can handle user requests; load balancer configuration for multi-user environments recommended
    • query execution phases:
      • client request arrives via ODBC/JDBC
      • planner turns request into collection of plan fragments
      • coordinator initiates execution on remote impalad nodes
  • 17. Impala Query Execution
    • Request arrives via ODBC/JDBC
  • 18. Impala Query Execution
    • Planner turns request into collection of plan fragments
    • Coordinator initiates execution on remote impalad nodes
  • 19. Impala Query Execution
    • Intermediate results are streamed between impalad nodes
    • Query results are streamed back to client
  • 20. Impala Architecture: Query Planning
    • 2-phase process:
      • single-node plan: left-deep tree of query operators
      • partitioning into plan fragments for distributed parallel execution: maximize scan locality/minimize data movement, parallelize all query operators
    • cost-based join order optimization
    • cost-based join distribution optimization
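As a rough illustration of what a cost-based join distribution decision involves, the sketch below compares two common strategies by the bytes they would move over the network. The cost formulas, the function name `choose_join_distribution`, and the input sizes are simplifying assumptions for illustration; the slides do not describe Impala's actual cost model.

```python
def choose_join_distribution(lhs_bytes, rhs_bytes, num_nodes):
    """Pick the join strategy that moves fewer bytes (assumed, simplified cost model).

    broadcast:   ship the whole right-hand input to every node
    partitioned: hash-partition both inputs across the cluster
    """
    broadcast_cost = rhs_bytes * num_nodes
    partitioned_cost = lhs_bytes + rhs_bytes
    if broadcast_cost <= partitioned_cost:
        return "broadcast", broadcast_cost
    return "partitioned", partitioned_cost

GB = 10**9

# Large fact table joined with a small dimension table: broadcasting is cheaper.
small_rhs = choose_join_distribution(1000 * GB, 1 * GB, num_nodes=21)

# Two inputs of similar size: partitioning both sides is cheaper.
similar_size = choose_join_distribution(100 * GB, 50 * GB, num_nodes=21)
```

Under these assumptions the choice depends on table sizes and cluster size, which is why the planner needs size estimates to make it.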
  • 21. Impala Architecture: Query Execution
    • execution engine designed for efficiency, written from scratch in C++; no reuse of decades-old open-source code
    • circumvents MapReduce completely
    • in-memory execution:
      • aggregation results and right-hand side inputs of joins are cached in memory
      • example: join with 1TB table, reference 2 of 200 columns, 10% of rows
        -> need to cache 1GB across all nodes in cluster
        -> not a limitation for most workloads
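The 1GB figure in the slide 21 example follows from simple arithmetic, sketched below. It assumes decimal units (1TB = 10^12 bytes) and that all 200 columns are equally wide; both are assumptions, since the slide does not state them.

```python
TB = 10**12
GB = 10**9

def join_cache_bytes(table_bytes, referenced_cols, total_cols, selected_rows_pct):
    """Bytes of the right-hand join input that must stay in memory.

    Assumes equally wide columns, so referencing k of n columns keeps k/n of the data.
    """
    column_bytes = table_bytes * referenced_cols // total_cols
    return column_bytes * selected_rows_pct // 100

# 1TB table, 2 of 200 columns referenced, 10% of rows selected.
cache_bytes = join_cache_bytes(1 * TB, referenced_cols=2, total_cols=200, selected_rows_pct=10)
```

Column pruning reduces the 1TB to 10GB, and the row filter reduces that to 1GB, which is the amount the slide says must be cached across the cluster.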
  • 22. Impala Architecture: Query Execution
    • runtime code generation:
      • uses LLVM to JIT-compile the runtime-intensive parts of a query
      • the effect is the same as custom-coding a query:
        • remove branches
        • propagate constants, offsets, pointers, etc.
        • inline function calls
      • optimized execution for modern CPUs (instruction pipelines)
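The idea behind runtime code generation on slide 22 can be shown by analogy in Python. Impala uses LLVM to emit machine code, which this sketch does not do. It only contrasts a generic interpreter, which looks up the column and operator for every row, with a function generated for one specific predicate in which the column offset, operator, and constant are baked in. All names here (`interpret`, `generate`, `schema`, `predicate`) are hypothetical.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}

def interpret(predicate, schema, row):
    """Generic path: resolves column offset and operator again for every row."""
    column, op, constant = predicate
    return OPS[op](row[schema.index(column)], constant)

def generate(predicate, schema):
    """Specialised path: emit source with offset, operator and constant fixed, then compile it."""
    column, op, constant = predicate
    if op not in OPS:
        raise ValueError(f"unsupported operator: {op}")
    offset = schema.index(column)  # resolved once, at generation time
    source = f"def matches(row):\n    return row[{offset}] {op} {constant!r}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["matches"]

schema = ("id", "price", "qty")
rows = [(1, 9.5, 3), (2, 20.0, 1), (3, 15.0, 7)]
predicate = ("price", ">", 10.0)

matches = generate(predicate, schema)
interpreted = [interpret(predicate, schema, row) for row in rows]
generated = [matches(row) for row in rows]
```

Both paths return the same answers; the generated function simply has no per-row lookups or dispatch left, which mirrors the "remove branches" and "propagate constants, offsets" items on the slide.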
  • 23. Impala Architecture: Query Execution
  • 24. Impala vs MR for Analytic Workloads
    • Impala vs. SQL-on-MR
      • Impala 1.1.1/Hive 0.12 (“Stinger Phases 1 and 2”)
      • file formats: Parquet/ORCFile
      • TPC-DS, 3TB data set running on 5-node cluster
  • 25. Impala vs MR for Analytic Workloads
    • Impala speedup:
      • interactive: 8-69x
      • report: 6-68x
      • deep analytics: 10-58x
  • 26. Impala vs non-MR for Analytic Workloads
    • Impala 1.2.3/Presto 0.6/Shark
    • file formats: RCFile (+ Parquet)
    • TPC-DS, 15TB data set running on 21-node cluster
  • 27. Impala vs non-MR for Analytic Workloads
  • 28. Impala vs non-MR for Analytic Workloads
    • Multi-user benchmark:
      • 10 users concurrently
      • same dataset, same hardware
      • workload: queries from “interactive” group
  • 29. Impala vs non-MR for Analytic Workloads
  • 30. Scalability in Hadoop
    • Hadoop’s promise of linear scalability: add more nodes to cluster, gain a proportional increase in capabilities
      -> adapt to any kind of workload changes simply by adding more nodes to cluster
    • scaling dimensions for EDW workloads:
      • response time
      • concurrency/query throughput
      • data size
  • 31. Scalability in Hadoop
    • Scalability results for Impala:
      • tests show linear scaling along all 3 dimensions
      • setup:
        • 2 clusters: 18 and 36 nodes
        • 15TB TPC-DS data set
        • 6 “interactive” TPC-DS queries
  • 32. Impala Scalability: Latency
  • 33. Impala Scalability: Concurrency
    • Comparison: 10 vs 20 concurrent users
  • 34. Impala Scalability: Data Size
    • Comparison: 15TB vs. 30TB data set
  • 35. Summary: Hadoop for Analytic Workloads
    • Thesis of this talk:
      • techniques and functionality of established commercial solutions are either already available or are rapidly being implemented in the Hadoop stack
      • Impala/Parquet/HDFS is an effective solution for certain EDW workloads
      • a Hadoop-based EDW solution maintains Hadoop’s strengths: flexibility, ease of scaling, cost effectiveness
  • 36. Summary: Hadoop for Analytic Workloads
    • latest technological innovations add capabilities that originated in high-end proprietary systems:
      • high-performance disk scans and memory caching in HDFS
      • Parquet: columnar storage for analytic workloads
      • Impala: high-performance parallel SQL execution
  • 37. Summary: Hadoop for Analytic Workloads
    • Impala/Parquet/HDFS for EDW workloads:
      • integrates into BI environment via standard connectivity and security
      • comparable or better performance than commercial competitors
      • SQL limitations remain for now
      • but those are rapidly diminishing
  • 38. Summary: Hadoop for Analytic Workloads
    • Impala/Parquet/HDFS maintains traditional Hadoop strengths:
      • flexibility: Parquet is understood across the platform, natively processed by most popular frameworks
      • demonstrated scalability and cost effectiveness
  • 40. Summary: Hadoop for Analytic Workloads
    • what the future holds:
      • further performance gains
      • more complete SQL capabilities
      • improved resource management and ability to handle multiple concurrent workloads in a single cluster