Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5

© Cloudera, Inc. All rights reserved.
Justin Erickson | Director of Product Management | Cloudera

Agenda
• History of SQL-on-Hadoop technologies
• Picking the right tool for the job
• What’s new with Cloudera 5.5
• Real-world use cases
• Future of SQL-on-Hadoop

MapReduce: The Early Years
The original processing engine for Hadoop
• Process any type of data in any format
• Scale infinitely for multiple, large jobs
• Pioneer of bringing compute to data
But…
• Difficult to program
• Slow processing
• Limited expressivity
[Diagram: PROCESS layer: Batch (MapReduce); STORE layer: Filesystem (HDFS)]

One Platform, Many Workloads
Batch, Interactive, and Real-Time.
Leading performance and usability in one platform.
• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users
[Platform diagram]
• OPERATIONS: Cloudera Manager, Cloudera Director
• DATA MANAGEMENT: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer
• PROCESS, ANALYZE, SERVE: Batch (Spark, Hive, Pig, MapReduce); Stream (Spark); SQL (Impala); Search (Solr); SDK (Kite)
• UNIFIED SERVICES: Resource Management (YARN); Security (Sentry, RecordService)
• STORE: Filesystem (HDFS); Relational (Kudu); NoSQL (HBase)
• INTEGRATE: Structured (Sqoop); Unstructured (Kafka, Flume)

The Need for SQL for Batch Processing
Apache Hive
• Eases development on MapReduce with familiar SQL
• Built for long-running ETL, data preparation, and batch processing
• Shared data structures across Hadoop tools

The Need for Interactive SQL for BI
Apache Impala (incubating)
• Low latency for interactive performance
• Built for multi-user workloads
• Compatible with SQL and leading BI partner tools

The Need for Flexible Data Processing
Apache Spark (and Spark SQL)
• Easy development
• Flexible, extensible API across multiple workload types
• In-memory batch and stream processing performance boost

Focus on Open Source Standards
Open source does not guarantee a future-proof investment
Long-Term Architecture
• Only open standards get continuing, long-term investment from across the ecosystem.
Avoidance of Lock-in
• Open standards have multi-vendor support, giving customers choices and preventing lock-in.
Ecosystem Compatibility
• Open standards attract more third-party connectors/certifications due to broad adoption.

Choosing the Right SQL Engine
Know Your Audience, Know Your Use Case
• Batch Processing: Hive
• BI and SQL Analytics: Impala
• Procedural Development: Spark SQL

SQL-on-Hadoop in Cloudera 5.5
Apache Hive
Audience: ETL Developers
Strengths:
• Built for very long-running ETL, data preparation, or batch processing
• Supports custom file formats
• Handles massive ETL sorts with joins
New Features:
• Hive in the cloud (S3)
• Hive-on-Spark beta
• Governance & Lineage
Apache Impala (incubating)
Audience: Business Analysts
Strengths:
• Scales to high-concurrency
• Supports high-performance interactive SQL
• Compatible with BI tools & skills
• Hadoop integration & usability
New Features:
• Nested data types
• Column-level security
• Integration with Kudu (beta)
Apache Spark SQL
Audience: Data Engineers & Data Scientists
Strengths:
• Easily embed SQL into Java, Scala, or Python applications
• Simple language for common operations
• Seamlessly mix SQL and Spark code within a single application
New Features:
• Support for Spark SQL & DataFrames
• Hive integration
• Automatic performance optimizations
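The audience row of the comparison above can be read as a simple lookup. The sketch below is only an illustration of that reading: the names `ENGINE_BY_AUDIENCE` and `suggest_engine` are hypothetical and not part of any Cloudera tooling, and real engine selection would also weigh the strengths listed for each engine.

```python
# Hypothetical lookup built from the "Audience" row of the comparison.
# Names are illustrative only; they are not part of any Cloudera API.
ENGINE_BY_AUDIENCE = {
    "etl developers": "Apache Hive",
    "business analysts": "Apache Impala (incubating)",
    "data engineers": "Apache Spark SQL",
    "data scientists": "Apache Spark SQL",
}


def suggest_engine(audience: str) -> str:
    """Return the engine the comparison associates with an audience."""
    key = audience.strip().lower()
    if key not in ENGINE_BY_AUDIENCE:
        raise ValueError(f"unknown audience: {audience!r}")
    return ENGINE_BY_AUDIENCE[key]


if __name__ == "__main__":
    for who in ("ETL Developers", "Business Analysts", "Data Scientists"):
        print(f"{who}: {suggest_engine(who)}")
```

A starting point like this only encodes the slide's rule of thumb; a team doing both ETL and BI would typically use more than one engine.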

SQL-on-Hadoop Benchmark
Impala, Spark SQL, Hive-on-Tez
Versions:
• Impala 2.3
• Hive 2.0 on Tez 0.5.2 (aka “Stinger”)
• Spark SQL 1.5 with Tungsten
Benchmark Details:
• Based on industry standards (TPC)
• Repeatable
• Methodical testing with multiple runs on same hardware
• Help competing software do well
• Run on optimal file formats for each
• Tune query engines appropriately

Impala Multi-User Performance Over 7x Faster
[Chart: Single User vs 10 User Response Time / Impala Times Faster. Time in seconds; lower bars = better]
• Impala: single user 4, 10 users 12.8
• Spark SQL (with Tungsten): single user 32, 10 users 97 (Impala 7.2x faster single user, 7.6x faster at 10 users)
• Hive-on-Tez: single user 59, 10 users 210 (Impala 13.4x faster single user, 16.4x faster at 10 users)

Impala Enables Nearly 7x Throughput
[Chart: Query Throughput / Impala Throughput Times Faster. Queries per hour; higher bars = better]
• Impala: 2045
• Spark SQL (with Tungsten): 302 (Impala 6.8x)
• Hive-on-Tez: 136 (Impala 15x)
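The "times faster" labels on the two charts can be checked with simple ratios. The sketch below assumes the bars map to engines in the order Impala, Spark SQL, Hive-on-Tez, and uses the values as printed on the charts; the function and variable names are illustrative.

```python
# Values as labeled on the benchmark charts (chart labels are rounded).
# Engine-to-bar mapping is an assumption: Impala, Spark SQL, Hive-on-Tez.
RESPONSE_10_USERS_SEC = {"Impala": 12.8, "Spark SQL": 97.0, "Hive-on-Tez": 210.0}
THROUGHPUT_QPH = {"Impala": 2045, "Spark SQL": 302, "Hive-on-Tez": 136}


def times_better(baseline, other, lower_is_better):
    """Return how many times better `baseline` is than `other`, to 1 decimal."""
    ratio = other / baseline if lower_is_better else baseline / other
    return round(ratio, 1)


def impala_advantage(metric, lower_is_better):
    """Compare every non-Impala engine in `metric` against Impala."""
    base = metric["Impala"]
    return {
        name: times_better(base, value, lower_is_better)
        for name, value in metric.items()
        if name != "Impala"
    }


latency_advantage = impala_advantage(RESPONSE_10_USERS_SEC, lower_is_better=True)
throughput_advantage = impala_advantage(THROUGHPUT_QPH, lower_is_better=False)

print(latency_advantage)
print(throughput_advantage)
```

The 10-user response-time and throughput ratios reproduce the chart labels (7.6x, 16.4x, 6.8x, 15x). The single-user multipliers are not recomputed here because the printed single-user Impala value (4) appears to be rounded.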

Performance Benchmark Takeaways
• Impala unlocks BI usage directly on Hadoop
• Meets BI low-latency and multi-user requirements
• Advantage expands when moving from a single user to just 10 users
• Hive is designed (and still great) for batch processing
• Most Impala customers use Hive for data preparation
• Hive is the most commonly used ETL framework
• Spark SQL enables easier Spark application development
• Enables mixed procedural Spark (Java/Scala) and SQL job development
• Mid-term trends will further favor Impala’s design approach for latency and concurrency
• More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
• CPU efficiency will increase in importance
• Native code enables easy optimizations for CPU instruction sets
• Intel joint roadmap supports these opportunities

Use Cases

PROBLEM
Needed to efficiently collect, process, and analyze data from growing hospital network
• EDW couldn’t meet scale and unstructured data demands
• Processing too slow for actionable decisions
• Limited, time-consuming supply chain matching
SOLUTION
Integrated 1000s of hospital systems through unified enterprise data hub
• Ingest and process 45% more spend data
• Faster analytics on $41B through end-user healthcare spend dashboard
• Unprecedented matching of 98% of supply chain data
• Better TCO through unification and licensing costs for new opportunities

PROBLEM
Clients had limited insight into thousands of marketing campaigns across channels
• Clients want real-time campaign updates with 3-sec SLA
• Existing system couldn’t meet scaling or data type demands
• Limited self-service BI
SOLUTION
Built next-generation digital marketing platform for 360-degree customer view
• Improved query performance from minutes to seconds to meet SLAs
• Enhanced modeling with combined online and offline data
• Real-time optimizations through interactive, self-service access

PROBLEM
Couldn’t support data integration across 20+ brands
• Existing systems couldn’t scale for data consolidation
• Siloed access based on workload
• No real-time data ingestion or access
SOLUTION
Brought all data directly to the business to lower costs and open up new use cases fast
• Reduced TCO by 50% by consolidating over 1PB of data, adding 200M rows daily
• Enabled real-time vs hourly updates on ad performance
• Optimized inventory management through data matching and consolidation

Impala Roadmap
2H 2015 1H 2016 2016
• SQL Support & Usability
• Nested structures
• Kudu updates (beta)
• Management & Security
• Record reader service (beta)
• Finer-grained security (Sentry)
• Integration
• Isilon support
• Python interface (Ibis)
• Performance & Scale
• Improved predictability under concurrency
• Continued scalability and concurrency
• Initial perf/scale improvements
• Improved admission control
• Resource utilization and showback
• Dynamic partitioning
• Improved timestamp compatibility
• >20x performance
• Multi-threaded joins/aggregations
• Continued scale work
• Improved YARN integration
• Automated metadata
• Integration
• S3 support
• Nested types with Avro
• Date type
• Added SQL extensions

Download Cloudera 5.5
cloudera.com/downloads

Try It With Cloudera Live
cloudera.com/live
Featuring tutorials on:

Cloudera Enterprise
Making Hadoop Fast, Easy, and Secure
A new kind of data platform:
• One place for unlimited data
• Unified, multi-framework data access
Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise

Thank You!
