These slides provide highlights of my book HDInsight Essentials. Book link is here: http://www.packtpub.com/establish-a-big-data-solution-using-hdinsight/book
Migrating structured data between Hadoop and RDBMS – Bouquet
- The document discusses migrating structured data between Hadoop and relational databases using a tool called Bouquet.
- Bouquet allows users to select data from a relational database, which is then sent to Spark via Kafka and stored in HDFS/Tachyon for processing.
- The enriched data in Spark can then be re-injected back into the original database.
Impala architecture presentation at the Toronto Hadoop User Group in January 2014, by Mark Grover.
Event details:
http://www.meetup.com/TorontoHUG/events/150328602/
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of an Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S... – Cloudera, Inc.
Apache Drill is an interactive SQL query engine for analyzing large scale datasets. It allows for querying data stored in HBase and other data sources. Drill uses an optimistic execution model and late binding to schemas to enable fast queries without requiring metadata definitions. It leverages recent techniques like vectorized operators and late record materialization to improve performance. The project is currently in alpha stage but aims to support features like nested queries, Hive UDFs, and optimized joins with HBase.
This document summarizes Syncsort's high-performance data integration solutions for Hadoop. Syncsort has over 40 years of experience innovating performance solutions. Their DMExpress product provides high-speed connectivity to Hadoop and accelerates ETL workflows. It uses partitioning and parallelization to load data into HDFS 6x faster than native methods. DMExpress also enhances usability with a graphical interface and accelerates MapReduce jobs by replacing sort functions. Customers report TCO reductions of 50-75% and ROI within 12 months by using DMExpress to optimize their Hadoop deployments.
James Kinley from Cloudera:
An introduction to Cloudera Impala. Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
The link to the video: http://zurichtechtalks.ch/post/37339409724/an-introduction-to-cloudera-impala-sql-on-top-of
Impala is an open source SQL query engine for Apache Hadoop that allows real-time queries on large datasets stored in HDFS and other data stores. It uses a distributed architecture where an Impala daemon runs on each node and coordinates query planning and execution across nodes. Impala allows SQL queries to be run directly against files stored in HDFS and other formats like Avro and Parquet. It aims to provide high performance for both analytical and transactional workloads through its C++ implementation and avoidance of MapReduce.
This document discusses Hive on Spark, including background on Hive, Spark, and the Shark project. It describes how Hive on Spark keeps the same physical abstraction as Hive on Tez/MR to be architecturally compatible. Examples are provided of how a simple query and join query are executed in MapReduce and Spark formats. Improvements to Spark for reduce-side joins and remote Spark contexts are also discussed.
This document discusses using Sqoop to transfer data between relational databases and Hadoop. It begins by providing context on big data and Hadoop. It then introduces Sqoop as a tool for efficiently importing and exporting large amounts of structured data between databases and Hadoop. The document explains that Sqoop allows importing data from databases into HDFS for analysis and exporting summarized data back to databases. It also outlines how Sqoop works, including providing a pluggable connector mechanism and allowing scheduling of jobs.
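As a rough sketch of the import/export flow described above, the snippet below drives the Sqoop command line from Python. The JDBC URL, credentials, table names, and HDFS paths are placeholders, not values from the original deck.

```python
import subprocess

# Hypothetical connection details -- replace with your own database and cluster paths.
JDBC_URL = "jdbc:mysql://db.example.com/sales"

def sqoop_import(table: str, target_dir: str) -> None:
    """Import one relational table into HDFS using the Sqoop CLI."""
    subprocess.run(
        ["sqoop", "import",
         "--connect", JDBC_URL,
         "--username", "etl_user", "-P",   # -P prompts for the password
         "--table", table,
         "--target-dir", target_dir,
         "--num-mappers", "4"],            # parallel import with 4 map tasks
        check=True,
    )

def sqoop_export(table: str, export_dir: str) -> None:
    """Export summarized HDFS data back into a relational table."""
    subprocess.run(
        ["sqoop", "export",
         "--connect", JDBC_URL,
         "--username", "etl_user", "-P",
         "--table", table,
         "--export-dir", export_dir],
        check=True,
    )

if __name__ == "__main__":
    sqoop_import("orders", "/warehouse/staging/orders")
    sqoop_export("orders_daily_summary", "/warehouse/summary/orders_daily")
```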
A comprehensive overview of the entire Hadoop operations and tools landscape: cluster management, coordination, ingestion, streaming, formats, storage, resources, processing, workflow, analysis, search and visualization
Qubole is a big data as a service platform that allows users to run analytics jobs on AWS infrastructure. It integrates tightly with various AWS services like EC2, S3, Redshift, and Kinesis. Qubole handles cluster provisioning and management, provides tools for interactive querying using Presto, and allows customers to access data across different AWS data platforms through a single interface. Some key benefits of Qubole include simplified management of AWS resources, optimized performance through techniques like auto-scaling and caching, and unified analytics platform for tools like Hive, Spark and Presto.
http://bit.ly/1BTaXZP – Hadoop has been a huge success in the data world. It’s disrupted decades of data management practices and technologies by introducing a massively parallel processing framework. The community and the development of all the Open Source components pushed Hadoop to where it is now.
That's why the Hadoop community is excited about Apache Spark. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine learning and graph analysis. Spark is quickly establishing itself as a leading environment for fast, iterative in-memory and streaming analysis.
This talk will give an introduction to the Spark stack, explain how Spark achieves lightning-fast results, and show how it complements Apache Hadoop.
Keys Botzum - Senior Principal Technologist with MapR Technologies
Keys is Senior Principal Technologist with MapR Technologies, where he wears many hats. His primary responsibility is interacting with customers in the field, but he also teaches classes, contributes to documentation, and works with engineering teams. He has over 15 years of experience in large scale distributed system design. Previously, he was a Senior Technical Staff Member with IBM, and a respected author of many articles on the WebSphere Application Server as well as a book.
Hadoop and Hive are used at Facebook for large scale data processing and analytics using commodity hardware and open source software. Hive provides an SQL-like interface to query large datasets stored in Hadoop and translates queries into MapReduce jobs. It is used for daily/weekly data aggregations, ad-hoc analysis, data mining, and other tasks using datasets exceeding petabytes in size stored on Hadoop clusters.
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks – DataWorks Summit
This document discusses using Apache Drill and business intelligence (BI) tools to analyze network data stored in Hadoop. It provides examples of querying network packet captures and APIs directly using SQL without needing to transform or structure the data first. This allows gaining insights into issues like dropped sensor readings by analyzing packets alongside other data sources. The document concludes that SQL-on-Hadoop technologies allow network analysis to be done in a BI context more quickly than traditional specialized tools.
This document discusses Apache Kudu, an open source column-oriented storage system that provides fast analytics on fast data. It describes Kudu's design goals of high throughput for large scans, low latency for short accesses, and database-like semantics. The document outlines Kudu's architecture, including its use of columnar storage, replication for fault tolerance, and integrations with Spark, Impala and other frameworks. It provides examples of using Kudu for IoT and real-time analytics use cases. Performance comparisons show Kudu outperforming other NoSQL systems on analytics and operational workloads.
A comprehensive introduction to the big data world in the AWS cloud: Hadoop, streaming, batch, Kinesis, DynamoDB, HBase, EMR, Athena, Hive, Spark, Pig, Impala, Oozie, Data Pipeline, security, cost, and best practices.
A talk given by Ted Dunning in February 2013 on Apache Drill, an open-source, community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
This document summarizes Hoodie, an open source incremental processing framework. Key points:
- Hoodie provides upsert and incremental processing capabilities on top of a Hadoop data lake to enable near real-time queries while avoiding costly full scans.
- It introduces primitives like upsert and incremental pull to apply mutations and consume only changed data.
- Hoodie stores data on HDFS and provides different views like read optimized, real-time, and log views to balance query performance and data latency for analytical workloads.
- The framework is open source and built on Spark, providing horizontal scalability and leveraging existing Hadoop SQL query engines like Hive and Presto.
HBaseCon 2012 | HBase for the World's Libraries - OCLC – Cloudera, Inc.
WorldCat is the world’s largest network of library content and services. Over 25,000 libraries in 170 countries have cooperated for 40 years to build WorldCat. OCLC is currently in the process of transitioning Worldcat from Oracle to Apache HBase. This session will discuss our data design for representing the constantly changing ownership information for thousands of libraries (billions of data points, millions of daily updates) and our plans for how we’re managing HBase in an environment that is equal parts end user facing and batch.
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop – HBaseCon
Kylin is an open source distributed analytics engine contributed by eBay that provides a SQL interface and OLAP on Hadoop, supporting extremely large datasets. Kylin's pre-built MOLAP cubes (stored in HBase), distributed architecture, and high concurrency help users run multidimensional queries via SQL and other BI tools. During this session, you'll learn how Kylin uses HBase's key-value store to serve SQL queries with a relational schema.
Hive provides an SQL-like interface to query data stored in Hadoop's HDFS distributed file system and processed using MapReduce. It allows users without MapReduce programming experience to write queries that Hive then compiles into a series of MapReduce jobs. The document discusses Hive's components, data model, query planning and optimization techniques, and performance compared to other frameworks like Pig.
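To make the "SQL compiled into MapReduce jobs" point concrete, here is a minimal sketch that submits a HiveQL aggregation from Python via PyHive; the HiveServer2 host, table, and column names are hypothetical.

```python
from pyhive import hive

# Hypothetical HiveServer2 endpoint and table; adjust for your cluster.
conn = hive.Connection(host="hive.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# A plain aggregation query: Hive compiles this into one or more MapReduce
# (or Tez/Spark) jobs behind the scenes -- no MapReduce code is written by hand.
cursor.execute(
    "SELECT campaign, COUNT(*) AS impressions "
    "FROM ad_log WHERE dt = '2014-01-01' "
    "GROUP BY campaign"
)

for campaign, impressions in cursor.fetchall():
    print(campaign, impressions)

cursor.close()
conn.close()
```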
This document summarizes Hortonworks' Hadoop distribution called Hortonworks Data Platform (HDP). It discusses how HDP provides a comprehensive data management platform built around Apache Hadoop and YARN. HDP includes tools for storage, processing, security, operations and accessing data through batch, interactive and real-time methods. The document also outlines new capabilities in HDP 2.2 like improved engines for SQL, Spark and streaming and expanded deployment options.
Maintaining Low Latency While Maximizing Throughput on a Single Cluster – MapR Technologies
The good news: Hadoop has a lot of tools. The bad news: Hadoop has a lot of tools, and conflicting priorities. This talk shows how advances in YARN and Mesos allow you to run multiple distinct workloads together. We show how to use SLA and latency rules along with preemption in YARN to maintain high throughput while guaranteeing latency for applications such as HBase and Drill.
Exponea - Kafka and Hadoop as components of architecture – Martin Strycek
Kafka and Hadoop were introduced at Exponea to address several issues:
- The in-memory database was very fast but limited by memory constraints. Customers wanted the freedom to analyze all their data.
- Processing large volumes of streaming data was problematic.
- HDFS does not support appending new data files, so Kafka was introduced to stream data for storage in Hadoop (a minimal producer sketch follows this list).
- The new technologies introduced monitoring challenges for the expanded data stack.
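A minimal sketch of the Kafka-in-front-of-Hadoop pattern described above, using the kafka-python client. The broker address, topic, and event fields are assumptions; a separate consumer job (for example a Kafka Connect HDFS sink) would batch the topic into Hadoop-friendly files.

```python
import json
import time
from kafka import KafkaProducer

# Hypothetical broker and topic; events are buffered in Kafka and a separate
# consumer job periodically writes them to HDFS in large, append-friendly batches.
producer = KafkaProducer(
    bootstrap_servers="kafka01.example.com:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def track_event(customer_id: str, event_type: str) -> None:
    """Publish one tracking event to the 'events' topic."""
    producer.send("events", {
        "customer_id": customer_id,
        "type": event_type,
        "ts": int(time.time()),
    })

track_event("c-1001", "page_view")
producer.flush()   # make sure buffered events reach the broker before exiting
```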
Big Data Day LA 2016 / NoSQL track - Apache Kudu: Fast Analytics on Fast Data, ... – Data Con LA
1) Apache Kudu is a new updatable columnar storage engine for Apache Hadoop that facilitates fast analytics on fast data.
2) Kudu is designed to address gaps in the current Hadoop storage landscape by providing both high throughput for big scans and low latency for short accesses simultaneously.
3) Kudu integrates with various Hadoop components like Spark, Impala, MapReduce to enable SQL queries and other analytics workloads on fast updating data.
Impala 2.0 - The Best Analytic Database for Hadoop – Cloudera, Inc.
A look at why SQL access in Hadoop is critical and the benefits of a native Hadoop analytic database, what’s new with Impala 2.0 and some of the recent performance benchmarks, some common Impala use cases and production customer stories, and insight into what’s next for Impala.
Protecting Your IP with Perforce Helix and Interset – Perforce
The intellectual property stored in your SCM system comprises your company’s most valuable assets. How do you keep those assets safe from threats inside and outside your organization?
This session by Charlie McLouth, Director of Technical Sales at Perforce, and Mark Bennett, Vice President at Interset, will give you a deep dive into how Perforce Helix keeps your assets safe, including real-world scenarios of Interset's Threat Detection. You’ll hear how Interset Threat Detection applies advanced behavioral analytics to user activities to proactively surface threats to the IP stored in the Helix Versioning Engine.
You’ll also hear how Helix’s fine-grained permissions model provides unified security policies that govern access-control based on LDAP authentication and contextual information such as IP address of the client or file paths.
This document describes a simulator for database aggregation using metadata. The simulator sits between an end-user application and a database management system (DBMS) to intercept SQL queries and transform them to take advantage of available aggregates using metadata describing the data warehouse schema. The simulator provides performance gains by optimizing queries to use appropriate aggregate tables. It was found to improve performance over previous aggregate navigators by making fewer calls to system tables through the use of metadata mappings. Experimental results showed the simulator solved queries faster than alternative approaches by transforming queries to leverage aggregate tables.
This document describes code for a Memcached server written in Go. It uses TCP listeners and handlers to accept connections from clients. It maintains an in-memory key-value store using a map. It can handle get and set requests by parsing the request header and building responses. It uses channels for communication between the main listener routine and a backend handler for the key-value store.
This document provides an overview of the ethnic groups that make up the population of Morocco throughout history. It discusses several indigenous Berber groups like the Masmuda, Zenata, and Sanhaja tribes that settled across Morocco. It also mentions Arab groups that migrated to Morocco like the Doui-Menia and Banu Hilal. The document outlines the history of Morocco from the 8th century onwards and the various dynasties that ruled the country and influenced its ethnic composition. It provides details on current Berber tribes located in different regions of Morocco like the Ait Atta, Ait Waryaghar, Ait Seghrouchen, Ait Yafelman, Chiadma
Josh Berkus
Most users know that PostgreSQL has a 23-year development history. But did you know that Postgres code is used for over a dozen other database systems? Thanks to our liberal licensing, many companies and open source projects over the years have taken the Postgres or PostgreSQL code, changed it, added things to it, and/or merged it into something else. Illustra, Truviso, Aster, Greenplum, and others have seen the value of Postgres not just as a database but as some darned good code they could use. We'll explore the lineage of these forks, and go into the details of some of the more interesting ones.
This document compares the performance and scalability of Elasticsearch and Solr for two use cases: product search and log analytics. For product search, both products performed well at high query volumes, but Elasticsearch handled the larger video dataset faster. For logs, Elasticsearch performed better by using time-based indices across hot and cold nodes to isolate newer and older data. In general, configuration was found to impact performance more than differences between the products. Proper testing with one's own data is recommended before making conclusions.
HBaseCon 2015: Running ML Infrastructure on HBase – HBaseCon
Sift Science uses online, large-scale machine learning to detect fraud for thousands of sites and hundreds of millions of users in real-time. This talk describes how we leverage HBase to power an ML infrastructure including how we train and build models, store and update model parameters online, and provide real-time predictions. The central pieces of the machine learning infrastructure and the tradeoffs we made to maximize performance will also be covered.
Presto is a distributed SQL query engine that allows for interactive analysis of large datasets across various data sources. It was created at Facebook to enable interactive querying of data in HDFS and Hive, which were too slow for interactive use. Presto addresses problems with existing solutions like Hive being too slow, the need to copy data for analysis, and high costs of commercial databases. It uses a distributed architecture with coordinators planning queries and workers executing tasks quickly in parallel.
Apache HBase is the Hadoop open source, distributed, versioned storage manager, well suited for random, real-time read/write access. This talk gives an overview of how HBase achieves random I/O, focusing on the storage layer internals: starting from how the client interacts with the Region Servers and the Master, it goes into WAL, MemStore, compactions and on-disk format details, and looks at how the storage is used by features like snapshots and how it can be improved to gain flexibility, performance and space efficiency.
The document discusses erasure coding as an alternative to replication in distributed storage systems like HDFS. It notes that while replication provides high durability, it has high storage overhead, and erasure coding can provide similar durability with half the storage overhead but slower recovery. The document outlines how major companies like Facebook, Windows Azure Storage, and Google use erasure coding. It then provides details on HDFS-EC, including its architecture, use of hardware acceleration, and performance evaluation showing its benefits over replication.
LesFurets.com: From 0 to Cassandra on AWS in 30 days - Tsunami Alerting Syste... – DataStax Academy
An earthquake occurs in the Sea of Japan. A tsunami is likely to hit the coast. The population must be warned by SMS. A datacenter has been damaged by the earthquake. Will the alerting system still work?
Building this simple alerting system is a great way to start with Cassandra, as we discovered while teaching a big data hands-on class at a French university.
What were the reasons that led a majority of students to choose Cassandra to implement a fast, resilient and highly available big data system to be deployed on AWS?
What were the common pitfalls, the modeling alternatives and their performance impact?
Inside Freshworks' Migration from Cassandra to ScyllaDB by Premkumar Patturaj – ScyllaDB
Freshworks migrated from Cassandra to ScyllaDB to handle growing audit log data efficiently. Cassandra required frequent scaling, complex repairs, and had non-linear scaling. ScyllaDB reduced costs with fewer machines and improved operations. Using Zero Downtime Migration (ZDM), they bulk-migrated data, performed dual writes, and validated consistency.
DC Migration and Hadoop Scale For Big Billion Days – Rahul Agarwal
Deck from a BigData MeetUp in Bangalore.
Presents the challenges faced in migrating a Hadoop cluster with 200+ TB of compressed data from one data centre to another, from bare metal to cloud, and from DNS-based systems to infrastructure without DNS.
MariaDB ColumnStore is a high performance columnar storage engine for MariaDB that supports analytical workloads on large datasets. It uses a distributed, massively parallel architecture to provide faster and more efficient queries. Data is stored column-wise which improves compression and enables fast loading and filtering of large datasets. The cpimport tool allows loading data into MariaDB ColumnStore in bulk from CSV files or other sources, with options for centralized or distributed parallel loading. Proper sizing of ColumnStore deployments depends on factors like data size, workload, and hardware specifications.
Migration to ClickHouse. Practical guide, by Alexander Zaitsev – Altinity Ltd
This document provides a summary of migrating to ClickHouse for analytics use cases. It discusses the author's background and company's requirements, including ingesting 10 billion events per day and retaining data for 3 months. It evaluates ClickHouse limitations and provides recommendations on schema design, data ingestion, sharding, and SQL. Example queries demonstrate ClickHouse performance on large datasets. The document outlines the company's migration timeline and challenges addressed. It concludes with potential future integrations between ClickHouse and MySQL.
Big Data Analytics with MariaDB ColumnStore – MariaDB plc
MariaDB ColumnStore is a massively parallel columnar storage engine for MariaDB that provides high performance analytics on large datasets. It uses a distributed columnar architecture where each column is stored separately and data is partitioned horizontally across nodes. This allows for very fast analytical queries by only accessing the relevant columns and partitions. Some key features include built-in analytics functions, high speed data ingestion, and support for running on-premises or on cloud platforms like AWS. The latest 1.1 version adds capabilities like streaming data ingestion APIs, improved high availability with GlusterFS, and performance optimizations.
Best Practices for Supercharging Cloud Analytics on Amazon Redshift – SnapLogic
In this webinar, we discuss how the secret sauce of your business analytics strategy remains rooted in your approach, methodologies and the amount of data incorporated into this critical exercise. We also address best practices to supercharge your cloud analytics initiatives, and tips and tricks on designing the right information architecture, data models and other tactical optimizations.
To learn more, visit: http://www.snaplogic.com/redshift-trial
By 2020, 50% of all new software will process machine-generated data of some sort (Gartner). Historically, machine data use cases have required non-SQL data stores like Splunk, Elasticsearch, or InfluxDB.
Today, new SQL DB architectures rival the non-SQL solutions in ease of use, scalability, cost, and performance. Please join this webinar for a detailed comparison of machine data management approaches.
Low-Latency Analytics with NoSQL – Introduction to Storm and Cassandra – Caserta
Businesses are generating and ingesting an unprecedented volume of structured and unstructured data to be analyzed. Needed is a scalable Big Data infrastructure that processes and parses extremely high volume in real-time and calculates aggregations and statistics. Banking trade data where volumes can exceed billions of messages a day is a perfect example.
Firms are fast approaching 'the wall' in terms of scalability with relational databases, and must stop imposing relational structure on analytics data and map raw trade data to a data model in low latency, preserve the mapped data to disk, and handle ad-hoc data requests for data analytics.
Joe discusses and introduces NoSQL databases, describing how they are capable of scaling far beyond relational databases while maintaining performance, and shares a real-world case study that details the architecture and technologies needed to ingest high-volume data for real-time analytics.
For more information, visit www.casertaconcepts.com
Overview of data analytics service: Treasure Data Service – Satoshi Tagomori
Treasure Data provides a data analytics service with the following key components:
- Data is collected from various sources using Fluentd and loaded into PlazmaDB.
- PlazmaDB is the distributed time-series database that stores metadata and data.
- Jobs like queries, imports, and optimizations are executed on Hadoop and Presto clusters using queues, workers, and a scheduler.
- The console and APIs allow users to access the service and submit jobs for processing and analyzing their data.
WyspaIT 2016 - Azure Stream Analytics and Azure Machine Learning in the analysis of str... – Łukasz Grala
The growth of data arriving in the form of streams has created a need to analyze streaming data in real time. The session demonstrates the combination of:
- Event Hub / IoT Hub
- Azure Stream Analytics
- Azure Machine Learning
MongoDB has taken a clear lead in adoption among the new generation of databases, including the enormous variety of NoSQL offerings. A key reason for this lead has been a unique combination of agility and scalability. Agility provides business units with a quick start and flexibility to maintain development velocity, despite changing data and requirements. Scalability maintains that flexibility while providing fast, interactive performance as data volume and usage increase. We'll address the key organizational, operational, and engineering considerations to ensure that agility and scalability stay aligned at increasing scale, from small development instances to web-scale applications. We will also survey some key examples of highly-scaled customer applications of MongoDB.
Real-Time Streaming: Move IMS Data to Your Cloud Data Warehouse – Precisely
With over 22,000 transactions processed every second, your mainframe IMS is a critical source of data for the cloud data warehouses that feed analytics, customer experience or regulatory initiatives. However, extracting data from mainframe IMS can be time-consuming and costly, leading to the exclusion of IMS data from cloud data warehouses all together – and leaving valuable insights unseen.
Never ignore or manually extract mainframe IMS data again. In this on-demand webcast, you will learn how Connect CDC enables your team to develop integrations quickly and easily between mainframe IMS and cloud data warehouses in the most cost-effective way possible.
In this deck from FOSDEM'19, Christoph Angerer from NVIDIA presents: Rapids - Data Science on GPUs.
"The next big step in data science will combine the ease of use of common Python APIs, but with the power and scalability of GPU compute. The RAPIDS project is the first step in giving data scientists the ability to use familiar APIs and abstractions while taking advantage of the same technology that enables dramatic increases in speed in deep learning. This session highlights the progress that has been made on RAPIDS, discusses how you can get up and running doing data science on the GPU, and provides some use cases involving graph analytics as motivation.
GPUs and GPU platforms have been responsible for the dramatic advancement of deep learning and other neural net methods in the past several years. At the same time, traditional machine learning workloads, which comprise the majority of business use cases, continue to be written in Python with heavy reliance on a combination of single-threaded tools (e.g., Pandas and Scikit-Learn) or large, multi-CPU distributed solutions (e.g., Spark and PySpark). RAPIDS, developed by a consortium of companies and available as open source code, allows for moving the vast majority of machine learning workloads from a CPU environment to GPUs. This allows for a substantial speed up, particularly on large data sets, and affords rapid, interactive work that previously was cumbersome to code or very slow to execute. Many data science problems can be approached using a graph/network view, and much like traditional machine learning workloads, this has been either local (e.g., Gephi, Cytoscape, NetworkX) or distributed on CPU platforms (e.g., GraphX). We will present GPU-accelerated graph capabilities that, with minimal conceptual code changes, allows both graph representations and graph-based analytics to achieve similar speed ups on a GPU platform. By keeping all of these tasks on the GPU and minimizing redundant I/O, data scientists are enabled to model their data quickly and frequently, affording a higher degree of experimentation and more effective model generation. Further, keeping all of this in compatible formats allows quick movement from feature extraction, graph representation, graph analytic, enrichment back to the original data, and visualization of results. RAPIDS has a mission to build a platform that allows data scientist to explore data, train machine learning algorithms, and build applications while primarily staying on the GPU and GPU platforms."
Learn more: https://rapids.ai/
and
https://fosdem.org/2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document summarizes a presentation by Dr. Christoph Angerer on RAPIDS, an open source library for GPU-accelerated data science. Some key points:
- RAPIDS provides an end-to-end GPU-accelerated workflow for data science using CUDA and popular tools like Pandas, Spark, and XGBoost.
- It addresses challenges with data movement and formats by keeping data on the GPU as much as possible using the Apache Arrow data format.
- Benchmarks show RAPIDS provides significant speedups over CPU for tasks like data preparation, machine learning training, and visualization.
- Future work includes improving cuDF (GPU DataFrame library), adding algorithms to cuML
MySQL performance monitoring using StatsD and Graphite – DB-Art
This session will explain how you can leverage the MySQL-StatsD collector, StatsD and Graphite to monitor your database performance with metrics sent every second. In the past few years Graphite has become the de facto standard for monitoring large and scalable infrastructures.
This session will cover the architecture, functional basics and dashboard creation using Grafana. MySQL-StatsD is really easy to set up and configure. It will allow you to fetch your most important metrics from MySQL, run your own custom queries to parse your production data and if necessary transform this data into something different that can be used as a metric. Having this data with a fine granularity allows you to correlate your production data, system metrics with your MySQL performance metrics.
MySQL-StatsD is a daemon written in Python that was created during one of the hackdays at my previous employer (Spil Games) to solve the issue of fetching data from MySQL using a light weight client and send metrics to StatsD. I currently maintain this open source project on Github as it is my duty as creator of the project to look after it.
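For readers unfamiliar with the StatsD side of this pipeline, here is a minimal Python sketch of pushing MySQL-derived metrics to a StatsD daemon with the `statsd` package. The host, prefix, and metric names are assumptions; the real mysql-statsd daemon gathers its values from MySQL status queries rather than hard-coding them.

```python
import statsd

# Hypothetical StatsD daemon address and metric prefix.
client = statsd.StatsClient(host="localhost", port=8125, prefix="mysql.db01")

# mysql-statsd polls MySQL status and custom queries every second; here we just
# show what pushing one gauge and one counter to StatsD looks like.
client.gauge("threads_running", 12)   # point-in-time value
client.incr("questions", 250)         # counter increment since last poll
```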
Tweaking performance on high-load projects – Думанский Дмитрий (GeeksLab Odessa)
This document discusses optimizing the performance of several high-load projects delivering billions of requests per month. It summarizes the evolution and delivery loads of different projects over time. It then analyzes the technical stacks and architectures used, identifying problems and solutions implemented around areas like querying, data storage, processing, and networking. Key lessons learned are around sharding and resharding data, optimizing I/O, using streaming processing like Storm over batch processing like Hadoop, and working within AWS limits and capabilities.
New to MongoDB? We'll provide an overview of installation, high availability through replication, scale out through sharding, and options for monitoring and backup. No prior knowledge of MongoDB is assumed. This session will jumpstart your knowledge of MongoDB operations, providing you with context for the rest of the day's content.
Solr Power FTW: Powering NoSQL the World Over – Alex Pinkin
Solr is an open source, Lucene based search platform originally developed by CNET and used by the likes of Netflix, Yelp, and StubHub which has been rapidly growing in popularity and features during the last few years. Learn how Solr can be used as a Not Only SQL (NoSQL) database along the lines of Cassandra, Memcached, and Redis. NoSQL data stores are regularly described as non-relational, distributed, internet-scalable and are used at both Facebook and Digg. This presentation will quickly cover the fundamentals of NoSQL data stores, the basics of Lucene, and what Solr brings to the table. Following that we will dive into the technical details of making Solr your primary query engine on large scale web applications, thus relegating your traditional relational database to little more than a simple key store. Real solutions to problems like handling four billion requests per month will be presented. We'll talk about sizing and configuring the Solr instances to maintain rapid response times under heavy load. We'll show you how to change the schema on a live system with tens of millions of documents indexed while supporting real-time results. And finally, we'll answer your questions about ways to work around the lack of transactions in Solr and how you can do all of this in a highly available solution.
This document discusses using Apache Geode and ActiveMQ Artemis to build a scalable IoT platform. It introduces IoT and the MQTT protocol. ActiveMQ Artemis is described as a high performance message broker that is embeddable and supports clustering. Geode is presented as a distributed in-memory data platform for building data-intensive applications that require high performance, scalability, and availability. Example users of Geode include large companies handling billions of records and thousands of transactions per second. Key capabilities of Geode like regions, functions, querying, and continuous queries are summarized.
3. Today's agenda
• Big data in Miaozhen (秒针系统)
• Overview of Cloudera Impala
• Hacking practice in Cloudera Impala
• Performance
• Conclusions
• Q&A
4. What happened at Miaozhen
• 3 billion ad impressions per day
• 20 TB data scan for report generation every morning
• 24-server cluster
• Besides this:
  – TV Monitor
  – Mobile Monitor
  – Site Monitor
  – …
5. Before Hadoop
• Scrat
  – PostgreSQL 9.1 cluster
  – Wrote a simple proxy
  – <2 s for a 2 TB data scan
• Mobile Monitor
  – Hadoop-like distributed computing system
  – RabbitMQ + 3 computing servers
  – Wrote a MapReduce in C++
  – Handles 30 million to 500 million ad impressions
6. Problem & Opportunity
• Database cluster
• SQL on Hadoop
• Miscellaneous data
• Requirements
  – Most data is relational
  – SQL interface
8. What's this?
• A kind of MPP engine
• In-memory processing
• Small-to-big joins
  – Broadcast join
• Small result sizes
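To illustrate the broadcast join mentioned on the slide above: the small table is shipped in full to every node, and each node joins it against its local partition of the big table, so the big table never has to be shuffled. A toy, single-process sketch with made-up data (plain lists stand in for per-node partitions):

```python
# The "big" fact table, already partitioned across nodes (here: 2 toy partitions).
big_table_partitions = [
    [("imp1", "cmp_a"), ("imp2", "cmp_b")],   # node 0: (impression_id, campaign_id)
    [("imp3", "cmp_a"), ("imp4", "cmp_c")],   # node 1
]

# The "small" dimension table: campaign_id -> campaign name.
small_table = {"cmp_a": "Spring Sale", "cmp_b": "Brand Push", "cmp_c": "Holiday"}

def broadcast_join(partitions, small):
    """Each node receives a full copy of the small table and joins locally."""
    results = []
    for partition in partitions:            # conceptually runs on every node in parallel
        local_copy = dict(small)            # the broadcast: ship the whole small table
        for impression_id, campaign_id in partition:
            if campaign_id in local_copy:   # hash lookup, no shuffle of the big table
                results.append((impression_id, local_copy[campaign_id]))
    return results

print(broadcast_join(big_table_partitions, small_table))
# [('imp1', 'Spring Sale'), ('imp2', 'Brand Push'), ('imp3', 'Spring Sale'), ('imp4', 'Holiday')]
```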
9. Why Cloudera Impala
• The team moves fast
  – UDFs coming out
  – Better join strategies on the way
• Good code base
  – Modular
  – Easy to add subclasses
• Really fast
  – LLVM code generation
    • 80 s / 95 s on the UV test
  – Distributed aggregation tree
  – In-situ data processing (inside the storage layer)
10. Typical architecture
[Architecture diagram: a SQL interface and a meta store on top; each node (three shown) runs a query planner, a coordinator, and an exec engine.]
11. Our target
• An MPP database
  – Built on PostgreSQL 9.1
  – Scales well
  – Speed
• A mixed-data-source MPP query engine
  – Join two tables from different sources
  – In fact…
12. Hacking… where to start
• Add, don't change
  – Scan node type
  – DB meta info
• Put changes in configuration
  – Thrift protocol update
    • TDBHostInfo
    • TDBScanNode
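The slides do not show the actual Thrift definitions. Purely to illustrate the kind of information the new TDBHostInfo / TDBScanNode messages would have to carry, here is a hypothetical Python sketch; the field names are guesses, not the project's real schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DBHostInfo:
    """Hypothetical stand-in for TDBHostInfo: where a database shard lives."""
    host: str
    port: int
    database: str

@dataclass
class DBScanNode:
    """Hypothetical stand-in for TDBScanNode: what one scan fragment should read."""
    table_name: str
    columns: List[str]
    predicate: str            # pushed-down filter, e.g. "campaign = 'cmp_a'"
    hosts: List[DBHostInfo]   # candidate shards for this scan

scan = DBScanNode(
    table_name="ad_log",
    columns=["id", "campaign"],
    predicate="dt = '2013-06-01'",
    hosts=[DBHostInfo("pg01.example.com", 5432, "ads")],
)
print(scan)
```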
13. Front end
• Meta store update
  – Link data to the table name
  – Table location management
• Front end
  – Compute table locations
14. Back end
• Coordinator
  – PG host
• New scan node type
  – DB scan node
    • PG scan node
    • PostgreSQL client library using a cursor
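The PG scan node above streams rows out of PostgreSQL through the client library with a cursor, so a large table never has to be materialized in memory at once. The deck's scan node lives in Impala's C++ backend; the sketch below shows the same cursor-based streaming pattern in Python with psycopg2, with placeholder connection details and table name.

```python
import psycopg2

# Hypothetical connection parameters for one PostgreSQL shard.
conn = psycopg2.connect(host="pg01.example.com", dbname="ads",
                        user="impala_scan", password="secret")

# A named (server-side) cursor streams rows in batches instead of materializing
# the full result set on the client -- the same idea the PG scan node relies on
# when feeding rows to the exec engine.
with conn, conn.cursor(name="pg_scan") as cur:
    cur.itersize = 10_000                      # rows fetched per round trip
    cur.execute("SELECT id, campaign FROM ad_log WHERE dt = %s", ("2013-06-01",))
    for row_id, campaign in cur:
        pass                                   # hand each row to the consumer

conn.close()
```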
15. SQL plan
• select count(distinct id) from table
  – An MR-like, multi-stage process:
    HDFS/PG scan
    → Aggr.: group by id (local pre-aggregation)
    → Exchange node (repartition by id)
    → Aggr.: group by id (merge)
    → Aggr.: count(id)
    → Exchange node
    → Aggr.: sum(count(id))
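A toy, single-process simulation of that plan: each "node" deduplicates its local ids, the exchange repartitions distinct ids by hash so each id lands on exactly one node, each node counts its share, and a final sum yields the global count(distinct id). The data and node count are made up.

```python
from collections import defaultdict

NUM_NODES = 3

# Step 1: scan -- each node reads its own partition of (id, ...) rows.
scanned = [
    ["u1", "u2", "u2", "u3"],        # node 0
    ["u2", "u4", "u4"],              # node 1
    ["u1", "u5"],                    # node 2
]

# Step 2: local pre-aggregation (group by id) removes duplicates per node.
local_distinct = [set(ids) for ids in scanned]

# Step 3: exchange -- repartition by hash(id) so each distinct id goes to one node.
exchanged = defaultdict(set)
for ids in local_distinct:
    for _id in ids:
        exchanged[hash(_id) % NUM_NODES].add(_id)

# Steps 4-5: merge aggregation (group by id), then count(id) per node.
partial_counts = [len(ids) for ids in exchanged.values()]

# Steps 6-7: final exchange and sum(count(id)).
print(sum(partial_counts))   # -> 5 distinct ids: u1..u5
```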
16. Environment
• Ad impression logs
  – 150 million, 100 KB/line
• 3 servers
  – 24 cores
  – 32 GB memory
  – 12 × 2 TB hard drives
  – 100 Mbps LAN
• Queries
  – select count(id) from t group by campaign
  – select count(distinct id) from t group by campaign
  – select * from t where id = 'xxxxxxxx'
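For reference, queries like these can be driven from Python through impyla, the DB-API client for Impala; the coordinator host and table name below are placeholders (21050 is the default impalad HiveServer2 port).

```python
import time
from impala.dbapi import connect

# Hypothetical Impala coordinator.
conn = connect(host="impala01.example.com", port=21050)
cursor = conn.cursor()

QUERIES = [
    "SELECT count(id) FROM t GROUP BY campaign",
    "SELECT count(DISTINCT id) FROM t GROUP BY campaign",
    "SELECT * FROM t WHERE id = 'xxxxxxxx'",
]

for sql in QUERIES:
    start = time.time()
    cursor.execute(sql)
    rows = cursor.fetchall()
    print(f"{time.time() - start:6.1f}s  {len(rows):8d} rows  {sql}")

cursor.close()
conn.close()
```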
17. Performance
• Group-by speed per core: 20 M/s
[Bar chart comparing impala, hive and pg+impala across three test queries; y-axis 0–700.]
19. Codegen on/off
• select count(distinct id) from t group by c
• select distinct id from t
• select id from t group by id
  having count(case when c = '1' then 1 else null end) > 0
     and count(case when c = '2' then 1 else null end) > 0
  limit 10;
[Bar chart comparing runtimes with code generation enabled (en_codegen) vs. disabled (dis_codegen) for the uv_test, distinct and duplicated cases; y-axis 0–100.]
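The codegen comparison can be reproduced by toggling Impala's DISABLE_CODEGEN query option and timing the same statement both ways. A sketch with impyla; note that running SET as a regular statement assumes a reasonably recent Impala (older releases expose the option only through impala-shell), and the host and table are placeholders.

```python
import time
from impala.dbapi import connect

QUERY = "SELECT count(DISTINCT id) FROM t GROUP BY c"

conn = connect(host="impala01.example.com", port=21050)
cursor = conn.cursor()

for disable_codegen in ("false", "true"):
    # DISABLE_CODEGEN=false -> LLVM code generation on; true -> interpreted path.
    cursor.execute(f"SET DISABLE_CODEGEN={disable_codegen}")
    start = time.time()
    cursor.execute(QUERY)
    cursor.fetchall()
    print(f"disable_codegen={disable_codegen}: {time.time() - start:.1f}s")

cursor.close()
conn.close()
```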
21. Conclusions
• Source quality
  – Readable
  – Google C++ style
  – Robust
• MPP solution based on PG
  – Proven performance
  – Easy to scale
• Mixed engine usage
  – HDFS and DB