Impala Architecture Presentation at the Toronto Hadoop User Group in January 2014 by Mark Grover.
Event details:
http://www.meetup.com/TorontoHUG/events/150328602/
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide performance tuning and testing tips for your Spark applications.
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.
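To make the multi-source, multi-format point concrete, here is a minimal PySpark sketch (file paths, column names, and table names are hypothetical) that reads a Parquet dataset and a JSON dataset and combines them with a single Spark SQL query:

```python
from pyspark.sql import SparkSession

# Start a local session; in a real deployment this would point at a cluster.
spark = SparkSession.builder.appName("spark-sql-multi-source").getOrCreate()

# Hypothetical inputs: one Parquet dataset and one JSON dataset.
orders = spark.read.parquet("/data/orders.parquet")
customers = spark.read.json("/data/customers.json")

# Register both as temporary views so they can be queried together with plain SQL.
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

# One Spark SQL query spanning the two sources.
result = spark.sql("""
    SELECT c.country, COUNT(*) AS order_count
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.country
""")
result.show()
```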
This is a presentation deck for Data+AI Summit 2021 at
https://databricks.com/session_na21/enabling-vectorized-engine-in-apache-spark
This document provides a summary of improvements made to Hive's performance through the use of Apache Tez and other optimizations. Some key points include:
- Hive was improved to use Apache Tez as its execution engine instead of MapReduce, reducing latency for interactive queries and improving throughput for batch queries.
- Statistics collection was optimized to gather column-level statistics from ORC file footers, speeding up statistics gathering.
- The cost-based optimizer Optiq was added to Hive, allowing it to choose better execution plans.
- Vectorized query processing, broadcast joins, dynamic partitioning, and other optimizations improved individual query performance by over 100x in some cases.
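As a rough illustration of how these optimizations are switched on from a client, the sketch below uses the PyHive library against a hypothetical HiveServer2 endpoint; the property names are standard Hive settings, and the table name is an assumption, so verify the values against your Hive version:

```python
from pyhive import hive

# Hypothetical HiveServer2 endpoint and user.
conn = hive.connect(host="hiveserver2.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# Run Hive on Tez instead of MapReduce and enable the optimizations discussed above.
cursor.execute("SET hive.execution.engine=tez")
cursor.execute("SET hive.vectorized.execution.enabled=true")  # vectorized query processing
cursor.execute("SET hive.cbo.enable=true")                    # cost-based optimizer
cursor.execute("SET hive.compute.query.using.stats=true")     # answer simple queries from stats

# Collect column-level statistics so the optimizer has something to work with.
cursor.execute("ANALYZE TABLE web_logs COMPUTE STATISTICS FOR COLUMNS")
```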
Hive Bucketing in Apache Spark with Tejas PatilDatabricks
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to make Spark 3-5x faster when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in a 2-3x savings when compared to Hive. You’ll also hear about real-world applications of bucketing, like loading of cumulative tables with daily delta, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
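A minimal PySpark sketch of paying the one-time bucketing cost at write time (the input path, table name, bucket count, and column are hypothetical); note that bucketBy is used with saveAsTable rather than a plain path-based write:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketing-example").enableHiveSupport().getOrCreate()

events = spark.read.parquet("/data/events.parquet")  # hypothetical input

# Bucket and sort by the join/filter key while writing the table out.
(events.write
    .bucketBy(64, "user_id")   # 64 buckets hashed on user_id
    .sortBy("user_id")         # pre-sort within each bucket
    .format("parquet")
    .mode("overwrite")
    .saveAsTable("events_bucketed"))

# Later joins on user_id between two tables bucketed the same way can avoid the shuffle
# and sort, giving the single-stage sort-merge join described above.
```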
Presto on Apache Spark: A Tale of Two Computation EnginesDatabricks
The architectural tradeoffs between the map/reduce paradigm and parallel databases have been a long and open discussion since the dawn of MapReduce more than a decade ago. At Facebook, we have spent the past several years independently building and scaling both Presto and Spark to Facebook-scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto’s state-of-the-art low-latency evaluation with Spark’s robust and fault-tolerant execution engine.
This document discusses Apache Tez, a framework for accelerating Hadoop query processing. Some key points:
- Tez is a dataflow framework that expresses computations as directed acyclic graphs (DAGs) of tasks, allowing for optimizations like container reuse and locality-aware scheduling.
- It is built on YARN and provides a customizable execution engine as well as APIs for applications like Hive and Pig.
- By expressing jobs as DAGs, Tez can reduce overheads and queueing delays and better utilize cluster resources compared to the traditional MapReduce framework.
- The document provides examples of how Tez can improve performance for operations like joins, aggregations, and handling of multiple outputs.
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...Andrew Lamb
DataFusion is an extensible and embeddable query engine, written in Rust, used to create modern, fast and efficient data pipelines, ETL processes, and database systems.
This presentation explains where it fits into the data ecosystem and how it helps you implement your system in Rust.
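For a feel of the embeddable-engine idea, here is a small sketch using the datafusion Python bindings; the CSV path and table name are hypothetical, and the exact binding API may differ between releases, so treat this as an assumption to check against the version you install:

```python
from datafusion import SessionContext

# Create an in-process query engine session (no external server involved).
ctx = SessionContext()

# Register a hypothetical CSV file as a queryable table.
ctx.register_csv("trips", "/data/trips.csv")

# Run SQL entirely inside the embedded engine and collect Arrow record batches.
df = ctx.sql("SELECT vendor_id, COUNT(*) AS n FROM trips GROUP BY vendor_id")
print(df.collect())
```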
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
Zstandard is a fast compression algorithm which you can use in Apache Spark in various ways. In this talk, I briefly summarize the evolution of Apache Spark in this area, four main use cases, their benefits, and the next steps:
1) ZStandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It’s beneficial not only when you use `emptyDir` with `memory` medium, but also it maximizes OS cache benefit when you use shared SSDs or container local storage. In Spark 3.2, SPARK-34390 takes advantage of ZStandard buffer pool feature and its performance gain is impressive, too.
2) Event log compression is another area to save storage costs on cloud storage like S3 and to improve usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 already supports Zstandard, and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There is more community work underway to utilize Zstandard to improve Spark. For example, the Apache Avro community also supports Zstandard, and SPARK-34479 aims to support Zstandard in Spark’s Avro file format in Spark 3.2.0.
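A hedged sketch of switching these codecs on in PySpark; the settings below are standard Spark configuration keys, but the Zstandard support for ORC/Parquet data files assumes Spark 3.2+ with the corresponding ORC/Parquet versions, and the output path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("zstd-everywhere")
    # 1) Compress shuffle and other internal IO streams with Zstandard.
    .config("spark.io.compression.codec", "zstd")
    # 2) Compress event logs with Zstandard instead of LZ4.
    .config("spark.eventLog.compress", "true")
    .config("spark.eventLog.compression.codec", "zstd")
    # 3) Use Zstandard for ORC and Parquet data files.
    .config("spark.sql.orc.compression.codec", "zstd")
    .config("spark.sql.parquet.compression.codec", "zstd")
    .getOrCreate())

# Hypothetical write: the resulting Parquet files are Zstandard-compressed.
spark.range(1_000_000).write.mode("overwrite").parquet("/data/zstd_demo")
```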
Iceberg provides capabilities beyond traditional partitioning of data in Spark/Hive. It allows updating or deleting individual rows without rewriting partitions through row-level operations such as merge-on-read (MOR). It also supports ACID transactions through versions, faster queries through statistics and sorting, and flexible schema changes. Iceberg manages metadata that traditional formats like Parquet do not, enabling these new capabilities. It is useful for workloads that require updating or filtering data at a granular record level, managing data history through versions, or frequent schema changes.
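To illustrate the row-level operations, here is a sketch using Spark SQL against a hypothetical Iceberg catalog and table; the catalog name, namespace, table, and the "updates" view are all assumptions, and the session is assumed to be configured with the Iceberg runtime and SQL extensions:

```python
from pyspark.sql import SparkSession

# Assumes a catalog named "demo" has been configured on the session
# (spark.sql.catalog.demo, warehouse location, Iceberg runtime jar, etc.).
spark = SparkSession.builder.appName("iceberg-row-level").getOrCreate()

# Delete individual rows without rewriting whole partitions.
spark.sql("DELETE FROM demo.db.events WHERE user_id = 42")

# Upsert change records with MERGE; "updates" is assumed to be a temp view
# of incoming changes registered earlier in the job.
spark.sql("""
    MERGE INTO demo.db.events t
    USING updates s
    ON t.event_id = s.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```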
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon
In this presentation, we will introduce Hotspot's Garbage First collector (G1GC) as the most suitable collector for latency-sensitive applications running in large-memory environments. We will first discuss G1GC internal operations and tuning opportunities, and also cover tuning flags that set desired GC pause targets, change adaptive GC thresholds, and adjust GC activities at runtime. We will provide several HBase case studies using Java heaps as large as 100GB that show how to best tune applications to remove unpredictable, protracted GC pauses.
Apache Hive 3 introduces new capabilities for data analytics including materialized views, default columns, constraints, and improved JDBC and Kafka connectors to enable real-time streaming and integration with external systems like Druid; Hive 3 also improves performance and query optimization through a new query result cache, workload management, and cloud storage optimizations. Data Analytics Studio provides self-service analytics on top of Hive 3 through a visual interface to optimize queries, monitor performance, and manage data lifecycles.
This document discusses optimizing Spark write-heavy workloads to S3 object storage. It describes problems with eventual consistency, renames, and failures when writing to S3. It then presents several solutions implemented at Qubole to improve the performance of Spark writes to Hive tables, including writing directly to the Hive warehouse location. These optimizations include parallelizing renames, writing directly to the warehouse, and making partition recovery faster by using more efficient S3 listing. Performance improvements of up to 7x were achieved.
HDFS on Kubernetes—Lessons Learned with Kimoon KimDatabricks
There is growing interest in running Apache Spark natively on Kubernetes (see https://github.com/apache-spark-on-k8s/spark). Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. When running Spark on Kubernetes, if the HDFS daemons run outside Kubernetes, applications will slow down while accessing the data remotely.
This session will demonstrate how to run HDFS inside Kubernetes to speed up Spark. In particular, it will show how the Spark scheduler can still provide HDFS data locality on Kubernetes by discovering the mapping from Kubernetes containers to physical nodes and to HDFS datanode daemons. You’ll also learn how you can provide Spark with the high availability of the critical HDFS namenode service when running HDFS in Kubernetes.
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks
Kubernetes is a fast growing open-source platform which provides container-centric infrastructure. Conceived by Google in 2014, and leveraging over a decade of experience running containers at scale internally, it is one of the fastest moving projects on GitHub with 1000+ contributors and 40,000+ commits. Kubernetes has first class support on Google Cloud Platform, Amazon Web Services, and Microsoft Azure.
Unlike YARN, Kubernetes started as a general purpose orchestration framework with a focus on serving jobs. Support for long-running, data intensive batch workloads required some careful design decisions. Engineers across several organizations have been working on Kubernetes support as a cluster scheduler backend within Spark. During this process, we encountered several challenges in translating Spark considerations into idiomatic Kubernetes constructs. In this talk, we describe the challenges and the ways in which we solved them. This talk will be technical and is aimed at people who are looking to run Spark effectively on their clusters. The talk assumes basic familiarity with cluster orchestration and containers.
This is the presentation I made at JavaDay Kiev 2015 regarding the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames and some other high-level stuff, and can be used as an introduction to Apache Spark.
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...HostedbyConfluent
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Ethan Guo | Current 2022
Back in 2016, Apache Hudi brought transactions and change capture on top of data lakes, what is today referred to as the Lakehouse architecture. In this session, we first introduce Apache Hudi and the key technology gaps it fills in the modern data architecture. Bridging traditional data lakes and warehouses, Hudi helps realize the Lakehouse vision by bringing transactions, optimized table metadata, and powerful storage layout optimizations to data lakes, moving them closer to the cloud warehouses of today. Viewed from a data engineering lens, Hudi also plays a key unifying role between the batch and stream processing worlds by acting as a columnar, serverless "state store" for batch jobs, ushering in what we call the incremental processing model, where batch jobs can consume new data and update/delete intermediate results in a Hudi table, instead of re-computing/re-writing the entire output like old-school big batch jobs.
The rest of the talk focuses on a deep dive into some of the time-tested design choices and tradeoffs in Hudi that help power some of the largest transactional data lakes on the planet today. We will start with a tour of the storage format design, including data and metadata layouts and, of course, Hudi's timeline, an event log that is central to implementing ACID transactions and concurrency control. We will delve deeper into the practical concurrency control pitfalls in data lakes, and show how Hudi's hybrid approach, combining MVCC with optimistic concurrency control, lowers contention and unlocks minute-level near real-time commits to Hudi tables. We will conclude with code examples that showcase Hudi's rich set of table services that perform vital table management such as cleaning older file versions, compacting delta logs into base files, dynamic re-clustering for faster query performance, and the more recently introduced indexing service that maintains Hudi's multi-modal indexing capabilities.
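A minimal PySpark write sketch for a Hudi copy-on-write table; the input path, table name, key fields, and warehouse path are hypothetical, the option keys follow Hudi's documented datasource options, and the session is assumed to have the hudi-spark bundle and recommended serializer configured:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert").getOrCreate()

updates = spark.read.json("/data/incoming_rides.json")  # hypothetical change records

hudi_options = {
    "hoodie.table.name": "rides",
    "hoodie.datasource.write.recordkey.field": "ride_id",    # record key
    "hoodie.datasource.write.precombine.field": "event_ts",  # keep the latest version per key
    "hoodie.datasource.write.operation": "upsert",           # update/insert instead of rewrite
}

# Upsert into the Hudi table rather than re-computing the whole output.
(updates.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/warehouse/rides"))
```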
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout that addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
Building large scale transactional data lake using apache hudiBill Liu
Data is critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business-critical data pipelines at low latency and high efficiency; it helps distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what Apache Hudi is and its architectural design, and then dive deep into improving data operations through features such as data versioning and time travel.
We will also go over how Hudi brings the kappa architecture to big data systems and enables efficient incremental processing for near real-time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
The document provides an agenda and slides for a presentation on architectural considerations for data warehousing with Hadoop. The presentation discusses typical data warehouse architectures and challenges, how Hadoop can complement existing architectures, and provides an example use case of implementing a data warehouse with Hadoop using the Movielens dataset. Key aspects covered include ingestion of data from various sources using tools like Flume and Sqoop, data modeling and storage formats in Hadoop, processing the data using tools like Hive and Spark, and exporting results to a data warehouse.
Spark is an open-source cluster computing framework that uses in-memory processing to allow data sharing across jobs for faster iterative queries and interactive analytics. It uses Resilient Distributed Datasets (RDDs) that can survive failures through lineage tracking, and it supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
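A tiny PySpark sketch of the RDD model described above; the chain of transformations (textFile, flatMap, map, reduceByKey) is the lineage that lets Spark recompute lost partitions after a failure, and the input path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-lineage").getOrCreate()
sc = spark.sparkContext

# Each transformation records its parent in the lineage graph instead of materializing data.
lines = sc.textFile("/data/access.log")          # hypothetical input
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Caching keeps the result in memory for repeated interactive queries.
counts.cache()
print(counts.take(10))
print(counts.toDebugString())  # shows the recorded lineage
```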
Join operations in Apache Spark are often the biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames – down to the level of how Spark distributes the data within the cluster. You’ll also find out how to work around common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performant joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.
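A short sketch of the two basic join paths such a talk covers, using hypothetical DataFrames: the default shuffle-based sort-merge join versus an explicit broadcast hint for a small dimension table.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-strategies").getOrCreate()

facts = spark.read.parquet("/data/clicks.parquet")    # large fact table (hypothetical)
dims = spark.read.parquet("/data/countries.parquet")  # small dimension table (hypothetical)

# Default: both sides are shuffled on the join key (sort-merge join).
shuffled = facts.join(dims, "country_code")

# Hint: ship the small table to every executor and avoid shuffling the large one.
broadcasted = facts.join(broadcast(dims), "country_code")

# explain() shows which physical join strategy Spark chose for each plan.
shuffled.explain()
broadcasted.explain()
```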
Apache Spark is a fast and general engine for large-scale data processing. It provides a unified API for batch, interactive, and streaming data processing using in-memory primitives. A benchmark showed Spark was able to sort 100TB of data 3 times faster than Hadoop using 10 times fewer machines by keeping data in memory between jobs.
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
What if Looker saw the queries you just executed and could predict your next query? Could it make those queries faster, by smarter caching, or aggregate navigation? Could it read your past SQL queries and help you write your LookML model? Those are some of the reasons to add relational algebra into Looker’s query engine, and why Looker hired Julian Hyde, author of Apache Calcite, to lead the effort. In this talk about the internals of Looker’s query engine, Julian Hyde will describe how the engine works, how Looker queries are described in Calcite’s relational algebra, and some features that it makes possible.
A talk by Julian Hyde at JOIN 2019 in San Francisco.
The document compares the query execution plans produced by Apache Hive and PostgreSQL. It shows that Hive's old-style execution plans are overly verbose and difficult to understand, providing many low-level details across multiple stages. In contrast, PostgreSQL's plans are more concise and readable, showing the logical query plan in a top-down manner with actual table names and fewer lines of text. The document advocates for Hive to adopt a simpler execution plan format similar to PostgreSQL's.
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
This document discusses file system usage in HBase. It provides an overview of the three main file types in HBase: write-ahead logs (WALs), data files, and reference files. It describes durability semantics, IO fencing techniques for region server recovery, and how HBase leverages data locality through short circuit reads, checksums, and block placement hints. The document is intended to help understand HBase's interactions with HDFS for tuning IO performance.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
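A hedged sketch of the shuffle-related configuration options the document refers to; these are standard Spark settings, but defaults and availability vary by version, and the values shown are illustrative:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("shuffle-tuning")
    # Number of partitions used by Spark SQL shuffles (default 200).
    .config("spark.sql.shuffle.partitions", "400")
    # Compress map outputs written to local disk during the shuffle.
    .config("spark.shuffle.compress", "true")
    # Serve shuffle files from the external shuffle service so executors can be removed.
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate())
```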
Impala is an open source SQL query engine for Apache Hadoop that allows real-time queries on large datasets stored in HDFS and other data stores. It uses a distributed architecture where an Impala daemon runs on each node and coordinates query planning and execution across nodes. Impala allows SQL queries to be run directly against files stored in HDFS and other formats like Avro and Parquet. It aims to provide high performance for both analytical and transactional workloads through its C++ implementation and avoidance of MapReduce.
Impala is a SQL query engine for Apache Hadoop that allows real-time queries on large datasets. It is designed to provide high performance for both analytical and transactional workloads by running directly on Hadoop clusters and utilizing C++ code generation and in-memory processing. Impala uses the existing Hadoop ecosystem including metadata storage in Hive and data formats like Avro, but provides faster performance through its new query execution engine compared to traditional MapReduce-based systems like Hive. Future development of Impala will focus on improved support for features like HBase, additional SQL functionality, and query optimization.
Cloudera Impala: A Modern SQL Engine for HadoopCloudera, Inc.
Cloudera Impala is a modern SQL query engine for Apache Hadoop that provides high performance for both analytical and transactional workloads. It runs directly within Hadoop clusters, reading common Hadoop file formats and communicating with Hadoop storage systems. Impala uses a C++ implementation and runtime code generation for high performance compared to other Hadoop SQL query engines like Hive that use Java and MapReduce.
Cloudera Impala: A Modern SQL Engine for Apache HadoopCloudera, Inc.
Impala is a massively parallel processing SQL query engine for Apache Hadoop. It allows real-time queries on large datasets using existing SQL skills. Impala's architecture includes impalad daemons that process queries in parallel across nodes, a statestore for metadata coordination, and a new execution engine written in C++. It aims to provide faster performance than Hive for interactive queries while leveraging Hadoop's existing ecosystem. The first general availability release is planned for April 2013.
Presentation on Cloudera Impala at the PDX Data Science Group in Portland. Delivered on February 27, 2013. Presentation slides borrowed from Cloudera Impala's architect, Marcel Kornacker.
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
Maxime Dumas gives a presentation on Cloudera Impala, which provides fast SQL query capability for Apache Hadoop. Impala allows for interactive queries on Hadoop data in seconds rather than minutes by using a native MPP query engine instead of MapReduce. It offers benefits like SQL support, performance improvements of 3-4x (and up to 90x) over MapReduce, and the flexibility to query existing Hadoop data without needing to migrate or duplicate it. The latest release of Impala 2.0 includes new features like window functions, subqueries, and spilling joins and aggregations to disk when memory is exhausted.
Impala is a SQL query engine for Apache Hadoop that allows for interactive queries on large datasets. It uses a distributed architecture where each node runs an Impala daemon and queries are distributed across nodes. Impala aims to provide general-purpose SQL with high performance by using C++ instead of Java and avoiding MapReduce execution. It runs directly on Hadoop storage systems and supports common file formats like Parquet and Avro.
Impala is a massively parallel processing SQL query engine for Hadoop. It allows users to issue SQL queries directly to their data in Apache Hadoop. Impala uses a distributed architecture where queries are executed in parallel across nodes by Impala daemons. It uses a new execution engine written in C++ with runtime code generation for high performance. Impala also supports commonly used Hadoop file formats and can query data stored in HDFS and HBase.
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of an Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
Bay Area Impala User Group Meetup (Sept 16 2014)Cloudera, Inc.
The document discusses Impala releases and roadmaps. It outlines key features released in different Impala versions, including SQL capabilities, performance improvements, and support for additional file formats and data types. It also describes Impala's performance advantages compared to other SQL-on-Hadoop systems and how its approach is expected to increasingly favor performance gains. Lastly, it encourages trying out Impala and engaging with their community.
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
This document summarizes a meetup about Big Data and SQL on Hadoop. The meetup included discussions on what Hadoop is, why SQL on Hadoop is useful, what Hive is, and introduced IBM's BigInsights software for running SQL on Hadoop with improved performance over other solutions. Key topics included HDFS file storage, MapReduce processing, Hive tables and metadata storage, and how BigInsights provides a massively parallel SQL engine instead of relying on MapReduce.
Impala is a massively parallel processing SQL query engine for Apache Hadoop. It allows real-time queries on large datasets by using a new execution engine written in C++ instead of Java and MapReduce. Impala can process queries in milliseconds to hours by distributing query execution across Hadoop clusters. It uses existing Hadoop file formats and metadata but is optimized for performance through techniques like runtime code generation and in-memory processing.
Impala is an open-source SQL query engine for Apache Hadoop that allows for fast, interactive queries directly against data stored in HDFS and other data storage systems. It provides low-latency queries in seconds by using a custom query engine instead of MapReduce. Impala allows users to interact with data using standard SQL and business intelligence tools while leveraging existing metadata in Hadoop. It is designed to be integrated with the Hadoop ecosystem for distributed, fault-tolerant and scalable data processing and analytics.
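For completeness, a small client-side sketch using the impyla library against a hypothetical Impala daemon; the host, port, and table are assumptions (21050 is Impala's usual HiveServer2-protocol port), and queries are planned and executed by the Impala daemons directly rather than through MapReduce:

```python
from impala.dbapi import connect

# Connect to a hypothetical impalad endpoint.
conn = connect(host="impalad.example.com", port=21050)
cursor = conn.cursor()

# Standard DB-API usage: issue SQL and fetch results.
cursor.execute("SELECT day, COUNT(*) FROM web_logs GROUP BY day ORDER BY day")
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()
```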
A talk given by Ted Dunning on February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Introduction to Apache Apex and writing a big data streaming application Apache Apex
Introduction to Apache Apex - The next generation native Hadoop platform, and writing a native Hadoop big data Apache Apex streaming application.
This talk will cover details about how Apex can be used as a powerful and versatile platform for big data. Apache Apex is being used in production by customers for both streaming and batch use cases. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch, alerts, real-time actions, threat detection, etc.
Presenter: Pramod Immaneni, Apache Apex PPMC member and senior architect at DataTorrent Inc, where he works on Apex and specializes in big data applications. Prior to DataTorrent he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in the core networking space and was granted patents in peer-to-peer VPNs. Before that he was a technical co-founder of a mobile startup where he was an architect of a dynamic content rendering engine for mobile devices.
This is a video of the webcast of an Apache Apex meetup event organized by Guru Virtues at 267 Boston Rd no. 9, North Billerica, MA, on May 7th 2016, and broadcast from San Jose, CA. If you are interested in helping organize the Apache Apex community (i.e., hosting, presenting, community leadership), please email [email protected]
Customer Education Webcast: New Features in Data Integration and Streaming CDCPrecisely
View our quarterly customer education webcast to learn about the new advancements in Syncsort DMX and DMX-h data integration software and DataFunnel - our new easy-to-use browser-based database onboarding application. Learn about DMX Change Data Capture and the advantages of true streaming over micro-batch.
View this webcast on-demand where you'll hear the latest news on:
• Improvements in Syncsort DMX and DMX-h
• What’s next in the new DataFunnel interface
• Streaming data in DMX Change Data Capture
• Hadoop 3 support in Syncsort Integrate products
Architecting a next-generation data platformhadooparchbook
This document discusses a high-level architecture for analyzing taxi trip data in real-time and batch using Apache Hadoop and streaming technologies. The architecture includes ingesting data from multiple sources using Kafka, processing streaming data using stream processing engines, storing data in data stores like HDFS, and enabling real-time and batch querying and analytics. Key considerations discussed are choosing data transport and stream processing technologies, scaling and reliability, and processing both streaming and batch data.
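As a sketch of the ingest leg of such an architecture, the snippet below reads a hypothetical "taxi-trips" topic from Kafka with Spark Structured Streaming and lands the raw events on HDFS; the broker addresses, topic, and paths are assumptions, and the Spark-Kafka connector package must be on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("taxi-ingest").getOrCreate()

# Continuously read trip events from Kafka.
trips = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "taxi-trips")
    .load())

# Kafka delivers key/value as binary; cast the value to a string for downstream parsing.
events = trips.select(col("value").cast("string").alias("json"))

# Land raw events on HDFS; the checkpoint directory lets the query resume after failures.
query = (events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/taxi_trips")
    .option("checkpointLocation", "hdfs:///checkpoints/taxi_trips")
    .trigger(processingTime="1 minute")
    .start())

query.awaitTermination()
```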
Top 5 mistakes when writing Streaming applicationshadooparchbook
This document discusses 5 common mistakes when writing streaming applications and provides solutions. It covers: 1) Not shutting down apps gracefully; the fix is to use thread hooks or external markers to stop processing after batches finish. 2) Assuming exactly-once semantics when things can fail at multiple points, which requires tracking offsets and using idempotent operations. 3) Using streaming for everything when batch processing is better for some goals. 4) Not preventing data loss; enable checkpointing and write-ahead logs. 5) Not monitoring jobs; use tools like the Spark Streaming UI and Graphite, and YARN cluster mode for automatic restarts.
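A hedged sketch of the "external marker" shutdown pattern mentioned in point 1, applied to Structured Streaming; the marker path is hypothetical and `query` is assumed to be an already-started StreamingQuery handle (such as the one in the previous ingest sketch):

```python
import os
import time

MARKER = "/tmp/stop-streaming-job"  # hypothetical marker file an operator can touch

def stop_gracefully(query, poll_seconds=30):
    """Poll for a marker file and stop the query from the driver between checks."""
    while query.isActive:
        if os.path.exists(MARKER):
            # stop() shuts the query down cleanly from the driver instead of
            # killing the process mid-write.
            query.stop()
            break
        time.sleep(poll_seconds)
```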
Architecting a Next Generation Data Platformhadooparchbook
This document discusses a presentation on architecting Hadoop application architectures for a next generation data platform. It provides an overview of the presentation topics which include a case study on using Hadoop for an Internet of Things and entity 360 application. It introduces the key components of the proposed high level architecture including ingesting streaming and batch data using Kafka and Flume, stream processing with Kafka streams and storage in Hadoop.
What no one tells you about writing a streaming apphadooparchbook
This document discusses 5 things that are often not addressed when writing streaming applications:
1. Managing and monitoring long-running streaming jobs can be challenging as frameworks were not originally designed for streaming workloads. Options include using cluster mode to ensure jobs continue if clients disconnect and leveraging monitoring tools to track metrics.
2. Preventing data loss requires different approaches depending on the data source. File and receiver-based sources benefit from checkpointing while Kafka's commit log ensures data is not lost.
3. Spark Streaming is well-suited for tasks involving windowing, aggregations, and machine learning but may not be needed for all streaming use cases.
4. Achieving exactly-once semantics requires techniques
Architecting next generation big data platformhadooparchbook
A tutorial on architecting a next-generation big data platform by the authors of O'Reilly's Hadoop Application Architectures book. This tutorial discusses how to build a customer 360 (or entity 360) big data application.
Audience: Technical.
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
Hadoop application architectures - using Customer 360 (more generally, Entity 360) as an example. By Ted Malaska, Jonathan Seidman and Mark Grover at Strata + Hadoop World 2016 in NYC.
Top 5 mistakes when writing Spark applicationshadooparchbook
This document discusses common mistakes made when writing Spark applications and provides recommendations to address them. It covers issues like having executors that are too small or large, shuffle blocks exceeding size limits, data skew slowing jobs, and excessive stages. The key recommendations are to optimize executor and partition sizes, increase partitions to reduce skew, use techniques like salting to address skew, and favor transformations like ReduceByKey over GroupByKey to minimize shuffles and memory usage.
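A sketch of the salting technique mentioned above for a join that is skewed on one key; the input paths, column names, and salt factor are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, floor, lit, rand, sequence

spark = SparkSession.builder.appName("salting-demo").getOrCreate()

facts = spark.read.parquet("/data/clicks.parquet")  # skewed on user_id (hypothetical)
dims = spark.read.parquet("/data/users.parquet")    # one row per user_id (hypothetical)

SALT = 8  # spread each hot key across 8 sub-keys

# Add a random salt to the skewed side so one hot key becomes SALT smaller keys.
facts_salted = facts.withColumn("salt", floor(rand() * SALT).cast("int"))

# Replicate the small side once per salt value so every salted key still finds a match.
dims_salted = dims.withColumn("salt", explode(sequence(lit(0), lit(SALT - 1))))

# Joining on (user_id, salt) splits the hot key's rows across many tasks.
joined = facts_salted.join(dims_salted, ["user_id", "salt"])
```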
The document discusses best practices for streaming applications. It covers common streaming use cases like ingestion, transformations, and counting. It also discusses advanced streaming use cases that involve machine learning. The document provides an overview of streaming architectures and compares different streaming engines like Spark Streaming, Flink, Storm, and Kafka Streams. It discusses when to use different storage systems and message brokers like Kafka for ingestion pipelines. The goal is to understand common streaming use cases and their architectures.
This document discusses a presentation on fraud detection application architectures using Hadoop. It provides an overview of different fraud use cases and challenges in implementing Hadoop-based solutions. Requirements for the applications include handling high volumes, velocities and varieties of data, generating real-time alerts with low latency, and performing both stream and batch processing. A high-level architecture is proposed using Hadoop, HBase, HDFS, Kafka and Spark to meet the requirements. Storage layer choices and considerations are also discussed.
Building a fraud detection application using the tools in the Hadoop ecosystem. Presentation given by authors of O'Reilly's Hadoop Application Architectures book at Strata + Hadoop World in San Jose, CA 2016.
Top 5 mistakes when writing Spark applicationshadooparchbook
This document discusses common mistakes people make when writing Spark applications and provides recommendations to address them. It covers issues related to executor configuration, application failures due to shuffle block sizes exceeding limits, slow jobs caused by data skew, and managing the DAG to avoid excessive shuffles and stages. Recommendations include using smaller executors, increasing the number of partitions, addressing skew through techniques like salting, and preferring ReduceByKey over GroupByKey and TreeReduce over Reduce to improve performance and resource usage.
This document discusses a case study on fraud detection using Hadoop. It begins with an overview of fraud detection requirements, including the need for real-time and near real-time processing of large volumes and varieties of data. It then covers considerations for the system architecture, including using HDFS and HBase for storage, Kafka for ingestion, and Spark and Storm for stream and batch processing. Data modeling with HBase and caching options are also discussed.
Architecting application with Hadoop - using clickstream analytics as an examplehadooparchbook
Delivered by Mark Grover at Northern CO Hadoop User Group:
http://www.meetup.com/Northern-Colorado-Big-Data-Meetup/events/224717963/
Architecting applications with Hadoop - Fraud Detectionhadooparchbook
This document discusses architectures for fraud detection applications using Hadoop. It provides an overview of requirements for such an application, including the need for real-time alerts and batch processing. It proposes using Kafka for ingestion due to its high throughput and partitioning. HBase and HDFS would be used for storage, with HBase better supporting random access for profiles. The document outlines using Flume, Spark Streaming, and HBase for near real-time processing and alerting on incoming events. Batch processing would use HDFS, Impala, and Spark. Caching profiles in memory is also suggested to improve performance.
The document discusses real-time fraud detection patterns and architectures. It provides an overview of key technologies like Kafka, Flume, and Spark Streaming used for real-time event processing. It then describes a high-level architecture involving ingesting events through Flume and Kafka into Spark Streaming for real-time processing, with results stored in HBase, HDFS, and Solr. The document also covers partitioning strategies, micro-batching, complex topologies, and ingestion of real-time and batch data.
The document discusses architectural considerations for Hadoop applications based on a case study of clickstream analysis. It covers requirements for data ingestion, storage, processing, and orchestration. For data storage, it recommends storing raw clickstream data in HDFS using the Avro file format with Snappy compression. For processed data, it recommends using the Parquet columnar storage format to enable efficient analytical queries. The document also discusses partitioning strategies and HDFS directory layout design.
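A short sketch of the processed-data layout the document recommends: read the raw Avro/Snappy clickstream and rewrite it as columnar Parquet partitioned by date; the paths and column name are hypothetical, and reading Avro assumes the spark-avro package is available:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clickstream-layout").getOrCreate()

# Raw clickstream previously landed as Avro with Snappy compression (hypothetical path).
raw = spark.read.format("avro").load("/data/raw/clickstream")

# Processed layout: columnar Parquet, partitioned by event date for partition pruning.
(raw.write
    .partitionBy("event_date")
    .option("compression", "snappy")
    .mode("overwrite")
    .parquet("/data/processed/clickstream"))
```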
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
This document outlines a presentation on architectural considerations for Hadoop applications. It introduces the presenters who are experts from Cloudera and contributors to Apache Hadoop projects. It then discusses a case study on clickstream analysis, how this was challenging before Hadoop due to data storage limitations, and how Hadoop provides a better solution by enabling active archiving of large volumes and varieties of data at scale. Finally, it covers some of the challenges in implementing Hadoop, such as choices around storage managers, data modeling and file formats, data movement workflows, metadata management, and data access and processing frameworks.
Architectural considerations for Hadoop Applicationshadooparchbook
The document discusses architectural considerations for Hadoop applications using a case study on clickstream analysis. It covers requirements for data ingestion, storage, processing, and orchestration. For data storage, it considers HDFS vs HBase, file formats, and compression formats. SequenceFiles are identified as a good choice for raw data storage as they allow for splittable compression.
Impala Architecture presentation
1. Impala: A Modern, Open-Source SQL Engine for Hadoop
Mark Grover
Software Engineer, Cloudera
January 7th, 2014
Twitter: mark_grover
github.com/markgrover/impala-thug/
2. Agenda
• What is Impala and what is Hadoop?
• Execution frameworks on Hadoop – MR, Hive, etc.
• Goals and user view of Impala
• Architecture of Impala
• Comparing Impala to other systems
• Impala Roadmap
4. What is Apache Hadoop?
Apache Hadoop is an open source platform for data storage and processing that is…
✓ Distributed
✓ Fault tolerant
✓ Scalable
CORE HADOOP SYSTEM COMPONENTS
• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
• MapReduce: distributed computing framework
Has the flexibility to store and mine any type of data
• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema
Excels at processing complex data
• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks
Scales economically
• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
5. So, what's wrong with MapReduce?
• Batch oriented
• High latency
• Not all paradigms fit very well
• Only for developers
6. What are Hive and Pig?
• MR is hard and only for developers
• Higher-level platforms for converting declarative syntax to MapReduce
  • SQL – Hive
  • workflow language – Pig
• Built on top of MapReduce
7. What is Impala?
• General-purpose SQL engine
• Real-time queries in Apache Hadoop
• Beta version out since October 2012
• General availability (v1.0) release out since April 2013
• Open source under the Apache license
• Latest release (v1.2.3) released on December 23rd
8. Impala Overview: Goals
• General-purpose SQL query engine:
  • works for both analytical and transactional/single-row workloads
  • supports queries that take from milliseconds to hours
• Runs directly within Hadoop:
  • reads widely used Hadoop file formats
  • talks to widely used Hadoop storage managers
  • runs on the same nodes that run Hadoop processes
• High performance:
  • C++ instead of Java
  • runtime code generation
  • completely new execution engine – no MapReduce
9. User View of Impala: Overview
• Runs as a distributed service in the cluster: one Impala daemon on each node with data
• Highly available: no single point of failure
• User submits query via ODBC/JDBC, the Impala CLI, or Hue to any of the daemons
• Query is distributed to all nodes with relevant data
• Impala uses Hive's metadata interface, connects to the Hive metastore
10. User View of Impala: Overview
• There is no ‘Impala format’!
• There is no ‘Impala format’!!
• Supported file formats:
  • uncompressed/LZO-compressed text files
  • SequenceFiles and RCFile with snappy/gzip compression
  • Avro data files
  • Parquet columnar format (more on that later)
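For illustration only (not part of the original deck), a minimal sketch of exposing existing files in one of these formats to Impala. The table and the HDFS path /data/logs are hypothetical; the DDL uses standard Impala/Hive syntax for comma-delimited text files:

  CREATE EXTERNAL TABLE logs (
    ts BIGINT,
    user_id STRING,
    url STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/logs';

  -- query it like any other table
  SELECT url, COUNT(*) AS hits
  FROM logs
  GROUP BY url
  ORDER BY hits DESC
  LIMIT 10;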
11. User View of Impala: SQL
• SQL support:
  • essentially SQL-92, minus correlated subqueries
  • INSERT INTO … SELECT …
  • only equi-joins; no non-equi joins, no cross products
  • ORDER BY requires LIMIT
  • (limited) DDL support
  • SQL-style authorization via Apache Sentry (incubating)
  • UDFs and UDAFs are supported
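As a hedged illustration of this subset (not from the slides; the tables orders, customers, and top_customers are hypothetical):

  -- equi-join; ORDER BY must be accompanied by LIMIT in this release
  SELECT c.name, SUM(o.amount) AS total
  FROM orders o
  JOIN customers c ON (o.cust_id = c.id)
  GROUP BY c.name
  ORDER BY total DESC
  LIMIT 20;

  -- INSERT INTO … SELECT … into an existing table with a matching schema
  INSERT INTO top_customers
  SELECT c.id, c.name
  FROM customers c
  JOIN orders o ON (o.cust_id = c.id)
  GROUP BY c.id, c.name;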
12. User View of Impala: SQL
• Functional limitations:
  • no custom file formats or SerDes
  • no "beyond SQL" features (buckets, samples, transforms, arrays, structs, maps, xpath, json)
• Broadcast joins and partitioned hash joins supported
  • the smaller table has to fit in the aggregate memory of all executing nodes
13. User View of Impala: HBase
• Functionality highlights:
  • support for SELECT, INSERT INTO … SELECT …, and INSERT INTO … VALUES(…)
  • predicates on rowkey columns are mapped into start/stop rows
  • predicates on other columns are mapped into SingleColumnValueFilters
• But: mapping of HBase tables and metastore table patterned after Hive
  • all data stored as scalars and in ASCII
  • the rowkey needs to be mapped into a single string column
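For context, a sketch of the Hive-patterned mapping described above; table and column names are hypothetical. The HBase table is registered in the Hive metastore using Hive's HBase storage handler (DDL run through Hive), with the rowkey mapped to a single string column; Impala can then query it, and the rowkey predicate is turned into an HBase start/stop row scan:

  -- run in Hive: map the HBase table 'users' into the metastore
  CREATE EXTERNAL TABLE hbase_users (
    rowkey STRING,
    name   STRING,
    state  STRING
  )
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name,info:state')
  TBLPROPERTIES ('hbase.table.name' = 'users');

  -- run in Impala: the rowkey predicate maps to a start/stop row
  SELECT name, state FROM hbase_users WHERE rowkey = 'user_00042';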
14. User View of Impala: HBase
• Roadmap:
  • full support for UPDATE and DELETE
  • storage of structured data to minimize storage and access overhead
  • composite row key encoding, mapped into an arbitrary number of table columns
15. Impala Architecture
• Three binaries: impalad, statestored, catalogd
• Impala daemon (impalad) – N instances
  • handles client requests and all internal requests related to query execution
• State store daemon (statestored) – 1 instance
  • provides name service and metadata distribution
• Catalog daemon (catalogd) – 1 instance
  • relays metadata changes to all impalad's
16. Impala Architecture
• Query execution phases:
  • request arrives via ODBC/JDBC
  • planner turns request into collections of plan fragments
  • coordinator initiates execution on remote impalad's
17. Impala Architecture
• During execution:
  • intermediate results are streamed between executors
  • query results are streamed back to the client
  • subject to limitations imposed by blocking operators (top-N, aggregation)
19. Impala Architecture: Query Execution
• Planner turns request into collections of plan fragments
• Coordinator initiates execution on remote impalad's
[Diagram: a SQL app submits a query over ODBC to the Query Planner of one impalad; that node's Query Coordinator fans the plan fragments out to the Query Executors of the other impalad's, each running alongside an HDFS DataNode and HBase; the coordinator also consults the Hive metastore, the HDFS NameNode, and the Statestore + Catalogd.]
20. Impala Architecture: Query Execution
• Intermediate results are streamed between impalad's
• Query results are streamed back to the client
[Diagram: same cluster layout as the previous slide (Query Planner / Coordinator / Executor on each node with an HDFS DataNode and HBase, plus Hive metastore, HDFS NameNode, Statestore + Catalogd); the executors exchange intermediate results and the coordinating node streams the final query results back to the ODBC client.]
21. Query Planning: Overview
• 2-phase planning process:
  • single-node plan: left-deep tree of plan operators
  • plan partitioning: partition the single-node plan to maximize scan locality, minimize data movement
• Parallelization of operators:
  • all query operators are fully distributed
23. Single-Node Plan: Example Query
SELECT t1.custid, SUM(t2.revenue) AS revenue
FROM LargeHdfsTable t1
JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id)
JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online'
GROUP BY t1.custid
ORDER BY revenue DESC LIMIT 10;
24. Query Planning: Single-Node Plan
• Single-node plan for the example:
[Plan tree: TopN over Agg over HashJoin(HashJoin(Scan: t1, Scan: t2), Scan: t3)]
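To see how the planner fragments a statement, EXPLAIN can be run on the example query; this is a usage note rather than original slide content, and the exact output depends on table statistics and the Impala release:

  EXPLAIN
  SELECT t1.custid, SUM(t2.revenue) AS revenue
  FROM LargeHdfsTable t1
  JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id)
  JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
  WHERE t3.category = 'Online'
  GROUP BY t1.custid
  ORDER BY revenue DESC LIMIT 10;

The output lists the plan fragments and exchange nodes that the next slides walk through (partitioned vs. broadcast joins, pre- and merge aggregation, top-N).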
25. Query Planning: Distributed Plans
• Goals:
  o maximize scan locality, minimize data movement
  o full distribution of all query operators (where semantically correct)
• Parallel joins:
  o broadcast join: join is collocated with the left input; the right-hand side table is broadcast to each node executing the join -> preferred for small right-hand side input
  o partitioned join: both tables are hash-partitioned on the join columns -> preferred for large joins
  o cost-based decision based on column stats/estimated cost of data transfers
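Two practical notes, offered as a hedged aside rather than slide content: the cost-based choice depends on statistics, which can be gathered with COMPUTE STATS (available as of Impala 1.2.2), and the planner's join strategy can be overridden with [BROADCAST]/[SHUFFLE] hints; exact hint syntax may vary by release. Table names are the ones from the example query:

  COMPUTE STATS LargeHdfsTable;

  SELECT t1.custid, SUM(t2.revenue) AS revenue
  FROM LargeHdfsTable t1
  JOIN [SHUFFLE] LargeHdfsTable t2 ON (t1.id1 = t2.id)
  GROUP BY t1.custid
  ORDER BY revenue DESC LIMIT 10;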
26. Query Planning: Distributed Plans
• Parallel aggregation:
  o pre-aggregation where data is first materialized
  o merge aggregation partitioned by grouping columns
• Parallel top-N:
  o initial top-N operation where data is first materialized
  o final top-N in a single-node plan fragment
27. Query Planning: Distributed Plans
• In the example:
  o scans are local: each scan receives its own fragment
  o 1st join: large x large -> partitioned join
  o 2nd join: large x small -> broadcast join
  o pre-aggregation in the fragment that materializes the join result
  o merge aggregation after repartitioning on the grouping column
  o initial top-N in the fragment that does the merge aggregation
  o final top-N in the coordinator fragment
28. Query Planning: Distributed Plans
[Distributed plan diagram for the example: local scans of t1 and t2 at the HDFS DataNodes feed a partitioned HashJoin (hash exchange on t1.id1 and t2.id); the scan of t3 at the HBase RegionServer is broadcast into a second HashJoin; pre-aggregation runs in that join fragment; merge aggregation and the initial TopN run after a hash repartition on t1.custid; the final TopN runs in the coordinator fragment.]
29. Metadata Handling
• Impala metadata:
  o Hive's metastore: logical metadata (table definitions, columns, CREATE TABLE parameters)
  o HDFS NameNode: directory contents and block replica locations
  o HDFS DataNode: block replicas' volume ids
30. Metadata Handling
• Caches metadata: no synchronous metastore API calls during query execution
• impalad instances read metadata from the metastore at startup
• Catalog Service relays metadata changes when you run DDL or update metadata on one of the impalad's
• REFRESH [<tbl>]: reloads metadata on all impalad's (if you added new files via Hive)
• INVALIDATE METADATA: reloads metadata for all tables
• Roadmap: HCatalog
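Usage sketch for the two statements above (the table name sales is hypothetical):

  -- after adding new data files to an existing table outside Impala (e.g., via Hive)
  REFRESH sales;

  -- after creating or dropping tables, or changing schemas, outside Impala
  INVALIDATE METADATA;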
31. Impala Execution Engine
• Written in C++ for minimal execution overhead
• Internal in-memory tuple format puts fixed-width data at fixed offsets
• Uses intrinsics/special CPU instructions for text parsing, CRC32 computation, etc.
• Runtime code generation for "big loops"
32. Impala Execution Engine
• More on runtime code generation
  • example of a "big loop": insert a batch of rows into a hash table
  • known at query compile time: # of tuples in a batch, tuple layout, column types, etc.
  • generated at compile time: an unrolled loop that inlines all function calls, contains no dead code, and minimizes branches
  • code generated using LLVM
33. Impala's Statestore
• Central system state repository
  • name service (membership)
  • metadata
  • Roadmap: other scheduling-relevant or diagnostic state
• Soft-state
  • all data can be reconstructed from the rest of the system
  • cluster continues to function when the statestore fails, but per-node state becomes increasingly stale
• Sends periodic heartbeats
  • pushes new data
  • checks for liveness
34. Statestore: Why not ZooKeeper?
• ZK is not a good pub-sub system
  • Watch API is awkward and requires a lot of client logic
  • multiple round-trips required to get data for changes to a node's children
  • push model is more natural for our use case
• Don't need all the guarantees ZK provides:
  • serializability
  • persistence
  • prefer to avoid complexity where possible
• ZK is bad at the things we care about and good at the things we don't
35. Comparing Impala to Dremel
• What is Dremel?
  • columnar storage for data with nested structures
  • distributed scalable aggregation on top of that
• Columnar storage in Hadoop: Parquet
  • stores data in appropriate native/binary types
  • can also store nested structures similar to Dremel's ColumnIO
• Distributed aggregation: Impala
• Impala plus Parquet: a superset of the published version of Dremel (which didn't support joins)
36. More about Parquet
• What is it:
  • container format for all popular serialization formats: Avro, Thrift, Protocol Buffers
  • successor to Trevni
  • jointly developed between Cloudera and Twitter
  • open source; hosted on GitHub
• Features
  • rowgroup format: file contains multiple horizontal slices
  • supports storing each column in a separate file
  • supports fully shredded nested data; repetition and definition levels similar to Dremel's ColumnIO
  • column values stored in native types (bool, int<x>, float, double, byte array)
  • support for index pages for fast lookup
  • extensible value encodings
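As a concrete usage sketch (not from the slides), an existing text-format table can be converted to Parquet from Impala; this release used the keyword PARQUETFILE, and later releases also accept STORED AS PARQUET. Table names are hypothetical:

  CREATE TABLE logs_parquet LIKE logs STORED AS PARQUETFILE;

  INSERT OVERWRITE TABLE logs_parquet SELECT * FROM logs;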
37. Comparing Impala to Hive
• Hive: MapReduce as an execution engine
  • high-latency, low-throughput queries
  • fault-tolerance model based on MapReduce's on-disk checkpointing; materializes all intermediate results
  • Java runtime allows for easy late binding of functionality: file formats and UDFs
  • extensive layering imposes high runtime overhead
• Impala:
  • direct, process-to-process data exchange
  • no fault tolerance
  • an execution engine designed for low runtime overhead
38. Impala Roadmap: 2013
• Additional SQL:
  • ORDER BY without LIMIT
  • analytic window functions
  • support for structured data types
• Improved HBase support:
  • composite keys, complex types in columns, index nested-loop joins, INSERT/UPDATE/DELETE
39. Impala Roadmap: 2013
• Runtime optimizations:
  • straggler handling
  • improved cache management
  • data collocation for improved join performance
• Resource management:
  • goal: run exploratory and production workloads in the same cluster, against the same data, without impacting production jobs
40. Demo
• Uses Cloudera's Quickstart VM: http://tiny.cloudera.com/quick-start
41. Try it out!
• Open source! Available at cloudera.com, AWS EMR!
• We have packages for:
  • RHEL 5 and 6, SLES 11, Ubuntu Lucid, Maverick, Precise, Debian, etc.
• Questions/comments? community.cloudera.com
• My Twitter handle: mark_grover
• Slides at: github.com/markgrover/impala-thug