HBase and HDFS: Past, Present, Future
                 Todd Lipcon
              todd@cloudera.com
Twitter: @tlipcon      #hbase IRC: tlipcon




              May 22, 2012
Intro / who am I?
        Been working on data stuff for a few years
        HBase, HDFS, MR committer
        Cloudera engineer since March ’09




         [Charts: (a) My posts to hbase-dev; (b) My posts to (core|hdfs|mapreduce)-dev]


  You know I’m an engineer since my slides are ugly and written in LaTeX
Framework for discussion
     Time periods
         Past (Hadoop pre-1.0)
         Present (Hadoop 1.x, 2.0)
         Future (Hadoop 2.x and later)

     Categories
         Reliability/Availability
         Performance
         Feature set
HDFS and HBase History - 2006
  Author: Douglass Cutting <cutting@apache.org>
  Date:   Fri Jan 27 22:19:42 2006 +0000

      Create hadoop sub-project.
HDFS and HBase History - 2007
  Author: Douglass Cutting <cutting@apache.org>
  Date:   Tue Apr 3 20:34:28 2007 +0000

      HADOOP-1045. Add contrib/hbase, a
      BigTable-like online database.
HDFS and HBase History - 2008
  Author: Jim Kellerman <jimk@apache.org>
  Date:   Tue Feb 5 02:36:26 2008 +0000

      2008/02/04 HBase is now a subproject of Hadoop.
      The first HBase release as a subproject will be
      release 0.1.0 which will be equivalent to the
      version of HBase included in Hadoop 0.16.0...
HDFS and HBase History - Early 2010
  HBase has been around for 3 years, but HDFS still
  acts like MapReduce is the only important client!




          People have accused HDFS of being like a molasses train:
                      high throughput but not so fast
HDFS and HBase History - 2010
     HBase becomes a top-level project
     Facebook chooses HBase for Messages product
     Jump from HBase 0.20 to HBase 0.89 and 0.90
     First CDH3 betas include HBase
     HDFS community starts to work on features
     for HBase.
         Infamous hadoop-0.20-append branch
What did we get done?
And where are we going?
Reliability in the past: Hadoop 1.0
     Pre-1.0, if the DN crashed, HBase would lose
     its WALs (and your beloved data).
         1.0 integrated hadoop-0.20-append branch into
         a main-line release
         True durability support for HBase
         We have a fighting chance at metadata reliability!

     Numerous bug fixes for write pipeline recovery
     and other error paths
         HBase is not nearly so forgiving as MapReduce!
         “Single-writer” fault tolerance vs “job-level” fault
         tolerance
Reliability in the past: Hadoop 1.0
     Pre-1.0: if any disk failed, entire DN would go
     offline
         Problematic for HBase: local RS would lose all
         locality!
         1.0: per-disk failure detection in DN
         (HDFS-457)
         Allows HBase to lose a disk without losing all
         locality
  Tip: Configure
  dfs.datanode.failed.volumes.tolerated = 1
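
  As a sketch, the same tip in its usual hdfs-site.xml form on each
  DataNode (property name as above; a value of 1 tolerates a single
  dead disk):

      <!-- hdfs-site.xml: keep the DataNode up after one failed data volume -->
      <property>
        <name>dfs.datanode.failed.volumes.tolerated</name>
        <value>1</value>
      </property>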
Reliability today: Hadoop 2.0
     Integrates Highly Available HDFS
     Active-standby hot failover removes SPOF
     Transparent to clients: no HBase changes
     necessary
     Tested extensively under HBase read/write
     workloads
     Coupled with HBase master failover, no more
     HBase SPOF!
HDFS HA [architecture diagram]
Reliability in the future: HA in 2.x
      Remove dependency on NFS (HDFS-3077)
          Quorum-commit protocol for NameNode edit logs
          Similar to ZAB/Multi-Paxos

      Automatic failover for HA NameNodes
      (HDFS-3042)
          ZooKeeper-based master election, just like HBase
          Merge to trunk should be this week.
Other reliability work for HDFS 2.x
     2.0: current hflush() API only guarantees
     data is replicated to three machines – not fully
     on disk.
     A cluster-wide power outage can lose data.
         Upcoming in 2.x: Support for hsync()
         (HDFS-744, HBASE-5954)
         Calls fsync() for all replicas of the WAL
         Full durability of edits, even with full cluster
         power outages
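
  A minimal Java sketch of the two durability levels, assuming a 2.x
  client where FSDataOutputStream exposes hsync() (HDFS-744); the path
  and payload below are made up for illustration:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.*;

      // Durability sketch: hflush() vs hsync() on a WAL-like file.
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      FSDataOutputStream wal = fs.create(new Path("/tmp/wal.0001")); // hypothetical path
      byte[] edit = "row1/cf:col=val".getBytes();  // stand-in for a serialized WAL entry
      wal.write(edit);
      wal.hflush();  // replicated to all DNs and visible to readers,
                     // but replicas may still sit in OS buffers
      wal.hsync();   // additionally fsync()s the block file on each
                     // replica: survives a full-cluster power outage
      wal.close();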
hflush() and hsync() [diagram]
HDFS wire compatibility in Hadoop 2.0
     In 1.0: HDFS client version must match server
     version closely.
     How many of you have manually copied HDFS
     client jars?
     Client-server compatibility in 2.0:
         Protobuf-based RPC
         Easier HBase installs: no more futzing with jars
         Separate HBase upgrades from HDFS
         upgrades
     Intra-cluster server compatibility in the works
         Allow for rolling upgrade without downtime
Performance: Hadoop 1.0
     Pre-1.0: even for reads from local machine,
     client connects to DN via TCP
     1.0: Short-circuit local reads
           Obtains direct access to underlying local block file,
           then uses regular FileInputStream access.
           2x speedup for random reads

     Configure dfs.client.read.shortcircuit = true
     Configure dfs.block.local-path-access.user = hbase
     Configure dfs.datanode.data.dir.perm = 755
     Currently does not support security
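
  A sketch of the three Configure lines above in their usual
  hdfs-site.xml form (names and values exactly as on this slide):

      <!-- hdfs-site.xml: short-circuit local reads for the hbase user -->
      <property>
        <name>dfs.client.read.shortcircuit</name>
        <value>true</value>
      </property>
      <property>
        <name>dfs.block.local-path-access.user</name>
        <value>hbase</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir.perm</name>
        <value>755</value>
      </property>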
Performance: Hadoop 2.0
     Pre-2.0: Up to 50% CPU spent verifying CRC
     2.0: Native checksums using SSE4.2 crc32
     asm (HDFS-2080)
         2.5x speedup reading from buffer cache
         Now only 15% CPU overhead to checksumming
     Pre-2.0: re-establishes TCP connection to DN
     for each seek
     2.0: Rewritten BlockReader, keepalive to DN
     (HDFS-941)
         40% improvement on random read for HBase
         2-2.5x in micro-benchmarks
     Total improvement vs 0.20.2: 3.4x!
Performance: Hadoop 2.x
     Currently: lots of CPU spent copying data in
     memory
     “Direct-read” API: read directly into
     user-provided DirectByteBuffers (HDFS-2834)
          Another ~2x improvement to sequential
          throughput reading from cache
         Opportunity to avoid two more buffer copies
         reading compressed data (HADOOP-8148)
         Codec APIs still in progress, needs integration into
         HBase
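
  A hedged sketch of the HDFS-2834 style call from client code. It
  assumes the opened stream supports the ByteBuffer read variant; the
  file name is made up:

      import java.nio.ByteBuffer;
      import org.apache.hadoop.fs.*;

      // Direct-read sketch: fill a DirectByteBuffer with no extra on-heap copy.
      FSDataInputStream in = fs.open(new Path("/tmp/hfile")); // 'fs' as in earlier sketch
      ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
      int n = in.read(buf);  // ByteBuffer variant; throws if the stream cannot support it
      buf.flip();            // 'n' bytes now readable straight from the buffer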
Performance: Hadoop 2.x
     True “zero-copy read” support (HDFS-3051)
         New API would allow direct access to mmaped
         block files
         No syscall or JNI overhead for reads
          Initial benchmarks indicate at least ~30% gain.
         Some open questions around best safe
         implementation
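
  The API was still being designed at the time, so the following is
  only an illustration of the underlying mmap idea in plain java.nio,
  not the eventual HDFS interface; the block-file path is hypothetical:

      import java.io.RandomAccessFile;
      import java.nio.MappedByteBuffer;
      import java.nio.channels.FileChannel;

      // Illustration: map a local block file and read it with no per-read
      // syscall or JNI hop -- reads become plain memory accesses.
      RandomAccessFile raf =
          new RandomAccessFile("/data/1/dfs/dn/blk_123", "r"); // hypothetical block file
      MappedByteBuffer block =
          raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
      byte first = block.get(0); // page fault at worst, not a read() syscall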
Current read path [diagram]
Proposed read path [diagram]
Performance: why emphasize CPU?
     Machines with lots of RAM now inexpensive
     (48-96GB common)
     Want to use that to improve cache hit ratios.
     Unfortunately, 50GB+ Java heaps still
     impractical (GC pauses too long)
     Allocate the extra RAM to the buffer cache
         OS caches compressed data: another win!
     CPU overhead reading from buffer cache
     becomes limiting factor for read workloads
What’s up next in 2.x?
     HDFS Hard-links (HDFS-3370)
         Will allow for HBase to clone/snapshot tables
         efficiently!
         Improves HBase table-scoped backup story

     HDFS Snapshots (HDFS-2802)
         HBase-wide snapshot support for point-in-time
         recovery
         Enables consistent backups copied off-site for DR
What’s up next in 2.x?
     Improved block placement policies
     (HDFS-1094)
          Fundamental tradeoff between the probability of data
          unavailability and the amount of data that becomes
          unavailable (see the toy calculation after this list)
          Current scheme: if any 3 nodes that are not all on the
          same rack die, some very small amount of data is
          unavailable
          Proposed scheme: lessen the chances of unavailability,
          but if a certain three nodes die, a larger amount is
          unavailable
          For many HBase applications: any single lost block
          halts the whole operation. Prefer to minimize the
          probability of unavailability.
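
  A toy Java calculation under made-up assumptions (100 DataNodes,
  replication 3, and a grouped scheme that confines each block to one
  fixed group of 3 nodes), showing the shape of the tradeoff:

      // Toy model, hypothetical numbers: 100 DNs, replication factor 3.
      long n = 100;
      long triples = n * (n - 1) * (n - 2) / 6; // C(100,3) = 161,700 possible 3-node failures
      long groups = n / 3;                      // 33 fixed placement groups
      // Random placement: with enough blocks, nearly every one of the
      // 161,700 triples covers some block's full replica set, so almost
      // any 3 simultaneous failures lose a small amount of data.
      // Grouped placement: only the 33 whole-group triples are fatal
      // (~0.02% of them), but a fatal one loses every block in the group.
      System.out.printf("fatal triples: grouped %d / %d = %.4f%n",
          groups, triples, (double) groups / triples);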
What’s up next in 2.x?
     HBase-specific block placement hints
     (HBASE-4755)
         Assign each region a set of three RS (primary and
         two backups)
         Place underlying data blocks on these three DNs
         Could then fail-over and load-balance without
         losing any locality!
Summary

                 Hadoop 1.0            Hadoop 2.0       Hadoop 2.x
 Availability    - DN volume           - NameNode HA    - HA without NAS
                   failure isolation   - Wire Compat    - Rolling upgrade
 Performance     - Short-circuit       - Native CRC     - Direct-read API
                   reads               - DN Keepalive   - Zero-copy API
                                                        - Direct codec API
 Features        - durable hflush()                     - hsync()
                                                        - Snapshots
                                                        - Hard links
                                                        - HBase-aware block
                                                          placement
Summary
     HBase is no longer a second-class citizen.
     We’ve come a long way since Hadoop 0.20.2 in
     performance, reliability, and availability.
     New features coming in the 2.x line specifically
     to benefit HBase use cases
 Hadoop 2.0 features available today via CDH4 beta.
 Several Cloudera customers already using CDH4b2
 with HBase with great success.
 Official Hadoop 2.0 release and CDH4 GA coming
 soon.
Questions?




 todd@cloudera.com
  Twitter: @tlipcon
#hbase IRC: tlipcon

   P.S. we’re hiring!
