
Tutorial: HBase

Theory and Practice of a Distributed Data Store


Pietro Michiardi
Eurecom
Pietro Michiardi (Eurecom) Tutorial: HBase 1 / 100
Introduction
Introduction
Pietro Michiardi (Eurecom) Tutorial: HBase 2 / 100
Introduction RDBMS
Why yet another storage architecture?
Relational Database Management Systems (RDBMS):

Around since the 1970s

Countless examples in which they actually do make sense


The dawn of Big Data:

Previously: data sources were ignored because there was no cost-effective way to store everything

One option was to prune, retaining only the data for the last N days

Today: store everything!

Pruning fails to provide a base on which to build useful mathematical models


Pietro Michiardi (Eurecom) Tutorial: HBase 3 / 100
Introduction RDBMS
Batch processing
Hadoop and MapReduce:

Excels at storing (semi- and/or un-) structured data

Data interpretation takes place at analysis-time

Flexibility in data classification


Batch processing: A complement to RDBMS

Scalable sink for data, processing launched when time is right

Optimized for large file storage

Optimized for streaming access


Random Access:

Users need to interact with data, especially data crunched by a MapReduce job

This is historically where RDBMS excel: random access to structured data
Pietro Michiardi (Eurecom) Tutorial: HBase 4 / 100
Introduction Column-Oriented DB
Column-Oriented Databases
Data layout:

Save their data grouped by columns

Subsequent column values are stored contiguously on disk

This is substantially different from traditional RDBMS, which save and store data by row
Specialized databases for specific workloads:

Reduced I/O

Better suited for compression

Efficient use of bandwidth

Indeed, column values are often very similar and differ little
row-by-row

Real-time access to data


Important NOTE:

HBase is not a column-oriented DB in the typical sense of the term

HBase uses an on-disk column storage format

Provides key-based access to a specific cell of data, or a sequential range of cells
Pietro Michiardi (Eurecom) Tutorial: HBase 5 / 100
Introduction Column-Oriented DB
Column-Oriented and Row-Oriented storage layouts
Figure: Example of Storage Layouts
Pietro Michiardi (Eurecom) Tutorial: HBase 6 / 100
Introduction The problem with RDBMS
The Problem with RDBMS
RDBMS are still relevant

Persistence layer for frontend application

Store relational data

Works well for a limited number of records


Example: Hush

Used throughout this course

URL shortener service


Let's see the scalability story of such a service

Assumption: service must run with a reasonable budget


Pietro Michiardi (Eurecom) Tutorial: HBase 7 / 100
Introduction The problem with RDBMS
The Problem with RDBMS
A few thousand users: use a LAMP stack

Normalize data

Use foreign keys

Use Indexes
Figure: The Hush Schema expressed as an ERD
Pietro Michiardi (Eurecom) Tutorial: HBase 8 / 100
Introduction The problem with RDBMS
The Problem with RDBMS
Find all short URLs for a given user

JOIN user and shorturl tables

Use the WHERE clause to select the given user


Stored Procedures

Consistently update data from multiple clients

Underlying DB system guarantees coherency


Transactions

Make sure you can update tables in an atomic fashion

RDBMS Strong Consistency (ACID properties)

Referential Integrity
Pietro Michiardi (Eurecom) Tutorial: HBase 9 / 100
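To make the relational side concrete, here is a minimal JDBC sketch of the "short URLs for a given user" query, assuming a MySQL-style Hush schema matching the ERD (connection URL, credentials, and column names are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FindShortUrls {
    public static void main(String[] args) throws Exception {
        // Hypothetical JDBC endpoint; schema follows the Hush ERD above.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/hush", "hush", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                 "SELECT s.shortId FROM shorturl s " +
                 "JOIN user u ON s.userId = u.id " +
                 "WHERE u.username = ?")) {
            stmt.setString(1, "some-user");       // select the given user
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("shortId"));
                }
            }
        }
    }
}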
Introduction The problem with RDBMS
The Problem with RDBMS
Scaling up to tens of thousands of users

Increasing pressure on the database server

Adding more application servers is easy: they share their state on the same central DB

CPU and I/O start to be a problem on the DB


Master-Slave architecture

Add DB server so that READS can be served in parallel

Master DB takes all the writes (which are fewer in the Hush
application)

Slave DBs replicate the Master DB and serve all reads (but you need a load balancer)
Pietro Michiardi (Eurecom) Tutorial: HBase 10 / 100
Introduction The problem with RDBMS
The Problem with RDBMS
Scaling up to hundreds of thousands

READS are still the bottleneck

Slave servers begin to fall short in serving clients' requests


Caching

Add a caching layer, e.g. Memcached or Redis

Offload READS to a fast in-memory system


You lose consistency guarantees: cache invalidation is critical for keeping the DB and the caching layer consistent
Pietro Michiardi (Eurecom) Tutorial: HBase 11 / 100
Introduction The problem with RDBMS
The Problem with RDBMS
Scaling up more

WRITES are the bottleneck

The master DB is hit too hard by WRITE load

Vertical scalability: beef up your master server


This becomes costly, as you may also have to replace your RDBMS
SQL JOINs become a bottleneck

Schema de-normalization

Cease using stored procedures, as they become slow and eat up a lot of server CPU

Materialized views (they speed up READS)

Drop secondary indexes as they slow down WRITES


Pietro Michiardi (Eurecom) Tutorial: HBase 12 / 100
Introduction The problem with RDBMS
The Problem with RDBMS
What if your application needs to further scale up?

Vertical scalability vs. Horizontal scalability


Sharding

Partition your data across multiple databases

Essentially you break your tables horizontally and ship them to different servers

This is done using fixed boundaries


Re-sharding to achieve load-balancing
This is an operational nightmare

Re-sharding takes a huge toll on I/O resources


Pietro Michiardi (Eurecom) Tutorial: HBase 13 / 100
Introduction NOSQL
Non-Relational DataBases
Originally, they did not support SQL

In practice, this distinction is becoming a thin line

One difference is in the data model

Another difference is in the consistency model (ACID and transactions are generally sacrificed)
Consistency models and the CAP Theorem

Strict: all changes to data are atomic

Sequential: changes to data are seen in the same order as they were applied

Causal: causally related changes are seen in the same order

Eventual: updates propagate through the system and replicas converge in steady state

Weak: no guarantee
Pietro Michiardi (Eurecom) Tutorial: HBase 14 / 100
Introduction NOSQL
Dimensions to classify NoSQL DBs
Data model

How the data is stored: key/value, semi-structured, column-oriented, ...

How to access data?

Can the schema evolve over time?


Storage model

In-memory or persistent?

How does this affect your access pattern?


Consistency model

Strict or eventual?

This translates into how fast the system handles READS and WRITES
[2]
Pietro Michiardi (Eurecom) Tutorial: HBase 15 / 100
Introduction NOSQL
Dimensions to classify NoSQL DBs
Physical Model

Distributed or single machine?

How does the system scale?


Read/Write performance

Top-down approach: understand the workload well!

Some systems are better for READS, others for WRITES


Secondary indexes

Does your workload require them?

Can your system emulate them?


Pietro Michiardi (Eurecom) Tutorial: HBase 16 / 100
Introduction NOSQL
Dimensions to classify NoSQL DBs
Failure Handling

How does each data store handle server failures?

Is it able to continue operating in case of failures?

This is related to Consistency models and the CAP theorem

Does the system support hot-swap?


Compression

Is the compression method pluggable?

What type of compression?


Load Balancing

Can the storage system seamlessly balance load?


Pietro Michiardi (Eurecom) Tutorial: HBase 17 / 100
Introduction NOSQL
Dimensions to classify NoSQL DBs
Atomic read-modify-write

Easy in a centralized system, difficult in a distributed one

Prevent race conditions in multi-threaded or shared-nothing designs

Can reduce client-side complexity


Locking, waits and deadlocks

Support for multiple client accessing data simultaneously

Is locking available?

Is it wait-free, hence deadlock free?


Impedance Match
One-size-fits-all has long been dismissed: you need to find the perfect match for your problem.
Pietro Michiardi (Eurecom) Tutorial: HBase 18 / 100
Introduction Denormalization
Database (De-)Normalization
Schema design at scale

A good methodology is to apply the DDI principle [8]

Denormalization

Duplication

Intelligent Key design


Denormalization

Duplicate data in more than one table such that at READ time no
further aggregation is required
Next: an example based on Hush

How to convert a classic relational data model to one that fits HBase

This example will be covered in the LAB session 3


Pietro Michiardi (Eurecom) Tutorial: HBase 19 / 100
Introduction Denormalization
Example: Hush - from RDBMS to HBase
Figure: The Hush Schema expressed as an ERD
shorturl table: contains the short URL
click table: contains click tracking, and other statistics,
aggregated on a daily basis (essentially, a counter)
user table: contains user information
URL table: contains a replica of the page linked to a short URL,
including META data and content (this is done for batch analysis
purposes)
Pietro Michiardi (Eurecom) Tutorial: HBase 20 / 100
Introduction Denormalization
Example: Hush - from RDBMS to HBase
Figure: The Hush Schema expressed as an ERD
user table is indexed on the username field, for fast user lookup

shorturl table is indexed on the short URL (shortId) field, for fast short URL lookup
Pietro Michiardi (Eurecom) Tutorial: HBase 21 / 100
Introduction Denormalization
Example: Hush - from RDBMS to HBase
Figure: The Hush Schema expressed as an ERD
shorturl and user tables are related through a foreign key
relation on the userId
URL table is related to shorturl table with a foreign key on the
URL id
click table is related to shorturl table with a foreign key on
the short URL id
NOTE: a web page is stored only once (even if multiple users link to it), but each user maintains separate statistics
Pietro Michiardi (Eurecom) Tutorial: HBase 22 / 100
Introduction Denormalization
Example: Hush - from RDBMS to HBase
Figure: The Hush Schema in HBase
shorturl table: stores each short URL and its usage statistics (various time ranges in separate column families with distinct TTL settings)

Note the dimensional postfix appended to the time information

url table: stores the downloaded page and the extracted details

This table uses compression
Pietro Michiardi (Eurecom) Tutorial: HBase 23 / 100
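As an illustration of per-family TTL settings, a sketch using the classic (pre-2.0) HBase admin API to create a shorturl-like table with two statistics families; the family names and TTL values are illustrative, not Hush's actual schema:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateShortUrlTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("shorturl"));
        // One column family per statistics time range, each with its own TTL.
        HColumnDescriptor daily = new HColumnDescriptor("daily");
        daily.setTimeToLive(30 * 24 * 3600);    // keep daily counters ~30 days (illustrative)
        HColumnDescriptor weekly = new HColumnDescriptor("weekly");
        weekly.setTimeToLive(365 * 24 * 3600);  // keep weekly counters ~1 year (illustrative)
        desc.addFamily(daily);
        desc.addFamily(weekly);
        admin.createTable(desc);
        admin.close();
    }
}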
Introduction Denormalization
Example: Hush - from RDBMS to HBase
Figure: The Hush Schema in HBase
user-shorturl table: this is a lookup table (basically an index) to find all shortIDs for a given user

Note that this table is filled at insert time; it's not automatically generated by HBase
user table: stores user details
Pietro Michiardi (Eurecom) Tutorial: HBase 24 / 100
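A sketch of how such a lookup table is typically queried with the classic client API: a range scan over the user's key prefix. The <userId>-<shortId> row key layout is an assumption for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ShortIdsForUser {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "user-shorturl");
        // Row keys are assumed to be <userId>-<shortId>; scan the user's prefix.
        byte[] start = Bytes.toBytes("u12345-");
        byte[] stop  = Bytes.toBytes("u12345.");  // '.' (0x2E) sorts right after '-' (0x2D)
        Scan scan = new Scan(start, stop);
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
        }
        scanner.close();
        table.close();
    }
}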
Introduction Denormalization
Example: Hush - RDBMS vs HBase
Same number of tables

Their meaning is different

click table has been absorbed by the shorturl table

statistics are stored with the date as the key, so that they can be
accessed sequentially

The user-shorturl table replaces the foreign key relationship, making user-related lookups faster
Normalized vs. De-normalized data

Wide tables and column-oriented design eliminate JOINs

Compound keys are essential

Data partitioning is based on keys, so a proper understanding thereof is essential
Pietro Michiardi (Eurecom) Tutorial: HBase 25 / 100
Introduction HBase Sketch
HBase building blocks
The backdrop: BigTable

GFS, The Google FileSystem [6]

Google MapReduce [4]

BigTable [3]
What is BigTable?

BigTable is a distributed storage system for managing structured data designed to scale to a very large size

BigTable is a sparse, distributed, persistent multi-dimensional sorted map
What is HBase?

Essentially it's an open-source version of BigTable

Differences listed in [5]


Pietro Michiardi (Eurecom) Tutorial: HBase 26 / 100
Introduction HBase Sketch
HBase building blocks
Tables, Rows, Columns, and Cells
The most basic unit in HBase is a column

Each column may have multiple versions, with each distinct value
contained in a separate cell

One or more columns form a row, which is addressed uniquely by a row key
A table is a collection of rows

All rows are always sorted lexicographically by their row key


All rows are always sorted lexicographically by their row key, as the following shell listing shows:

hbase(main):001:0> scan 'table1'
ROW COLUMN+CELL
row-1 column=cf1:, timestamp=1297073325971 ...
row-10 column=cf1:, timestamp=1297073337383 ...
row-11 column=cf1:, timestamp=1297073340493 ...
row-2 column=cf1:, timestamp=1297073329851 ...
row-22 column=cf1:, timestamp=1297073344482 ...
row-3 column=cf1:, timestamp=1297073333504 ...
row-abc column=cf1:, timestamp=1297073349875 ...
7 row(s) in 0.1100 seconds

Note how the numbering is not in sequence as you might expect: in lexicographical sorting, each key is compared on a binary level, byte by byte, from left to right. Since row-1... is less than row-2..., no matter what follows, it sorts first; you may have to pad keys to get the expected order. Having the row keys always sorted gives you something like the primary key index known from RDBMS, and each row key is unique.
Pietro Michiardi (Eurecom) Tutorial: HBase 27 / 100
Introduction HBase Sketch
HBase building blocks
Tables, Rows, Columns, and Cells
Lexicographical ordering of row keys

Keys are compared on a binary level, byte by byte, from left to right

This can be thought of as a primary index on the row key!

Row keys are always unique

Row keys can be any arbitrary array of bytes


Columns

Rows are composed of columns

Can have millions of columns

Can be compressed or tagged to stay in memory


Pietro Michiardi (Eurecom) Tutorial: HBase 28 / 100
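A small plain-Java sketch of the padding issue mentioned above: unpadded numeric suffixes sort "out of order" lexicographically, while zero-padded keys behave as expected:

import java.util.TreeSet;

public class KeyPadding {
    public static void main(String[] args) {
        TreeSet<String> unpadded = new TreeSet<>();
        TreeSet<String> padded = new TreeSet<>();
        for (int i : new int[]{1, 2, 3, 10, 11, 22}) {
            unpadded.add("row-" + i);
            padded.add(String.format("row-%05d", i));
        }
        // Lexicographic order, as HBase would store the rows:
        System.out.println(unpadded); // [row-1, row-10, row-11, row-2, row-22, row-3]
        System.out.println(padded);   // [row-00001, row-00002, row-00003, row-00010, ...]
    }
}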
Introduction HBase Sketch
HBase building blocks
Tables, Rows, Columns, and Cells
Column Families

Columns are grouped into column families


Semantic boundaries between data

Column families and their columns are stored together in the same low-level storage file, called an HFile

Defined when the table is created

Should not be changed too often

The number of column families should be reasonable [WHY?]

Column family names are composed of printable characters


References to columns

Column name is called qualifier, and can be any arbitrary number of bytes

Reference: family:qualifier (also called the column key)


Pietro Michiardi (Eurecom) Tutorial: HBase 29 / 100
Introduction HBase Sketch
HBase building blocks
Tables, Rows, Columns, and Cells
A note on the NULL value

In an RDBMS, NULL cells need to be set and occupy space

In HBase, NULL cells or columns are simply not stored


A cell

Every column value, or cell, is timestamped (implicitly or explicitly)

This can be used to save multiple versions of a value that changes over time

Versions are stored in decreasing timestamp order, most recent first

Cell versions can be constrained by predicate deletions

Keep only values from the last week


Pietro Michiardi (Eurecom) Tutorial: HBase 30 / 100
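A sketch of writing and reading multiple cell versions with the classic client API (table and column names are illustrative; the number of versions kept is bounded by the column family's VERSIONS setting):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class CellVersions {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "testtable"); // assumes a table with family cf1
        byte[] row = Bytes.toBytes("row-1");
        byte[] cf = Bytes.toBytes("cf1");
        byte[] q = Bytes.toBytes("q1");
        // Two writes to the same cell, with explicit timestamps, create two versions.
        table.put(new Put(row).add(cf, q, 1L, Bytes.toBytes("v1")));
        table.put(new Put(row).add(cf, q, 2L, Bytes.toBytes("v2")));
        Get get = new Get(row);
        get.setMaxVersions(3); // ask for up to three versions, newest first
        Result result = table.get(get);
        for (KeyValue kv : result.raw()) {
            System.out.println(kv.getTimestamp() + " -> " + Bytes.toString(kv.getValue()));
        }
        table.close();
    }
}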
Introduction HBase Sketch
HBase building blocks
Tables, Rows, Columns, and Cells
Access to data

(Table, RowKey, Family, Column, Timestamp) -> Value

SortedMap<RowKey, List<SortedMap<Column,
List<Value, Timestamp>>>>

The first SortedMap is the table, containing a List of column families

The families contain another SortedMap, representing columns and a List of (value, timestamp) tuples
A note on consistency:

Row data access is atomic and includes any number of columns

There is no further guarantee or transactional feature spanning multiple rows
HBase is strictly consistent
Pietro Michiardi (Eurecom) Tutorial: HBase 31 / 100
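A conceptual plain-Java sketch of this nested-map view of the data model (ordinary TreeMaps, not HBase's actual internal classes):

import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

public class DataModelSketch {
    public static void main(String[] args) {
        // rowKey -> (family -> (qualifier -> (timestamp -> value)))
        NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> table =
            new TreeMap<>();
        table.computeIfAbsent("row-1", k -> new TreeMap<>())
             .computeIfAbsent("cf1", k -> new TreeMap<>())
             // timestamps sorted in decreasing order: most recent version first
             .computeIfAbsent("q1", k -> new TreeMap<>(Comparator.reverseOrder()))
             .put(System.currentTimeMillis(), "value-1");
        System.out.println(table);
    }
}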
Introduction HBase Sketch
HBase building blocks
Automatic Sharding
Region

This is the basic unit of scalability and load balancing

Regions are contiguous ranges of rows stored together: they are the equivalent of range partitions in a sharded RDBMS

Regions are dynamically split by the system when they become too
large

Regions can also be merged to reduce the number of storage files


Regions in practice

Initially, there is one region

System monitors region size: if a threshold is attained, SPLIT

Regions are split in two at the middle key

This creates roughly two equivalent (in size) regions


Pietro Michiardi (Eurecom) Tutorial: HBase 32 / 100
Introduction HBase Sketch
HBase building blocks
Automatic Sharding
Region Servers

Each region is served by exactly one Region Server

Region servers can serve multiple regions

The number of region servers and their sizes depend on the capability of a single region server
Server failures

Regions allow for fast recovery upon failure

Fine-grained Load Balancing is also achieved using regions, as they can be easily moved across servers
Pietro Michiardi (Eurecom) Tutorial: HBase 33 / 100
Introduction HBase Sketch
HBase building blocks
Storage API
No support for SQL

CRUD operations using a standard API, available for many clients

Data access is not declarative but imperative


Scan API

Allows for fast iteration over ranges of rows

Allows limiting the number and choice of columns returned

Allows controlling the number of versions returned for each cell


Read-modify-write API

HBase supports single-row transactions

Atomic read-modify-write on data stored in a single row key


Pietro Michiardi (Eurecom) Tutorial: HBase 34 / 100
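A sketch of the Scan API with the classic client (table and column names are illustrative), limiting both the columns and the number of versions returned:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "testtable"); // hypothetical table
        Scan scan = new Scan(Bytes.toBytes("row-1"), Bytes.toBytes("row-5")); // [start, stop)
        scan.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q1")); // restrict returned columns
        scan.setMaxVersions(2);                                    // and versions per cell
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            System.out.println(r);
        }
        scanner.close();
        table.close();
    }
}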
Introduction HBase Sketch
HBase building blocks
Storage API
Counters

Values can be interpreted as counters and updated atomically

Can be read and modified in one operation


Implement global, strictly consistent, sequential counters
Coprocessors

These are equivalent to stored-procedures in RDBMS

Allow pushing user code into the address space of the server

Access to server local data

Implement lightweight batch jobs, data pre-processing, data summarization
Pietro Michiardi (Eurecom) Tutorial: HBase 35 / 100
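A sketch of an atomic counter update with the classic client API, following the Hush example of date-keyed click counters (table, family, and qualifier names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class ClickCounter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "shorturl"); // names as in the Hush example
        // Atomically add 1 to the daily counter column keyed by date.
        long clicks = table.incrementColumnValue(
            Bytes.toBytes("a23eg"),     // row: the short ID
            Bytes.toBytes("daily"),     // column family (assumed)
            Bytes.toBytes("20110502"),  // qualifier: the date, YYYYMMDD
            1L);
        System.out.println("clicks so far: " + clicks);
        table.close();
    }
}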
Introduction HBase Sketch
HBase building blocks
HBase implementation
Data Storage

Store files are called HFiles

Persistent and ordered immutable maps from key to value

Internally implemented as sequences of blocks with an index at the end

Index is loaded when the HFile is opened and kept in memory


Data lookups

Since HFiles have a block index, lookup can be done with a single
disk seek

First, the block possibly containing a given lookup key is determined with a binary search in the in-memory index

Then a block read is performed to find the actual key


Underlying file system

Many are supported; usually HBase is deployed on top of HDFS


Pietro Michiardi (Eurecom) Tutorial: HBase 36 / 100
Introduction HBase Sketch
HBase building blocks
HBase implementation
WRITE operation

First, data is written to a commit log, called WAL (write-ahead-log)

Then data is moved into memory, in a structure called memstore

When the size of the memstore exceeds a given threshold, it is flushed to an HFile on disk
How can HBase write, while serving READS and WRITES?

Rolling mechanism

new/empty slots in the memstore take the updates

old/full slots are flushed to disk

Note that data in the memstore is sorted by keys, matching what happens in the HFiles
Data Locality

Achieved by the system looking up server hostnames

Achieved through intelligent key design


Pietro Michiardi (Eurecom) Tutorial: HBase 37 / 100
Introduction HBase Sketch
HBase building blocks
HBase implementation
Deleting data

Since HFiles are immutable, how can we delete data?

A delete marker (also known as a tombstone marker) is written to indicate that a given key is deleted

During the read process, data marked as deleted is skipped

Compactions (see next slides) finalize the deletion process


READ operation

Merge of what is stored in the memstores (data that is not on disk) and in the HFiles

The WAL is never used in the READ operation

Several API calls to read, scan data


Pietro Michiardi (Eurecom) Tutorial: HBase 38 / 100
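A sketch of a delete with the classic client API; note that this only writes a tombstone, and the data is physically removed later, at major compaction time (names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "testtable"); // hypothetical table
        Delete delete = new Delete(Bytes.toBytes("row-1"));
        delete.deleteColumns(Bytes.toBytes("cf1"), Bytes.toBytes("q1")); // all versions of one cell
        table.delete(delete); // writes a tombstone; reads will skip the deleted data
        table.close();
    }
}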
Introduction HBase Sketch
HBase building blocks
HBase implementation
Compactions

Flushing data from memstores to disk implies the creation of new HFiles each time

We end up with many (possibly small) files

We need to do housekeeping [WHY?]
Minor Compaction

Rewrites small HFiles into fewer, larger HFiles

This is done using an n-way merge


1
Major Compaction

Rewrites all files within a column family or a region into a new one

Drops deleted data

Performs predicate deletion (e.g., deletes old data)


1
What is MergeSort?
Pietro Michiardi (Eurecom) Tutorial: HBase 39 / 100
Introduction HBase Sketch
HBase: a glance at the architecture
Every region server creates its own ephemeral node in ZooKeeper, which the master uses to discover available servers and to track server failures or network partitions. HBase also uses ZooKeeper to ensure that there is only one master running and to store the bootstrap location for region discovery: ZooKeeper is a critical component, and without it HBase is not operational.

Figure: HBase using its own components while leveraging existing systems (HDFS, ZooKeeper)
Master node: HMaster

Assigns regions to region servers using ZooKeeper

Handles load balancing

Not part of the data path

Holds metadata and schema


Region Servers

Handle READs and WRITEs

Handle region splitting


Pietro Michiardi (Eurecom) Tutorial: HBase 40 / 100
Architecture
Architecture
Pietro Michiardi (Eurecom) Tutorial: HBase 41 / 100
Architecture Seek vs. Transfer
Seek vs. Transfer
Fundamental difference between RDBMS and alternatives

B+Trees

Log-Structured Merge Trees


Seek vs. Transfer

Random access to individual cells

Sequential access to data


Pietro Michiardi (Eurecom) Tutorial: HBase 42 / 100
Architecture Seek vs. Transfer
B+ Trees
Dynamic, multi-level indexes

Efficient insertion, lookup and deletion

Q: What's the difference between a B+ Tree and a Hash Table?

Frequent updates may imbalance the tree: tree optimization and re-organization is required (which is a costly operation)
Bounds on page size

Number of keys in each branch

Larger fanout compared to binary trees

Lower number of I/O operations to find a specific key


Support for range scans

Leaves are linked and represent an in-order list of all keys

No costly tree-traversal algorithms required


Pietro Michiardi (Eurecom) Tutorial: HBase 43 / 100
Architecture Seek vs. Transfer
LSM-Trees
Data ow

Incoming data is first stored in a logfile, sequentially

Once the log has the modification saved, data is pushed into memory

The in-memory store holds the most recent updates for fast lookup

When memory is full, data is flushed to a store file on disk, as a sorted list of key-record pairs

At this point, the log file can be thrown away


How store les are arranged

Similar idea to a B+ Tree, but optimized for sequential disk access

All nodes of the tree try to be filled up completely

Updates are done in a rolling merge fashion

The system packs existing on-disk multi-page blocks with in-memory data until the block reaches full capacity
Pietro Michiardi (Eurecom) Tutorial: HBase 44 / 100
Architecture Seek vs. Transfer
LSM-Trees
Clean-up process

As flushes take place over time, a lot of store files are created

A background process aggregates files into larger ones to limit disk seeks

All store files are always sorted by key: no re-ordering is required to fit new keys in
Data Lookup

Lookups are done in a merging fashion

First, lookup in the in-memory store

If it misses, lookup continues in the on-disk store


Deleting data

Use a delete marker

When pages are re-written, delete markers and deleted keys are eventually dropped

Predicate deletion happens here


Pietro Michiardi (Eurecom) Tutorial: HBase 45 / 100
Architecture Seek vs. Transfer
B+ Tree vs. LSM-Trees
B+ Tree [1]

Work well when there are not so many updates

The more (and the faster) you insert data at random locations, the faster pages get fragmented

Updates and deletes are done at disk seek rates, rather than
transfer rates
LSM-Tree [7]

Work at disk transfer rate and scale better to huge amounts of data

Guarantee a consistent insert rate

They transform random writes into sequential writes

Reads are independent from writes

Optimized data layout, which offers predictable boundaries on disk seeks
Pietro Michiardi (Eurecom) Tutorial: HBase 46 / 100
Architecture Storage
Storage
Overview
Figure: Overview of how HBase handles files in the filesystem
Pietro Michiardi (Eurecom) Tutorial: HBase 47 / 100
Architecture Storage
Storage
Overview
HBase handles two kinds of file types

One is used for the WAL

One is used for the actual data storage


Who does what

HMaster

Low-level operations

Assigns region servers to key space

Keeps metadata

Talks to ZooKeeper

HRegionServer

Handles the WAL and HFiles

These files are divided into blocks and stored in HDFS

Block size is a parameter


Pietro Michiardi (Eurecom) Tutorial: HBase 48 / 100
Architecture Storage
Storage
Overview
General communication ow

A client contacts ZooKeeper when trying to access a particular row

Recovers from ZooKeeper the server name that hosts the -ROOT- region

Using the -ROOT- information, the client retrieves the server name that hosts the .META. table region

The .META. table region contains the row key in question

Contact the reported .META. server and retrieve the server name
that has the region containing the row key in question
Caching

Generally, lookup procedures involve caching row key locations for faster subsequent lookups
Pietro Michiardi (Eurecom) Tutorial: HBase 49 / 100
Architecture Storage
Storage
Overview
Important Java Classes

HRegionServer handles one or more regions and creates the corresponding HRegion objects

When an HRegion object is opened, it creates a Store instance for each HColumnFamily

Each Store instance can have:

One or more StoreFile instances

A MemStore instance

HRegionServer has a shared HLog instance


Pietro Michiardi (Eurecom) Tutorial: HBase 50 / 100
Architecture Storage
Storage
Write Path
An external client inserts data in HBase

Issues an HTable.put(Put) request to HRegionServer

The HRegionServer hands the request to the HRegion instance that matches the request [Q: What is the matching criterion?]
How the system reacts to a write request

Write data to the WAL, represented by the HLog class

The WAL stores HLogKey instances in an HDFS SequenceFile

These keys contain a sequence number and the actual data

In case of failure, this data can be used to replay not-yet-persisted data

Copy data in the MemStore

Check if MemStore size has reached a threshold

If yes, launch a flush request

Launch a thread in the HRegionServer and flush MemStore data to an HFile
Pietro Michiardi (Eurecom) Tutorial: HBase 51 / 100
Architecture Storage
Storage
HBase Files
What and where are HBase files (including WAL, HFiles, ...) stored?

HBase has a root directory set to /hbase in HDFS

Files can be divided into:

Those that reside under the HBase root directory

Those that are in the per-table directories


/hbase

.logs

.oldlogs

hbase.id

hbase.version

/example-table
Pietro Michiardi (Eurecom) Tutorial: HBase 52 / 100
Architecture Storage
Storage
HBase Files
/example-table
    .tableinfo
    .tmp
    ...Key1.../
        .oldlogs
        .regioninfo
        .tmp
        colfam1/
            ...column-key1...
Pietro Michiardi (Eurecom) Tutorial: HBase 53 / 100
Architecture Storage
Storage
HBase: Root-level les
.logs directory

WAL files handled by HLog instances

Contains a subdir for each HRegionServer

Each subdir contains many HLog files

All regions from that HRegionServer share the same HLog les
.oldlogs directory

When data is persisted to disk (from the memstores), log files are decommissioned to the .oldlogs dir
hbase.id and hbase.version

Represent the unique ID of the cluster and the file format version


Pietro Michiardi (Eurecom) Tutorial: HBase 54 / 100
Architecture Storage
Storage
HBase: Table-level les
Every table has its own directory

.tableinfo: stores the serialized HTableDescriptor

This includes the table and column family schema

.tmp directory

Contains temporary data


Pietro Michiardi (Eurecom) Tutorial: HBase 55 / 100
Architecture Storage
Storage
HBase: Region-level les
Inside each table dir, there is a separate dir for every region
in the table

The name of each of these dirs is the MD5 hash of the region name

Inside each region there is a directory for each column family

Each column family directory holds the actual data files, namely HFiles

Their name is just an arbitrary random number

Each region directory also has a .regioninfo file

Contains the serialized information of the HRegionInfo instance


Split Files

Once the region needs to be split, a splits directory is created

This is used to stage two daughter regions

If the split is successful, daughter regions are moved up to the table directory
Pietro Michiardi (Eurecom) Tutorial: HBase 56 / 100
Architecture Storage
Storage
HBase: A note on region splits
Splits triggered by store file (region) size

Region is split in two

Region is closed to new requests

.META. is updated
Daughter regions initially reside on the same server

Both daughters are compacted

Parent is cleaned up

.META. is updated
Master schedules new regions to be moved off to other
servers
Pietro Michiardi (Eurecom) Tutorial: HBase 57 / 100
Architecture Storage
Storage
HBase: Compaction
Process that takes care of re-organizing store les

Essentially, to conform to underlying filesystem requirements

Compaction check when the memstore is flushed


Minor and Major compactions

Always from the oldest to the newest files

Avoid having all servers perform compactions concurrently


Compactions
Store files are monitored by a background thread: memstore flushes slowly build up an increasing number of on-disk files, and the compaction process combines them into a few, larger files, until the largest exceeds the configured maximum store file size and triggers a region split. Minor compactions rewrite the last few files into one larger one. The number of files is set with hbase.hstore.compaction.min (at least 2, default 3); the maximum number of files per minor compaction with hbase.hstore.compaction.max (default 10). The selection is further narrowed by hbase.hstore.compaction.min.size, hbase.hstore.compaction.max.size, and hbase.hstore.compaction.ratio (default 1.2), and files are always evaluated from the oldest to the newest, so that older files are compacted first.
Figure: A set of store files showing the minimum compaction threshold
Pietro Michiardi (Eurecom) Tutorial: HBase 58 / 100
Architecture Storage
Storage
HFile format
Store files are implemented by the HFile class

Efficient data storage is the goal


HFiles consist of a variable number of blocks

Two fixed blocks: info and trailer

index block: records the offsets of the data and meta blocks

Block size: large for sequential access; small for random access


In contrast to minor compactions, major compactions compact all files into a single file. Which compaction type is run is determined automatically when the compaction check is executed: after a memstore flush, when the compact or major_compact shell commands (or corresponding API calls) are invoked, or periodically by the CompactionChecker background thread. A major compaction is due every hbase.hregion.majorcompaction (default 24 hours), spread out across stores by hbase.hregion.majorcompaction.jitter (default 0.2); otherwise a minor compaction is assumed, which may be promoted to a major one when it would include all store files. The HFile class itself is based on Hadoop's TFile class and mimics the SSTable format used in Google's BigTable architecture; the previous use of Hadoop's MapFile class proved insufficient in terms of performance.
Figure: The HFile structure
Pietro Michiardi (Eurecom) Tutorial: HBase 59 / 100
Architecture Storage
Storage
HFile size and HDFS block size
HBase uses any underlying filesystem
In case HDFS is used

HDFS block size is generally 64MB

This is 1,024 times the default HFile block size (64 KB)
There is no correlation between HDFS block and HFile sizes
Pietro Michiardi (Eurecom) Tutorial: HBase 60 / 100
Architecture Storage
Storage
The KeyValue Format
Each KeyValue in the HFile is a low-level byte array

It allows for zero-copy access to the data


Format

A fixed-length preamble indicates the length of the key and value

This is useful to offset into the array to get direct access to the value,
ignoring the key

Key format

Contains the row key, column family name, column qualifier, ...

[TIP]: consider small keys to avoid overhead when storing small data
Figure: The KeyValue Format
Pietro Michiardi (Eurecom) Tutorial: HBase 61 / 100
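A small sketch of building a KeyValue directly with the classic API, which makes the key overhead visible for a small payload (all values are illustrative):

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueSketch {
    public static void main(String[] args) {
        // A KeyValue is one cell: row, family, qualifier, timestamp, value,
        // serialized into a single byte array with length-prefixed parts.
        KeyValue kv = new KeyValue(
            Bytes.toBytes("row-1"),
            Bytes.toBytes("cf1"),
            Bytes.toBytes("q1"),
            1297073325971L,
            Bytes.toBytes("v"));
        // For a one-byte value, the key part dominates the stored size.
        System.out.println("key length: " + kv.getKeyLength());
        System.out.println("row:        " + Bytes.toString(kv.getRow()));
        System.out.println("value:      " + Bytes.toString(kv.getValue()));
    }
}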
Architecture WAL
The Write-Ahead Log
Main tool to ensure resiliency to failures

Region servers keep data in-memory until enough is collected to warrant a flush

What if the server crashes or power is lost?


WAL is a common approach to address fault-tolerance

Every data update is first written to a log

Log is persisted (and replicated, since it resides on HDFS)

Only when the log is written is the client notified of a successful operation on the data
Pietro Michiardi (Eurecom) Tutorial: HBase 62 / 100
Architecture WAL
The Write-Ahead Log
Figure: The write path of HBase
WAL records all changes to data

Can be replayed in case of server failure

If the write to the WAL fails, the whole operation has to fail


Pietro Michiardi (Eurecom) Tutorial: HBase 63 / 100
Architecture WAL
The Write-Ahead Log
Write Path
• The client modifies data (put(), delete(), increment()); a client-side sketch follows this slide
• Modifications are wrapped into KeyValue objects
• Objects are batched to the corresponding HRegionServer
• Objects are routed to the corresponding HRegion
• Objects are written to the WAL and into the MemStore
Pietro Michiardi (Eurecom) Tutorial: HBase 64 / 100
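To make the write path concrete, here is a minimal client-side sketch, assuming the classic (0.9x-era) HBase Java API described in [5]; the table and column names are hypothetical. Every call below travels the path above: KeyValue over RPC, WAL append, then MemStore insert.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable"); // hypothetical table name

    // put(): wrapped into KeyValues, appended to the WAL, then to the MemStore
    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val1"));
    table.put(put);

    // delete(): follows the same WAL-first route
    table.delete(new Delete(Bytes.toBytes("row-2")));

    // increment(): also logged before the in-memory state is changed
    table.incrementColumnValue(Bytes.toBytes("row-1"),
        Bytes.toBytes("colfam1"), Bytes.toBytes("counter"), 1L);

    table.close();
  }
}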
Architecture Read Path
Read Path
HBase uses multiple store files per column family
• These can be either in-memory and/or materialized on disk
• Compaction and clean-up background processes take care of store file maintenance
• Store files are immutable, so deletion is handled in a special way
The anatomy of a get command (see the sketch after this slide)
• HBase uses a QueryMatcher in combination with a ColumnTracker
• First, an exclusion check is performed to filter skip files (and eventually tombstone-labelled data)
• Scanning data is implemented by a RegionScanner class, which retrieves a StoreScanner
• The StoreScanner includes both the MemStore and the HFiles
• Reads/scans happen in the same order as data is saved
Pietro Michiardi (Eurecom) Tutorial: HBase 65 / 100
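A hedged sketch of the corresponding read, with the same classic client API and hypothetical names; note how the version count and time range give the store-file skipping mechanism something to work with.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadPathSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable"); // hypothetical table name

    Get get = new Get(Bytes.toBytes("row-1"));
    get.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
    get.setMaxVersions(3);                            // read up to three versions
    get.setTimeRange(0L, System.currentTimeMillis()); // store files outside this range can be skipped

    // Served internally by a RegionScanner over a StoreScanner that merges
    // the MemStore and the HFiles.
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
    System.out.println(Bytes.toString(value));

    table.close();
  }
}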
Architecture Region Lookups
Region Lookups
How does a client find the region server hosting a specific row key range?
• HBase uses two special catalog tables, -ROOT- and .META.
• The -ROOT- table is used to refer to all regions in the .META. table
Three-level B+ tree-like operation
• Level 1: a node stored in ZooKeeper, containing the location (region server) of the -ROOT- table
• Level 2: a lookup in the -ROOT- table to find a matching .META. region
• Level 3: retrieve the table region from the .META. table
Pietro Michiardi (Eurecom) Tutorial: HBase 66 / 100
Architecture Region Lookups
Region Lookups
Where to send requests when looking for a specific row key?
• This information is cached, but the first time, or when the cache is stale, or when there is a miss due to compaction, the following procedure applies
Recursive discovery process
• Ask the region server hosting the matching .META. table to retrieve the row key address
• If the information is invalid, it backs out: ask the -ROOT- table where the relevant .META. region is
• If this fails, ask ZooKeeper where the -ROOT- table is
Pietro Michiardi (Eurecom) Tutorial: HBase 67 / 100
Architecture Region Lookups
Region Lookups
Figure: Mapping of user table regions, starting with an empty cache and then performing three lookups (Figure 8-11)
Pietro Michiardi (Eurecom) Tutorial: HBase 68 / 100
Key Design
Key Design
Pietro Michiardi (Eurecom) Tutorial: HBase 69 / 100
Key Design Concepts
Concepts
HBase has two fundamental key structures
• Row key
• Column key
Both can be used to convey meaning
• Because they store particularly meaningful data
• Because their sorting order is important
Pietro Michiardi (Eurecom) Tutorial: HBase 70 / 100
Key Design Concepts
Concepts
Logical vs. on-disk layout of a table
• The main unit of separation within a table is the column family
• The actual columns (as opposed to other column-oriented DBs) are not used to separate data
• Although cells are stored logically in a table format, rows are stored as linear sets of cells
• Cells contain all the vital information inside them
HBase therefore has to also store the row key and column key with every cell so that it
can retain this vital piece of information.
In addition, multiple versions of the same cell are stored as separate, consecutive cells,
adding the required timestamp of when the cell was stored. The cells are sorted in
descending order by that timestamp so that a reader of the data will see the newest
value first, which is the canonical access pattern for the data.
The entire cell, with the added structural information, is called KeyValue in HBase
terms. It has not just the column and actual value, but also the row key and timestamp,
stored for every cell for which you have set a value. The KeyValues are sorted by row
key first, and then by column key in case you have more than one cell per row in one
column family.
The lower-right part of the figure shows the resultant layout of the logical table inside
the physical storage files. The HBase API has various means of querying the stored data,
with decreasing granularity from left to right: you can select rows by row keys and
effectively reduce the amount of data that needs to be scanned when looking for a
specific row, or a range of rows. Specifying the column family as part of the query can
eliminate the need to search the separate storage files. If you only need the data of one
family, it is highly recommended that you specify the family for your read operation.
Although the timestamp, or version, of a cell is farther to the right, it is another
important selection criterion. The store files retain the timestamp range for all stored
cells, so if you are asking for a cell that was changed in the past two hours, but a
particular store file only has data that is four or more hours old, it can be skipped
completely. See also "Read Path" on page 342 for details.
Figure 9-1. Rows stored as linear sets of actual cells, which contain all the vital information
Pietro Michiardi (Eurecom) Tutorial: HBase 71 / 100
Key Design Concepts
Concepts
Logical Layout (Top-Left)
• A table consists of rows and columns
• Columns are the combination of a column family name and a column qualifier
• <cf name: qualifier> is the column key
• Rows have a row key to address all columns of a single logical row
Pietro Michiardi (Eurecom) Tutorial: HBase 72 / 100
Key Design Concepts
Concepts
Folding the Logical Layout (Top-Right)
• The cells of each row are stored one after the other
• Each column family is stored separately
• On disk, all cells of one family reside in an individual StoreFile
• HBase does not store unset cells
• The row and column key is required to address every cell
Pietro Michiardi (Eurecom) Tutorial: HBase 73 / 100
Key Design Concepts
Concepts
Versioning
• Multiple versions of the same cell are stored consecutively, together with the timestamp
• Cells are sorted in descending order of timestamp
• Newest value first
KeyValue object
• The entire cell, with all the structural information, is a KeyValue object
• Contains: row key, column key (<column family: qualifier>), timestamp, and value
• Sorted by row key first, then by column key
Pietro Michiardi (Eurecom) Tutorial: HBase 74 / 100
Key Design Concepts
Concepts
Physical Layout (Lower-Right)
• Select data by row key
  • This reduces the amount of data to scan for a row or a range of rows
• Select data by row key and column key
  • This focuses the system on an individual storage file
• Select data by column qualifier
  • Exact lookups, including filters to omit useless data
A sketch of narrowing a query along these fields follows this slide
Pietro Michiardi (Eurecom) Tutorial: HBase 75 / 100
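The decreasing granularity maps directly onto the client API. A sketch, under the same classic-API assumption, with hypothetical table and column names:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class GranularitySketch {
  public static void narrowScan(HTable table) throws IOException {
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("row-100")); // 1) row key range: least data scanned
    scan.setStopRow(Bytes.toBytes("row-200"));
    scan.addFamily(Bytes.toBytes("colfam1"));   // 2) family: skips other families' store files
    scan.addColumn(Bytes.toBytes("colfam1"),    // 3) qualifier: exact lookup, but each
        Bytes.toBytes("qual1"));                //    KeyValue is still inspected
    ResultScanner scanner = table.getScanner(scan);
    for (Result res : scanner) {
      System.out.println(res);
    }
    scanner.close();
  }
}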
Key Design Concepts
Concepts
Summary of key lookup properties
The next level of query granularity is the column qualifier. You can employ exact column
lookups when reading data, or define filters that can include or exclude the columns
you need to access. But as you will have to look at each KeyValue to check if it should
be included, there is only a minor performance gain.
The value remains the last, and broadest, selection criterion, equaling the column
qualifier's effectiveness: you need to look at each cell to determine if it matches the read
parameters. You can only use a filter to specify a matching rule, making it the least
efficient query option. Figure 9-2 summarizes the effects of using the KeyValue fields.
Figure 9-2. Retrieval performance decreasing from left to right
The crucial part that Figure 9-1 shows is the shift in the lower-lefthand side. Since the
effectiveness of selection criteria greatly diminishes from left to right for a KeyValue,
you can move all, or partial, details of the value into a more significant place, without
changing how much data is stored.
Tall-Narrow Versus Flat-Wide Tables
At this time, you may be asking yourself where and how you should store your data.
The two choices are tall-narrow and flat-wide. The former is a table with few columns
but many rows, while the latter has fewer rows but many columns. Given the explained
query granularity of the KeyValue information, it seems to be advisable to store parts of
the cell's data, especially the parts needed to query it, in the row key, as it has the
highest cardinality.
In addition, HBase can only split at row boundaries, which also enforces the recom-
mendation to go with tall-narrow tables. Imagine you have all emails of a user in a single
row. This will work for the majority of users, but there will be outliers that will have
magnitudes of emails more in their inbox, so many, in fact, that a single row could
outgrow the maximum file/region size and work against the region split facility.
Pietro Michiardi (Eurecom) Tutorial: HBase 76 / 100
Key Design Tall-Narrow vs. Flat-Wide
Tall-Narrow vs. Flat-Wide Tables
Tall-Narrow Tables
• Few columns
• Many rows
Flat-Wide Tables
• Many columns
• Few rows
Given the query granularity explained before
• Store parts of the cell data in the row key
• Furthermore, HBase splits at row boundaries
It is recommended to go for tall-narrow tables
Pietro Michiardi (Eurecom) Tutorial: HBase 77 / 100
Key Design Tall-Narrow vs. Flat-Wide
Tall-Narrow vs. Flat-Wide Tables
Example: email data, version 1
• You have all emails of a user in a single row (e.g., userID is the row key)
• There will be some outliers with orders of magnitude more emails than others
• A single row could outgrow the maximum file/region size and work against the split facility
Example: email data, version 2 (see the row key sketch after this slide)
• Each email of a user is stored in a separate row (e.g., userID:messageID is the row key)
• On disk this makes no difference (see the disk layout figure)
• If the messageID is in the column qualifier or in the row key, each cell still contains a single email message
• The table can be split easily and the query granularity is more fine-grained
Pietro Michiardi (Eurecom) Tutorial: HBase 78 / 100
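A small helper for the version-2 layout; the ':' separator is an illustrative assumption, not a fixed HBase convention:

import org.apache.hadoop.hbase.util.Bytes;

public class MailRowKey {
  // Tall-narrow design: one row per message, keyed by "userID:messageID".
  public static byte[] of(String userId, String messageId) {
    return Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(":"),
        Bytes.toBytes(messageId));
  }
}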
Key Design Partial Key Scans
Partial Key Scans
Partial key scans reinforce the concept of tall-narrow tables
• From the email example: assume you have a separate row per message, across all users
• If you don't have the exact combination of user and message ID, you cannot access a particular message
Partial key scans solve the problem (see the sketch after this slide)
• Specify a start and an end key
• The start key is set to the exact userID only, with the end key set to userID+1
• This triggers the internal lexicographic comparison mechanism
• Since the table does not have an exact match, it positions the scan at <userID>:<lowest-messageID>
• The scan will then iterate over all the messages of an exact user, parse the row key, and get the messageID
Pietro Michiardi (Eurecom) Tutorial: HBase 79 / 100
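A sketch of such a partial key scan, assuming the "userID:messageID" layout from before. Instead of literally computing userID+1, the stop key here uses ';', the byte that sorts right after the ':' separator, which bounds the scan to exactly one user:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class InboxScanSketch {
  public static void scanUser(HTable table, String userId) throws IOException {
    byte[] startRow = Bytes.toBytes(userId + ":"); // positions at <userID>:<lowest-messageID>
    byte[] stopRow  = Bytes.toBytes(userId + ";"); // ';' sorts just after ':'
    ResultScanner scanner = table.getScanner(new Scan(startRow, stopRow));
    for (Result res : scanner) {
      String row = Bytes.toString(res.getRow());
      String messageId = row.substring(row.indexOf(':') + 1); // parse the messageID
      System.out.println(messageId);
    }
    scanner.close();
  }
}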
Key Design Partial Key Scans
Partial Key Scans
Composite keys and atomicity
• Following the email example: a single user inbox now spans many rows
• It is no longer possible to modify a single user inbox in one atomic operation
• Whether this is acceptable or not depends on the application at hand
Pietro Michiardi (Eurecom) Tutorial: HBase 80 / 100
Key Design Time Series Data
Time Series Data
Stream processing of events
• E.g., data coming from a sensor, a stock exchange, a monitoring system, ...
• Such data is a time series: the row key represents the event time
• HBase will store all rows sorted in a distinct range, namely regions with specific start and stop keys
Sequential, monotonously increasing nature of time series data
• All incoming data is written to the same region (and hence the same server)
• Regions become HOT!
• The performance of the whole cluster is bound to that of a single machine
Pietro Michiardi (Eurecom) Tutorial: HBase 81 / 100
Key Design Time Series Data
Time Series Data
Solution to achieve load balancing: Salting

We want data to be spread over all region servers

This can be done, e.g., by prexing the row key with a


non-sequential number
Salting example (the bucket count is an assumption; Math.abs guards against negative hash values)
long timestamp = System.currentTimeMillis();
int numRegionServers = 8; // e.g., the number of region servers
byte prefix = (byte) (Math.abs(Long.valueOf(timestamp).hashCode()) % numRegionServers);
byte[] rowkey = Bytes.add(new byte[] { prefix }, Bytes.toBytes(timestamp));
- Data access needs to be fanned out across many servers (see the fan-out sketch after this slide)
+ Use multiple threads to read for I/O performance: e.g., use the Map phase of MapReduce
Pietro Michiardi (Eurecom) Tutorial: HBase 82 / 100
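A sketch of the read-side fan-out: one Scan per salt bucket, to be executed by parallel threads or map tasks. The bucket count must match the one used when writing (here assumed to be 8):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedScanFanOut {
  public static List<Scan> scansFor(long startTs, long stopTs, int buckets) {
    List<Scan> scans = new ArrayList<Scan>();
    for (int b = 0; b < buckets; b++) {
      byte[] start = Bytes.add(new byte[] { (byte) b }, Bytes.toBytes(startTs));
      byte[] stop  = Bytes.add(new byte[] { (byte) b }, Bytes.toBytes(stopTs));
      scans.add(new Scan(start, stop)); // one time range per bucket
    }
    return scans; // e.g., scansFor(t0, t1, 8) for 8 region servers
  }
}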
Key Design Time Series Data
Time Series Data
Solution to achieve load balancing: Field swap/promotion

Move the timestamp led of the row key or prex it with another
eld

If you already have a composite row key, simply swap elements

Otherwise if you only have the timestamp, you need to promote


another eld

The sequential, monotonously increasing timestamp is moved to a


secondary position in the row key
- You can only access data (especially time ranges) for a given
swapped or promoted eld (but this could be a feature)
+ You achieve load balancing
Pietro Michiardi (Eurecom) Tutorial: HBase 83 / 100
Key Design Time Series Data
Time Series Data
Solution to achieve load balancing: randomization
• byte[] rowkey = MD5(timestamp)
• This gives you a random distribution of the row keys across all available region servers (see the sketch after this slide)
- Less than ideal for range scans
+ Since you can re-hash the timestamp, this solution is good for random access
Pietro Michiardi (Eurecom) Tutorial: HBase 84 / 100
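MD5(timestamp) above is pseudocode; a minimal sketch with the standard Java MessageDigest could look like this:

import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomizedKey {
  public static byte[] of(long timestamp) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    return md5.digest(Bytes.toBytes(timestamp)); // 16 evenly distributed bytes
  }
}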
Key Design Time Series Data
Time Series Data
Summary
Using the salted or promoted field keys can strike a good balance of distribution for
write performance, and sequential subsets of keys for read performance. If you are only
doing random reads, it makes most sense to use random keys: this will avoid creating
region hot-spots.
Time-Ordered Relations
In our preceding discussion, the time series data dealt with inserting new events as
separate rows. However, you can also store related, time-ordered data: using the
columns of a table. Since all of the columns are sorted per column family, you can treat
this sorting as a replacement for a secondary index, as available in RDBMSes. Multiple
secondary indexes can be emulated by using multiple column families, although that
is not the recommended way of designing a schema. But for a small number of indexes,
this might be what you need.
Consider the earlier example of the user inbox, which stores all of the emails of a user
in a single row. Since you want to display the emails in the order they were received,
but, for example, also sorted by subject, you can make use of column-based sorting to
achieve the different views of the user inbox.
Given the advice to keep the number of column families in a table low,
especially when mixing large families with small ones (in terms of stored
data), you could store the inbox inside one table, and the secondary
indexes in another table. The drawback is that you cannot make use of
the provided per-table row-level atomicity. Also see "Secondary Indexes"
on page 370 for strategies to overcome this limitation.
The first decision to make concerns what the primary sorting order is, in other words,
how the majority of users have set the view of their inbox. Assuming they have set the
Figure 9-3. Finding the right balance between sequential read and write performance
Pietro Michiardi (Eurecom) Tutorial: HBase 85 / 100
MapReduce Integration
MapReduce Integration
Pietro Michiardi (Eurecom) Tutorial: HBase 86 / 100
MapReduce Integration Recap
Introduction
In the following, we review the main classes involved in reading and writing data from/to an underlying data store
For MapReduce to work with HBase, some more practical issues have to be addressed
• E.g., creating an appropriate JAR file inclusive of all required libraries
• Refer to [5], Chapter 7 for an in-depth treatment of this subject
Pietro Michiardi (Eurecom) Tutorial: HBase 87 / 100
MapReduce Integration Recap
Main classes involved in MapReduce
Classes
Figure 7-1 also shows you the classes that are involved in the Hadoop implementation
of MapReduce. Let us look at them and also at the specific implementations that HBase
provides on top of them.
Hadoop version 0.20.0 introduced a new MapReduce API. Its classes
are located in the package named mapreduce, while the existing classes
for the previous API are located in mapred. The older API was deprecated
and should have been dropped in version 0.21.0, but that did not happen.
In fact, the old API was undeprecated since the adoption of the new
one was hindered by its incompleteness.
HBase also has these two packages, which only differ slightly. The new
API has more support by the community, and writing jobs against it is
not impacted by the Hadoop changes. This chapter will only refer to the
new API.
InputFormat
The first class to deal with is the InputFormat class (Figure 7-2). It is responsible for two
things. First it splits the input data, and then it returns a RecordReader instance that
defines the classes of the key and value objects, and provides a next() method that is
used to iterate over each input record.
Figure 7-1. The MapReduce process
Figure: Main MapReduce Classes
Pietro Michiardi (Eurecom) Tutorial: HBase 88 / 100
MapReduce Integration Recap
Main classes involved in MapReduce
InputFormat
It is responsible for two things
• Splits the input data
• Returns a RecordReader instance
  • Defines the key and value object classes
  • Provides a next() method to iterate over input records
As far as HBase is concerned, there is a special implementation called TableInput
FormatBase whose subclass is TableInputFormat. The former implements the majority
of the functionality but remains abstract. The subclass is a lightweight concrete version
of TableInputFormat and is used by many supplied samples and real MapReduce classes.
These classes implement the full turnkey solution to scan an HBase table. You have to
provide a Scan instance that you can prepare in any way you want: specify start
and stop keys, add filters, specify the number of versions, and so on. The
TableInputFormat splits the table into proper blocks for you and hands them over to
the subsequent classes in the MapReduce process. See "Table Splits" on page 294 for
details on how the table is split.
Mapper
The Mapper class(es) is for the next stage of the MapReduce process and one of its
namesakes (Figure 7-3). In this step, each record read using the RecordReader is
processed using the map() method. Figure 7-1 also shows that the Mapper reads a
specific type of key/value pair, but emits possibly another type. This is handy for
converting the raw data into something more useful for further processing.
Figure 7-3. The Mapper hierarchy
HBase provides the TableMapper class that enforces key class 1 to be an ImmutableBytes
Writable, and value class 1 to be a Result type, since that is what the
TableRecordReader is returning.
One specific implementation of the TableMapper is the IdentityTableMapper, which is
also a good example of how to add your own functionality to the supplied classes. The
TableMapper class itself does not implement anything but only adds the signatures of
Figure 7-2. The InputFormat hierarchy
Figure: InputFormat hierarchy
Pietro Michiardi (Eurecom) Tutorial: HBase 89 / 100
MapReduce Integration Recap
Main classes involved in MapReduce
InputFormat: TableInputFormatBase
Implements a full turnkey solution to scan an HBase table
• Splits the table into proper blocks and hands them to the MapReduce process
Must supply a Scan instance to interact with a table (see the job setup sketch after this slide)
• Specify start and stop keys for the scan
• Add filters (optional)
• Specify the number of versions
Pietro Michiardi (Eurecom) Tutorial: HBase 90 / 100
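A job setup sketch under these assumptions; the table name and the mapper class (a hypothetical MyTableMapper, sketched after the TableMapper slide below) are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ScanJobSetup {
  public static Job create(Configuration conf) throws Exception {
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("row-0")); // start key
    scan.setStopRow(Bytes.toBytes("row-9"));  // stop key
    scan.setMaxVersions(1);                   // number of versions
    scan.setCaching(500);                     // common tuning for scan-heavy jobs

    Job job = new Job(conf, "scan testtable");
    job.setJarByClass(ScanJobSetup.class);
    // Wires TableInputFormat and the Scan into the job.
    TableMapReduceUtil.initTableMapperJob("testtable", scan,
        MyTableMapper.class, Text.class, Text.class, job);
    return job;
  }
}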
MapReduce Integration Recap
Main classes involved in MapReduce
Mapper
• Each record read using the RecordReader is processed using the map() method
• The Mapper reads a specific type of input key/value pair, but may emit another type
Figure: The Mapper hierarchy
Pietro Michiardi (Eurecom) Tutorial: HBase 91 / 100
MapReduce Integration Recap
Main classes involved in MapReduce
Mapper: TableMapper
The TableMapper class enforces:
• The input key to the mapper to be of ImmutableBytesWritable type
• The input value to be of Result type
A handy implementation is the IdentityTableMapper
• This is the equivalent of an identity mapper
A minimal custom TableMapper is sketched after this slide
Pietro Michiardi (Eurecom) Tutorial: HBase 92 / 100
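A minimal custom TableMapper, matching the enforced key/value types; the column it reads is a hypothetical example:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class MyTableMapper extends TableMapper<Text, Text> {
  @Override
  protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
      throws IOException, InterruptedException {
    // The key arrives as ImmutableBytesWritable, the value as Result, as enforced above.
    byte[] value = columns.getValue(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
    context.write(new Text(Bytes.toString(rowKey.get())),
        new Text(value == null ? "" : Bytes.toString(value)));
  }
}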
MapReduce Integration Recap
Main classes involved in MapReduce
OutputFormat
Used to persist data
• Output written to files
• Output written to HBase tables
  • This is done using a TableRecordWriter
the actual key/value pair classes. The IdentityTableMapper is simply passing on the
keys/values to the next stage of processing.
Reducer
The Reducer stage and class hierarchy (Figure 7-4) is very similar to the Mapper stage.
This time we get the output of a Mapper class and process it after the data has been
shuffled and sorted.
In the implicit shuffle between the Mapper and Reducer stages, the intermediate data is
copied from different Map servers to the Reduce servers and the sort combines the
shuffled (copied) data so that the Reducer sees the intermediate data as a nicely sorted
set where each unique key is now associated with all of the possible values it was found
with.
Figure 7-4. The Reducer hierarchy
OutputFormat
The final stage is the OutputFormat class (Figure 7-5), and its job is to persist the data
in various locations. There are specific implementations that allow output to files, or
to HBase tables in the case of the TableOutputFormat class. It uses a TableRecord
Writer to write the data into the specific HBase output table.
Figure 7-5. The OutputFormat hierarchy
It is important to note the cardinality as well. Although many Mappers are handing
records to many Reducers, only one OutputFormat takes each output record from its
Reducer subsequently. It is the final class that handles the key/value pairs and writes
them to their final destination, this being a file or a table.
Figure: The OutputFormat hierarchy
Pietro Michiardi (Eurecom) Tutorial: HBase 93 / 100
MapReduce Integration Recap
Main classes involved in MapReduce
OutputFormat: TableOutputFormat
This is the class that handles the key/value pairs and writes them to their final destination
• A single instance takes the output records from each reducer subsequently
Details (see the wiring sketch after this slide)
• Must specify the table name when the MR job is created
• Handles buffer flushing implicitly (the autoflush option is set to false)
Pietro Michiardi (Eurecom) Tutorial: HBase 94 / 100
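Wiring the sink is a one-liner with the supplied helper; a sketch, reusing the hypothetical job from the earlier setup and the supplied IdentityTableReducer:

import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class OutputWiring {
  public static void wire(Job job) throws Exception {
    // Sets TableOutputFormat and the target table; the TableRecordWriter
    // handles buffer flushing implicitly.
    TableMapReduceUtil.initTableReducerJob("targettable",
        IdentityTableReducer.class, job);
  }
}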
MapReduce Integration Recap
MapReduce Locality
How does the system make sure data is placed close to where it is needed?
• This is done implicitly by MapReduce when using HDFS
• When MapReduce uses HBase, things are a bit different
How HBase handles data locality
• Shared vs. non-shared cluster
• HBase stores its files on HDFS (HFiles and WAL)
• HBase servers are not restarted frequently and they perform compactions regularly
• HDFS is smart enough to ensure data locality
  • There is a block placement policy that enforces local writes
  • The data node compares the server name of the writer with its own
  • If they match, the block is written to the local filesystem
• Just be careful about region movements during load balancing or server failures
Pietro Michiardi (Eurecom) Tutorial: HBase 95 / 100
MapReduce Integration Recap
Table Splits
When running a MapReduce job that reads from an HBase table, you use the TableInputFormat
• It overrides getSplits() and createRecordReader()
Before a job is run, the framework calls getSplits() to determine how the data is to be separated into chunks
• TableInputFormat, given the Scan instance you define, divides the table at region boundaries
• The number of input splits is equal to the number of regions between the start and stop keys
Pietro Michiardi (Eurecom) Tutorial: HBase 96 / 100
MapReduce Integration Recap
Table Splits
When a job starts, the framework calls createRecordReader() for each input split
• It iterates over the splits and creates a new TableRecordReader with the current split
• Each TableRecordReader handles exactly one region, reading and mapping every row between the region's start and end keys
Data locality
• Each split contains the server name hosting the region
• The framework checks the server name, and if the TaskTracker is running on the same machine, it will run it on that server
• The RegionServer is colocated with the HDFS DataNode, hence data is read from the local filesystem
TIP: Turn off speculative execution! (see the snippet after this slide)
Pietro Michiardi (Eurecom) Tutorial: HBase 97 / 100
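A sketch of that tip, using the classic (pre-YARN) property names; newer Hadoop releases spell them mapreduce.map.speculative and mapreduce.reduce.speculative:

import org.apache.hadoop.mapreduce.Job;

public class SpeculationOff {
  public static void disable(Job job) {
    // Duplicate speculative task attempts would re-read the same regions.
    job.getConfiguration().setBoolean("mapred.map.tasks.speculative.execution", false);
    job.getConfiguration().setBoolean("mapred.reduce.tasks.speculative.execution", false);
  }
}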
References
References I
[1] B+ tree.
http://en.wikipedia.org/wiki/B%2B_tree.
[2] Eric Brewer.
Lessons from giant-scale services.
In IEEE Internet Computing, 2001.
[3] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh,
Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew
Fikes, and Robert E. Gruber.
Bigtable: A distributed storage system for structured data.
In Proc. of USENIX OSDI, 2006.
[4] Jeffrey Dean and Sanjay Ghemawat.
MapReduce: Simplified data processing on large clusters.
In Proc. of ACM OSDI, 2004.
Pietro Michiardi (Eurecom) Tutorial: HBase 98 / 100
References
References II
[5] Lars George.
HBase: The Definitive Guide.
O'Reilly, 2011.
[6] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung.
The Google file system.
In Proc. of ACM OSDI, 2003.
[7] Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil.
The log-structured merge-tree (LSM-tree).
1996.
Pietro Michiardi (Eurecom) Tutorial: HBase 99 / 100
References
References III
[8] D. Salmen.
Cloud data structure diagramming techniques and design patterns.
https://www.data-tactics-corp.com/index.php/component/jdownloads/finish/22-white-papers/68-cloud-data-structure-diagramming, 2009.
Pietro Michiardi (Eurecom) Tutorial: HBase 100 / 100