Large-Scale Data Management: HBase
HBase: Overview
• HBase is a distributed column-oriented data
store built on top of HDFS
Difference
• Hive and HBase are two different Hadoop-based technologies
HBase: Part of Hadoop’s Ecosystem
HBase vs. HDFS
• Both are distributed systems that scale to hundreds or
thousands of nodes
HBase vs. HDFS (Cont’d)
• HBase is designed to efficiently address the above points
• Fast record lookup
• Support for record-level insertion
• Support for updates (not in place)
HBase vs. HDFS (Cont’d)
HBase Data Model
HBase Data Model
• HBase is based on Google’s Bigtable model
• Key-Value pairs: each value is addressed by its Row key, Column Family (and the column within it), and TimeStamp
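The key-value structure can be pictured as a nested map. The following is a toy in-memory sketch in Python, not the HBase client API; the names `put` and `get_latest` and the sample values are assumptions made for illustration.

```python
from collections import defaultdict

# Toy model: row key -> column family -> column -> timestamp -> value.
# It only illustrates the shape of the data; it is not HBase code.
table = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(row_key, family, column, timestamp, value):
    """Store one cell version under (row key, family, column, timestamp)."""
    table[row_key][family][column][timestamp] = value


def get_latest(row_key, family, column):
    """Return the value with the highest timestamp, or None if absent."""
    versions = table[row_key][family][column]
    if not versions:
        return None
    return versions[max(versions)]


put("com.cnn.www", "anchor", "cnnsi.com", 9, "CNN")
put("com.cnn.www", "contents", "", 5, "<html>v5")
put("com.cnn.www", "contents", "", 6, "<html>v6")

print(get_latest("com.cnn.www", "contents", ""))
```

Keeping the timestamp as the innermost key is what lets several versions of the same cell coexist.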
HBase Logical View
HBase: Keys and Column Families
Each record is divided into Column Families
• Key
  • Byte array
  • Serves as the primary key for the table
  • Indexed for fast lookup
• Column Family
  • Has a name (string)
  • Contains one or more related columns
• Column
  • Belongs to one column family
  • Included inside the row
  • familyName:columnName
Example table (column family named “contents”, column family named “anchor”, column named “apache.com”):
Row key | Time Stamp | Column “contents:” | Column “anchor:”
“com.apache.www” | t12 | “<html>…” |
“com.apache.www” | t11 | “<html>…” |
“com.apache.www” | t10 | | “anchor:apache.com” = “APACHE”
“com.cnn.www” | t15 | | “anchor:cnnsi.com” = “CNN”
“com.cnn.www” | t13 | | “anchor:my.look.ca” = “CNN.com”
“com.cnn.www” | t6 | “<html>…” |
“com.cnn.www” | t5 | “<html>…” |
“com.cnn.www” | t3 | “<html>…” |
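The familyName:columnName convention can be illustrated with a small parsing helper. This is a sketch only; `split_column` is a hypothetical name, not an HBase function, and the choice to split on the first ":" is an assumption of the sketch.

```python
def split_column(qualified_name):
    """Split 'familyName:columnName' into (family, column).

    Only the first ':' separates the two parts, so in this sketch the
    column part may be empty (as in 'contents:').
    """
    family, sep, column = qualified_name.partition(":")
    if not sep or not family:
        raise ValueError(
            f"expected familyName:columnName, got {qualified_name!r}"
        )
    return family, column


print(split_column("anchor:apache.com"))
print(split_column("contents:"))
```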
• Version Number
[Figure: table with columns Row key, Time Stamp, Column “contents:”, Column “anchor:”, annotated “Version number for each row”; the visible entries are “<html>…” at t5 and t3]
Notes on Data Model
• HBase schema consists of several Tables
• Each table consists of a set of Column Families
• Columns are not part of the schema
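A minimal sketch of what "columns are not part of the schema" means in practice, assuming a toy in-memory table. `ToyTable` and its methods are hypothetical names, not the HBase API: families are fixed when the table is created, while column names are accepted freely at write time.

```python
class ToyTable:
    """Table whose schema lists column families only (illustrative)."""

    def __init__(self, name, families):
        self.name = name
        self.families = frozenset(families)  # fixed at table creation
        self.rows = {}  # row key -> {(family, column): value}

    def put(self, row_key, family, column, value):
        # The family must have been declared up front...
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # ...but any column name inside it is accepted without a schema change.
        self.rows.setdefault(row_key, {})[(family, column)] = value


webtable = ToyTable("webtable", ["contents", "anchor"])
webtable.put("com.cnn.www", "anchor", "cnnsi.com", "CNN")
webtable.put("com.cnn.www", "anchor", "my.look.ca", "CNN.com")  # new column
```

Writing to an undeclared family (for example `"metrics"`) raises `KeyError` in this sketch, which mirrors the idea that adding a family is a schema change while adding a column is not.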
Notes on Data Model (Cont’d)
• The version number can be user-supplied
• Versions do not even have to be inserted in increasing order
• Version numbers are unique within each key
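The version rules can be sketched with a toy cell that keeps values keyed by version number. `VersionedCell` is a hypothetical class, not part of HBase, and the behaviour that re-writing an existing version replaces its value is an assumption of this sketch.

```python
class VersionedCell:
    """One cell holding several versions of a value (illustrative)."""

    def __init__(self):
        self._versions = {}  # version number -> value

    def put(self, value, version):
        # Versions are unique within the cell: writing the same version
        # again replaces the earlier value (assumption of this sketch).
        self._versions[version] = value

    def get(self, max_versions=1):
        # Return (version, value) pairs, newest version first, regardless
        # of the order in which they were inserted.
        newest_first = sorted(self._versions.items(), reverse=True)
        return newest_first[:max_versions]


cell = VersionedCell()
cell.put("<html>c", version=6)
cell.put("<html>a", version=3)   # inserted out of order
cell.put("<html>b", version=5)
cell.put("<html>b2", version=5)  # same version, replaces "<html>b"

print(cell.get())
print(cell.get(max_versions=3))
```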
HBase Physical Model
• Each column family is stored in separate files (HFiles in current HBase; these slides call them HTables)
• Keys and version numbers are replicated with each column family
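The physical layout can be sketched by splitting logical cells into one store per column family, with the row key and version repeated in every entry. This is a simplified illustration; `split_by_family` and the sample data are assumptions, and real storage files hold far more detail.

```python
from collections import defaultdict

# (row key, family, column, version, value); sample data for illustration.
logical_cells = [
    ("com.apache.www", "contents", "", 12, "<html>..."),
    ("com.apache.www", "anchor", "apache.com", 10, "APACHE"),
    ("com.cnn.www", "anchor", "cnnsi.com", 15, "CNN"),
    ("com.cnn.www", "contents", "", 6, "<html>..."),
]


def split_by_family(cells):
    """Group cells into one store per column family.

    Every entry keeps its own copy of the row key and version, which is
    the replication the slide refers to.
    """
    stores = defaultdict(list)
    for row_key, family, column, version, value in cells:
        stores[family].append((row_key, column, version, value))
    for entries in stores.values():
        # Sort by row key, then column, then newest version first.
        entries.sort(key=lambda e: (e[0], e[1], -e[2]))
    return dict(stores)


stores = split_by_family(logical_cells)
for family, entries in stores.items():
    print(family, entries)
```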
Example
Column Families
HBase Regions
• Each HTable (column family) is partitioned horizontally, by row-key range, into regions
• Regions are the counterpart of HDFS blocks
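Horizontal partitioning means each region covers a contiguous range of row keys. A minimal sketch of locating the region for a key with a binary search over region start keys follows; the region names and boundaries are hypothetical.

```python
import bisect

# Hypothetical regions, each covering [start key, next start key).
region_start_keys = ["", "com.cnn.www", "org.apache"]
region_names = ["region-0", "region-1", "region-2"]


def find_region(row_key):
    """Return the name of the region whose key range contains row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position owns the key.
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_names[index]


print(find_region("com.apache.www"))
print(find_region("com.cnn.www"))
print(find_region("org.zzz"))
```

Because the start keys are sorted, the lookup stays logarithmic in the number of regions.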
HBase Architecture
Three Major Components
• The HBaseMaster
• One master
• The HRegionServer
• Many region servers
HBase Architecture
• In HBase, tables are split into regions and are served by
the region servers.
HBase Architecture
Master Server
• Assigns regions to the region servers, using Apache ZooKeeper for this task.
• Handles load balancing of the regions across region servers.
• Unloads busy servers by shifting regions to less-occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations, such as creation of tables and column families.
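The load-balancing idea (move regions from busy servers to less-occupied ones) can be sketched as below. This is a deliberately naive illustration that only counts regions per server; `rebalance` is a hypothetical function and is not how the HBase balancer is implemented.

```python
def rebalance(assignment):
    """Even out region counts across servers.

    assignment maps a server name to its list of regions. Regions are
    moved from the busiest server to the least busy one until the counts
    differ by at most one. The input is left unmodified.
    """
    servers = {server: list(regions) for server, regions in assignment.items()}
    if not servers:
        return servers
    while True:
        busiest = max(servers, key=lambda s: len(servers[s]))
        idlest = min(servers, key=lambda s: len(servers[s]))
        if len(servers[busiest]) - len(servers[idlest]) <= 1:
            return servers
        servers[idlest].append(servers[busiest].pop())


before = {"rs1": ["r1", "r2", "r3", "r4", "r5"], "rs2": ["r6"], "rs3": []}
after = rebalance(before)
print({server: len(regions) for server, regions in after.items()})
```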
Regions
ZooKeeper
• ZooKeeper is an open-source project that provides services such as maintaining configuration information, naming, and distributed synchronization.
• ZooKeeper has ephemeral nodes representing different region servers. Master servers use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures or network partitions.
• Clients use ZooKeeper to locate region servers, then communicate with those servers directly.
• In pseudo-distributed and standalone modes, HBase itself manages ZooKeeper.
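How ephemeral nodes let a master discover live region servers can be sketched with a toy registry. This simulates the idea in memory only; `ToyEphemeralRegistry`, the `/toy/rs/` path prefix, and the session identifiers are all hypothetical and do not reflect the ZooKeeper client API.

```python
class ToyEphemeralRegistry:
    """In-memory stand-in for ephemeral nodes (illustrative only)."""

    def __init__(self):
        self._nodes = {}  # node path -> id of the session that created it

    def register(self, session_id, server_name):
        # A region server creates a node tied to its own session.
        self._nodes[f"/toy/rs/{server_name}"] = session_id

    def expire_session(self, session_id):
        # When a session ends (crash, network partition), every node it
        # created disappears with it.
        self._nodes = {
            path: owner
            for path, owner in self._nodes.items()
            if owner != session_id
        }

    def live_servers(self):
        # The master lists the remaining nodes to see which servers are up.
        return sorted(path.rsplit("/", 1)[1] for path in self._nodes)


registry = ToyEphemeralRegistry()
registry.register("session-1", "rs1")
registry.register("session-2", "rs2")
registry.expire_session("session-1")  # rs1 fails
print(registry.live_servers())
```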
Big Picture
Get()
Select value from table where key=‘com.apache.www’ AND label=‘anchor:apache.com’
[Figure: table with columns Row key, Time Stamp, Column “anchor:”; rows “com.apache.www” and “com.cnn.www”; visible entries are “anchor:cnnsi.com” = “CNN” at t9 and “anchor:my.look.ca” = “CNN.com” at t8]
Scan()
Select value from table where anchor=‘cnnsi.com’
[Figure: table with columns Row key, Time Stamp, Column “anchor:”; rows “com.apache.www” and “com.cnn.www”; visible entries are “anchor:cnnsi.com” = “CNN” at t9 and “anchor:my.look.ca” = “CNN.com” at t8]
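The difference between the two operations can be sketched on a toy in-memory table: Get() is a direct lookup by row key and column, while Scan() walks the rows in key order and filters. The functions `get` and `scan` below are illustrative stand-ins, not the HBase client API, and the sample cells are loosely modelled on the example table.

```python
# (row key, column) -> {timestamp: value}; sample data for illustration.
cells = {
    ("com.apache.www", "anchor:apache.com"): {10: "APACHE"},
    ("com.cnn.www", "anchor:cnnsi.com"): {9: "CNN"},
    ("com.cnn.www", "anchor:my.look.ca"): {8: "CNN.com"},
}


def get(row_key, column):
    """Point lookup: newest value for one (row key, column), or None."""
    versions = cells.get((row_key, column))
    if not versions:
        return None
    return versions[max(versions)]


def scan(column):
    """Range read: visit rows in key order, yielding matches for column."""
    for (row_key, col), versions in sorted(cells.items()):
        if col == column:
            yield row_key, versions[max(versions)]


print(get("com.apache.www", "anchor:apache.com"))
print(list(scan("anchor:cnnsi.com")))
```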
Operations On Regions: Delete()
• Multiple levels
• Can mark an entire column family as deleted
• Can mark all column families of a given row as deleted
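Marking data as deleted at different levels can be sketched with delete markers that hide cells from reads without removing them. `ToyRow` and its methods are hypothetical names used for illustration; this is not the HBase API.

```python
class ToyRow:
    """Row whose deletes are recorded as markers (illustrative)."""

    def __init__(self, cells):
        self._cells = dict(cells)       # (family, column) -> value
        self._deleted_families = set()  # family-level delete markers
        self._row_deleted = False       # row-level delete marker

    def delete_family(self, family):
        # Mark an entire column family as deleted.
        self._deleted_families.add(family)

    def delete_row(self):
        # Mark all column families of this row as deleted.
        self._row_deleted = True

    def visible_cells(self):
        # Reads skip anything covered by a marker; stored data is untouched.
        if self._row_deleted:
            return {}
        return {
            key: value
            for key, value in self._cells.items()
            if key[0] not in self._deleted_families
        }


row = ToyRow({
    ("anchor", "cnnsi.com"): "CNN",
    ("contents", ""): "<html>...",
})
row.delete_family("anchor")
print(row.visible_cells())
```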
HBase: Joins
• HBase does not support joins
Logging Operations
HBase Deployment
[Figure: a master node and slave nodes]
HBase vs. RDBMS
When to use HBase