Large-Scale Data Management: HBase

HBase is a distributed, column-oriented database built on top of HDFS that provides Bigtable-like capabilities for the Hadoop ecosystem. It stores data in tables made up of rows, columns, and versions, with columns grouped into column families. Tables are partitioned into regions, which are distributed across HBase region servers. The region servers handle read and write requests, coordinated by the HBase master server. HBase thus provides a distributed, scalable big data store with real-time random read/write access.

Uploaded by

raj

Large-Scale Data Management

HBase

HBase: Overview
• HBase is a distributed column-oriented data store built on top of HDFS
• HBase is an Apache open source project whose goal is to provide storage for Hadoop distributed computing
• Data is logically organized into tables, rows, and columns

Difference
• Hive and HBase are two different Hadoop-based technologies:
• Hive is an SQL-like engine that runs MapReduce jobs, and
• HBase is a NoSQL key/value database on Hadoop.
• Just as Google can be used for search and Facebook for social networking, Hive can be used for analytical queries while HBase is used for real-time querying.

HBase: Part of Hadoop’s Ecosystem

HBase is built on top of HDFS

HBase files are internally stored in HDFS

HBase vs. HDFS
• Both are distributed systems that scale to hundreds or thousands of nodes

• HDFS is good for batch processing (scans over big files)


• Not good for record lookup
• Not good for incremental addition of small batches
• Not good for updates

HBase vs. HDFS (Cont’d)
• HBase is designed to efficiently address the above points
• Fast record lookup
• Support for record-level insertion
• Support for updates (not in place)

• HBase updates are done by creating new versions of values
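The "not in place" update rule can be illustrated with a small in-memory sketch. This is a toy model, not the HBase client API: the `VersionedCell` class, its methods, and the counter used as a clock are assumptions made for illustration.

```python
import itertools


class VersionedCell:
    """Toy model of one cell: an update adds a version, it never overwrites."""

    _clock = itertools.count(1)  # stand-in for the system timestamp

    def __init__(self):
        self.versions = {}  # timestamp -> value

    def put(self, value, timestamp=None):
        # Each write is stored under its own timestamp; older values stay.
        ts = next(self._clock) if timestamp is None else timestamp
        self.versions[ts] = value
        return ts

    def get(self):
        # A read returns the value with the highest timestamp.
        return self.versions[max(self.versions)]


cell = VersionedCell()
cell.put("CNN")
cell.put("CNN.com")          # the "update" is just a newer version
print(cell.get())            # newest value
print(len(cell.versions))    # the older version is still stored
```

Keeping old versions is what lets writes be appended instead of rewriting data that is already stored.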

HBase vs. HDFS (Cont’d)

If an application needs neither random reads nor random writes, stick to HDFS

HBase Data Model

HBase Data Model
• HBase is based on Google’s Bigtable model
• Key-Value pairs

[Figure labels: Row key, Column Family, TimeStamp, value]

HBase Logical View

HBase: Keys and Column Families
Each record is divided into Column Families

Each row has a Key

Each column family consists of one or more Columns

• Key
• Byte array
• Serves as the primary key for the table
• Indexed for fast lookup
• Column Family
• Has a name (string)
• Contains one or more related columns
• Column
• Belongs to one column family
• Included inside the row
• familyName:columnName

Figure callouts: column family named “contents”; column family named “anchor”; column named “apache.com”

Row key            Time Stamp   Column “contents:”   Column “anchor:”
“com.apache.www”   t12          “<html>…”
                   t11          “<html>…”
                   t10                               “anchor:apache.com” = “APACHE”
“com.cnn.www”      t15                               “anchor:cnnsi.com” = “CNN”
                   t13                               “anchor:my.look.ca” = “CNN.com”
                   t6           “<html>…”
                   t5           “<html>…”
                   t3           “<html>…”
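One way to picture the key, column family, and column terms is as a nested map. The sketch below is an illustrative in-memory model, not HBase code; the `put` helper and the integers standing in for the timestamps t6 to t15 are assumptions.

```python
from collections import defaultdict

# row key -> column family -> column name -> timestamp -> value
table = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(row_key, column, timestamp, value):
    # Columns are addressed as familyName:columnName.
    family, _, qualifier = column.partition(":")
    table[row_key][family][qualifier][timestamp] = value


# Some cells from the example table.
put("com.apache.www", "contents:", 12, "<html>...")
put("com.apache.www", "contents:", 11, "<html>...")
put("com.apache.www", "anchor:apache.com", 10, "APACHE")
put("com.cnn.www", "anchor:cnnsi.com", 15, "CNN")
put("com.cnn.www", "anchor:my.look.ca", 13, "CNN.com")
put("com.cnn.www", "contents:", 6, "<html>...")

print(sorted(table["com.cnn.www"]["anchor"]))
print(table["com.apache.www"]["anchor"]["apache.com"][10])
```

Each level of the map corresponds to one part of the address: row key, then column family, then column, then timestamp.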

Version number for each row

• Version Number
• Unique within each key
• By default, the system’s timestamp
• Data type is Long
• Value (Cell)
• Byte array

[Figure: the same example table as on the previous slide, with a “value” callout]

Notes on Data Model
• HBase schema consists of several Tables
• Each table consists of a set of Column Families
• Columns are not part of the schema

• HBase has Dynamic Columns
• Because column names are encoded inside the cells
• Different cells can have different columns

[Figure callout: the “Roles” column family has different columns in different cells]
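The idea that only column families belong to the schema can be sketched as follows. This is a toy model under stated assumptions: the `Table` class, the row keys, and the column names inside “Roles” are hypothetical and are not taken from the slides or from the HBase API.

```python
class Table:
    """Toy table whose schema lists column families only."""

    def __init__(self, families):
        self.families = set(families)  # fixed when the table is created
        self.rows = {}                 # row key -> {"family:column": value}

    def put(self, row_key, column, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Any column name is accepted inside a known family.
        self.rows.setdefault(row_key, {})[column] = value


users = Table(families=["Roles"])
users.put("alice", "Roles:admin", "yes")
users.put("bob", "Roles:editor", "yes")
users.put("bob", "Roles:reviewer", "yes")

print(sorted(users.rows["alice"]))
print(sorted(users.rows["bob"]))
```

The two rows end up with different columns in the same family, while a write to an undeclared family is rejected.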

Notes on Data Model (Cont’d)
• The version number can be user-supplied
• It does not even have to be inserted in increasing order
• Version numbers are unique within each key
• Table can be very sparse
• Many cells are empty
• Keys are indexed as the primary key

[Figure callout: the row has two columns, cnnsi.com and my.look.ca]
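A minimal sketch of user-supplied versions and sparse storage, assuming a plain dictionary as the store. The `put` and `versions` helpers and the sample values are hypothetical. Replacing a value written under an existing version is how this model keeps versions unique per key.

```python
# (row key, column) -> {version: value}; an empty cell simply has no entry.
cells = {}


def put(row_key, column, version, value):
    cells.setdefault((row_key, column), {})[version] = value


def versions(row_key, column):
    # Returned newest first, whatever order the versions were written in.
    stored = cells.get((row_key, column), {})
    return sorted(stored.items(), reverse=True)


# User-supplied versions, written out of order.
put("com.cnn.www", "contents:", 6, "<html>v6")
put("com.cnn.www", "contents:", 3, "<html>v3")
put("com.cnn.www", "contents:", 5, "<html>v5")
# Writing version 5 again replaces it, so versions stay unique per key.
put("com.cnn.www", "contents:", 5, "<html>v5b")

print(versions("com.cnn.www", "contents:"))
print(versions("com.cnn.www", "anchor:apache.com"))  # empty cell
```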


HBase Physical Model

HBase Physical Model
• Each column family is stored in separate files (HFiles, which these slides call HTables)

• Key & Version numbers are replicated with each column family

• Empty cells are not stored
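The three points above can be sketched by splitting a logical table into one store per column family. This is an illustrative model, not HBase's on-disk format; the `to_physical` function, the sample rows, and the integer timestamps are assumptions.

```python
# Logical rows: row key -> {"family:column": {timestamp: value}}
logical = {
    "com.apache.www": {
        "contents:": {12: "<html>...", 11: "<html>..."},
        "anchor:apache.com": {10: "APACHE"},
    },
    "com.cnn.www": {
        "anchor:cnnsi.com": {15: "CNN"},
        "contents:": {6: "<html>..."},
    },
}


def to_physical(rows):
    """Return one sorted list of (key, column, timestamp, value) per family."""
    stores = {}
    for row_key, columns in rows.items():
        for column, versions in columns.items():
            family = column.split(":", 1)[0]
            for ts, value in versions.items():
                # Row key and timestamp are repeated in every stored entry.
                stores.setdefault(family, []).append((row_key, column, ts, value))
    for entries in stores.values():
        # Order by row key, then column, then newest timestamp first.
        entries.sort(key=lambda e: (e[0], e[1], -e[2]))
    return stores


stores = to_physical(logical)
for family, entries in sorted(stores.items()):
    print(family, len(entries))
```

Only cells that hold a value produce an entry, so empty cells take no space in either store.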

Example

Column Families

HBase Regions
• Each HTable (column family) is partitioned horizontally into regions
• Regions are the counterpart of HDFS blocks

[Figure callout: each partition will be one region]
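Horizontal partitioning can be sketched as splitting the sorted row-key space into ranges and looking up the range that owns a key. The `RegionMap` class and the split points below are hypothetical; the slides do not say how split points are chosen.

```python
import bisect


class RegionMap:
    """Toy lookup from row key to region index, using sorted start keys."""

    def __init__(self, start_keys):
        self.start_keys = sorted(start_keys)
        # The first region starts at the empty key and covers the lowest rows.
        assert self.start_keys[0] == "", "first region must start at the empty key"

    def region_for(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1


regions = RegionMap(["", "com.cnn", "org"])
for key in ["com.apache.www", "com.cnn.www", "org.apache"]:
    print(key, "-> region", regions.region_for(key))
```

Because row keys are kept sorted, each region holds a contiguous range of rows and a lookup is a binary search over the start keys.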

HBase Architecture

Three Major Components
• The HBaseMaster
• One master

• The HRegionServer
• Many region servers

• The HBase client

HBase Architecture
• In HBase, tables are split into regions and are served by the region servers.
• Regions are vertically divided by column families into “Stores”.
• HBase has three major components: the client library, a master server, and region servers.
• Region servers can be added or removed as required.

HBase Architecture
MasterServer
• Assigns regions to the region servers, using Apache ZooKeeper for this task.
• Handles load balancing of the regions across region servers.
• Unloads the busy servers and shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations such as creation of tables and column families.
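The idea of shifting regions from busy servers to less occupied ones can be sketched with a simple count-based rule. This is a toy illustration, not HBase's actual balancer: the `rebalance` function, the server names, and the choice to compare region counts only are assumptions.

```python
def rebalance(assignment):
    """Move regions from the busiest server to the least busy one until the
    region counts differ by at most one. `assignment` maps a server name to
    a list of region names and is modified in place."""
    moves = []
    while True:
        busiest = max(assignment, key=lambda s: len(assignment[s]))
        idlest = min(assignment, key=lambda s: len(assignment[s]))
        if len(assignment[busiest]) - len(assignment[idlest]) <= 1:
            return moves
        region = assignment[busiest].pop()
        assignment[idlest].append(region)
        moves.append((region, busiest, idlest))


cluster = {"rs1": ["r1", "r2", "r3", "r4", "r5"], "rs2": ["r6"], "rs3": []}
moves = rebalance(cluster)
print({server: len(regions) for server, regions in cluster.items()})
print(len(moves), "regions moved")
```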
Regions

• Regions are the pieces of a table that has been split up and spread across the region servers.
• Region servers communicate with the client and handle data-related operations.
• Region servers handle read and write requests for all the regions under them.
• A closer look at a region server shows that it contains regions and stores.
• The store contains a memstore and HFiles. The memstore works like a cache.
• Anything written to HBase is stored in the memstore first.
• Later, the data is transferred and saved in HFiles as blocks, and the memstore is flushed.
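The write path described above can be sketched as a buffer that is written out once it fills up. This is a simplified model, not HBase code: the `Store` class, the entry-count threshold, and the read order in `get` are assumptions, and the block layout inside an HFile is not modeled.

```python
class Store:
    """Toy store: writes go to an in-memory memstore, then flush to HFiles."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical limit, in entries
        self.memstore = {}
        self.hfiles = []  # each flushed file is a sorted list of (key, value)

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the memstore contents out as one sorted file, then empty it.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, key):
        # Check the memstore first, then the files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for k, v in hfile:
                if k == key:
                    return v
        return None


store = Store(flush_threshold=3)
for i in range(4):
    store.put(f"row{i}", f"value{i}")

print(len(store.hfiles), "file(s) flushed;", len(store.memstore), "entry in memstore")
```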

ZooKeeper
• ZooKeeper is an open-source project that provides services such as maintaining configuration information, naming, and distributed synchronization.
• ZooKeeper has ephemeral nodes representing the different region servers. Master servers use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures or network partitions.
• Clients locate region servers via ZooKeeper and then communicate with them directly.
• In pseudo-distributed and standalone modes, HBase itself manages ZooKeeper.
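How ephemeral nodes reveal server failures can be sketched with a small stand-in for the coordination service. This is not the ZooKeeper API: the `ToyCoordinator` class, the session ids, and the `/rs/...` paths are hypothetical.

```python
class ToyCoordinator:
    """Stand-in for ZooKeeper: ephemeral nodes vanish when their session ends."""

    def __init__(self):
        self.ephemeral = {}  # node path -> owning session id

    def register(self, session_id, path):
        self.ephemeral[path] = session_id

    def close_session(self, session_id):
        # A crash or network partition ends the session and removes its nodes.
        self.ephemeral = {
            path: owner for path, owner in self.ephemeral.items() if owner != session_id
        }

    def children(self, prefix):
        return sorted(path for path in self.ephemeral if path.startswith(prefix))


zk = ToyCoordinator()
zk.register("session-1", "/rs/server-a")
zk.register("session-2", "/rs/server-b")
before = zk.children("/rs/")
zk.close_session("session-2")  # server-b fails
after = zk.children("/rs/")
print(before, "->", after)
```

A master that lists the nodes before and after sees which server has disappeared without having to poll the servers themselves.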

Big Picture

Get()
Select value from table where key=‘com.apache.www’ AND label=‘anchor:apache.com’

Row key            Time Stamp   Column “anchor:”
“com.apache.www”   t12
                   t11
                   t10          “anchor:apache.com” = “APACHE”
“com.cnn.www”      t9           “anchor:cnnsi.com” = “CNN”
                   t8           “anchor:my.look.ca” = “CNN.com”
                   t6
                   t5
                   t3

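The Get() query above amounts to a direct lookup by row key and column. A minimal sketch, assuming a dictionary as the store; the `get` function here is a hypothetical helper, not the HBase client call, and the integers stand in for the timestamps t8 to t10.

```python
# (row key, column) -> {timestamp: value}, using rows from the table above
cells = {
    ("com.apache.www", "anchor:apache.com"): {10: "APACHE"},
    ("com.cnn.www", "anchor:cnnsi.com"): {9: "CNN"},
    ("com.cnn.www", "anchor:my.look.ca"): {8: "CNN.com"},
}


def get(row_key, column):
    """Return the newest value stored for one row key and column, or None."""
    versions = cells.get((row_key, column))
    if not versions:
        return None
    return versions[max(versions)]


print(get("com.apache.www", "anchor:apache.com"))
print(get("com.apache.www", "anchor:cnnsi.com"))
```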
Scan()
Select value from table where anchor=‘cnnsi.com’

Row key            Time Stamp   Column “anchor:”
“com.apache.www”   t12
                   t11
                   t10          “anchor:apache.com” = “APACHE”
“com.cnn.www”      t9           “anchor:cnnsi.com” = “CNN”
                   t8           “anchor:my.look.ca” = “CNN.com”
                   t6
                   t5
                   t3

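Unlike Get(), the Scan() query above names no row key, so every row has to be visited. A minimal sketch under that assumption; the `scan` generator is a hypothetical helper, not the HBase client call.

```python
# row key -> column -> {timestamp: value}, using rows from the table above
rows = {
    "com.apache.www": {"anchor:apache.com": {10: "APACHE"}},
    "com.cnn.www": {
        "anchor:cnnsi.com": {9: "CNN"},
        "anchor:my.look.ca": {8: "CNN.com"},
    },
}


def scan(column):
    """Yield (row key, timestamp, value) for every row that has the column."""
    for row_key in sorted(rows):  # rows are visited in row-key order
        versions = rows[row_key].get(column, {})
        for ts, value in sorted(versions.items(), reverse=True):
            yield row_key, ts, value


print(list(scan("anchor:cnnsi.com")))
```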
Operations On Regions: Delete()

• Marking table cells as deleted

• Multiple levels
• Can mark an entire column family as deleted
• Can mark all column families of a given row as deleted
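Marking cells as deleted, rather than removing them, can be sketched with a marker value. This is an illustrative model only; the `Row` class, the `TOMBSTONE` marker, and the method names are hypothetical.

```python
TOMBSTONE = object()  # marker meaning "deleted"; hypothetical, for illustration


class Row:
    def __init__(self):
        self.cells = {}  # "family:column" -> value or TOMBSTONE

    def put(self, column, value):
        self.cells[column] = value

    def delete_family(self, family):
        # Mark every column of one column family as deleted.
        for column in self.cells:
            if column.split(":", 1)[0] == family:
                self.cells[column] = TOMBSTONE

    def delete_row(self):
        # Mark all column families of the row as deleted.
        for column in self.cells:
            self.cells[column] = TOMBSTONE

    def get(self, column):
        value = self.cells.get(column)
        return None if value is TOMBSTONE else value


row = Row()
row.put("anchor:cnnsi.com", "CNN")
row.put("contents:", "<html>...")
row.delete_family("anchor")
print(row.get("anchor:cnnsi.com"), row.get("contents:"))
```

The marked cell is still present in `row.cells`; reads simply treat it as absent.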

HBase: Joins
• HBase does not support joins

• Can be done in the application layer


• Using scan() and get() operations
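An application-layer join can be sketched as a scan over one table with a get against the other for each row. The tables, column names, and the `scan`, `get`, and `join_orders_with_users` helpers below are all hypothetical stand-ins, not HBase client calls.

```python
# Two toy tables, each mapping row key -> {column: value}.
users = {
    "u1": {"info:name": "Ada"},
    "u2": {"info:name": "Grace"},
}
orders = {
    "o1": {"order:user": "u1", "order:item": "book"},
    "o2": {"order:user": "u2", "order:item": "pen"},
    "o3": {"order:user": "u1", "order:item": "lamp"},
}


def scan(table):
    """Yield (row key, columns) in row-key order."""
    for row_key in sorted(table):
        yield row_key, table[row_key]


def get(table, row_key):
    """Return the columns of one row, or None if the row does not exist."""
    return table.get(row_key)


def join_orders_with_users():
    # Scan one table and look up the matching row in the other.
    for order_id, order in scan(orders):
        user = get(users, order["order:user"])
        if user is not None:
            yield order_id, user["info:name"], order["order:item"]


result = list(join_orders_with_users())
print(result)
```

The cost is one get per scanned row, which is the price of doing the join outside the store.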

Logging Operations

HBase Deployment

[Figure labels: Master node; Slave nodes]

HBase vs. RDBMS

When to use HBase
