HBase

Uploaded by

mytempemail2023
HBase

HBase is a data model designed to provide quick random access to huge amounts of data.

Since the 1970s, the RDBMS has been the standard solution for data storage and maintenance
problems. After the advent of big data, companies realized the benefit of processing it and
started opting for solutions like Hadoop.

What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is
an open-source project and is horizontally scalable. It leverages the fault tolerance provided
by the Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data
in the Hadoop File System.
Data can be stored in HDFS either directly or through HBase. Data consumers read and access
the data in HDFS randomly using HBase, which sits on top of the Hadoop File System and
provides read and write access.

HBase and HDFS


HDFS | HBase
HDFS is a distributed file system suitable for storing large files. | HBase is a database built on top of HDFS.
HDFS does not support fast individual record lookups. | HBase provides fast lookups for large tables.
It provides high-latency batch processing. | It provides low-latency access to single rows from billions of records (random access).
It provides only sequential access to data. | HBase keeps rows sorted by key and stores the data in indexed HDFS files (HFiles), which allows random access and faster lookups.
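
The difference between sequential access and indexed random access can be sketched in a few lines. This is a minimal illustration, not HBase or HDFS code: the record list, the key format, and the function names `sequential_lookup` and `indexed_lookup` are all assumptions made for the example, and a binary search over sorted keys stands in for a real file index.

```python
import bisect

# Hypothetical data set: 10,000 records, sorted by zero-padded row key.
records = [(f"row{i:05d}", f"value-{i}") for i in range(10_000)]
keys = [k for k, _ in records]


def sequential_lookup(records, key):
    """Scan from the start, as with a plain file that has no index."""
    steps = 0
    for k, v in records:
        steps += 1
        if k == key:
            return v, steps
    return None, steps


def indexed_lookup(records, keys, key):
    """Binary search over the sorted keys, standing in for an index."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return records[i][1]
    return None


value, steps = sequential_lookup(records, "row09999")
print(value, steps)                               # value-9999 10000
print(indexed_lookup(records, keys, "row09999"))  # value-9999
```

The sequential version has to touch every record before it reaches the last one, while the sorted-key search needs only about 14 comparisons for the same 10,000 records.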

Storage Mechanism in HBase


HBase is a column-oriented database, and its tables are sorted by row key. The table schema
defines only column families, which hold key-value pairs. A table can have multiple column
families, and each column family can have any number of columns. Values within a column
family are stored together on disk. Each cell value in the table has a timestamp. In
short, in HBase:

• Table is a collection of rows.
• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key-value pairs.

Given below is an example schema of a table in HBase.

Rowid | Column Family  | Column Family  | Column Family  | Column Family
      | col1 col2 col3 | col1 col2 col3 | col1 col2 col3 | col1 col2 col3
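
The nesting described above (table, row, column family, column, timestamped cell) can be modeled with plain dictionaries. The following is a minimal in-memory sketch under stated assumptions, not the HBase client API: the class name `ToyTable`, its methods, and the sample family and column names are all hypothetical.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of an HBase-style table (illustration only)."""

    def __init__(self, column_families):
        # The schema fixes only the column families, not the columns.
        self.column_families = set(column_families)
        # row key -> family -> column -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    def put(self, row, family, column, value, timestamp):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][family][column][timestamp] = value

    def get(self, row, family, column):
        """Return the value with the newest timestamp, or None."""
        versions = self.rows.get(row, {}).get(family, {}).get(column, {})
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self):
        """Return row keys in sorted order, since tables are sorted by row."""
        return sorted(self.rows)


table = ToyTable(["personal", "professional"])
table.put("row2", "personal", "name", "Ravi", timestamp=1)
table.put("row1", "personal", "city", "Pune", timestamp=1)
table.put("row1", "personal", "city", "Delhi", timestamp=2)
print(table.get("row1", "personal", "city"))  # Delhi
print(table.scan())                           # ['row1', 'row2']
```

New columns can be added on any write without a schema change, while a write to an undeclared column family is rejected.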

Column Oriented and Row Oriented

Column-oriented databases store data tables as sections of columns of data rather than as
rows of data. In short, they have column families.

Row-Oriented Database | Column-Oriented Database
It is suitable for Online Transaction Processing (OLTP). | It is suitable for Online Analytical Processing (OLAP).
Such databases are designed for a small number of rows and columns. | Column-oriented databases are designed for huge tables.
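
The two layouts can be compared with a small sketch. This is an illustration of the general idea only, using made-up sample records; it does not reflect how any particular database lays out bytes on disk.

```python
# Hypothetical sample records.
rows = [
    {"id": 1, "name": "a", "amount": 10},
    {"id": 2, "name": "b", "amount": 20},
    {"id": 3, "name": "c", "amount": 30},
]

# Row-oriented: the values of one record are stored together.
row_layout = [tuple(r.values()) for r in rows]

# Column-oriented: the values of one column are stored together.
column_layout = {col: [r[col] for r in rows] for col in rows[0]}

# An analytical query reads a single column list...
total = sum(column_layout["amount"])
# ...while fetching one whole record is a single access in the row layout.
second = row_layout[1]

print(total)   # 60
print(second)  # (2, 'b', 20)
```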

The following image shows column families in a column-oriented database:

HBase and RDBMS


HBase | RDBMS
HBase is schema-less: it has no concept of a fixed column schema and defines only column families. | An RDBMS is governed by its schema, which describes the whole structure of its tables.
It is built for wide tables and is horizontally scalable. | It is thin, built for small tables, and hard to scale.
HBase provides no general (multi-row) transactions. | An RDBMS is transactional.
It holds de-normalized data. | It holds normalized data.
It is good for unstructured and semi-structured as well as structured data. | It is good for structured data.

Features of HBase
• HBase is linearly scalable.
• It has automatic failover support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy-to-use Java API for clients.
• It provides data replication across clusters.

Where to Use HBase


• HBase is used to have random, real-time read/write access to Big Data.
• It hosts very large tables on top of clusters of commodity hardware.
• HBase is a non-relational database. HBase works on top of Hadoop and HDFS.

Applications of HBase
• It is used whenever there is a need for write-heavy applications.
• HBase is used whenever we need to provide fast random access to available data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

HBase History
Year | Event
Nov 2006 | Google released the paper on Bigtable.
Feb 2007 | Initial HBase prototype was created as a Hadoop contribution.
Oct 2007 | The first usable HBase, along with Hadoop 0.15.0, was released.
Jan 2008 | HBase became a subproject of Hadoop.
Oct 2008 | HBase 0.18.1 was released.
Jan 2009 | HBase 0.19.0 was released.
Sept 2009 | HBase 0.20.0 was released.
May 2010 | HBase became an Apache top-level project.

HBase Architecture


In HBase, tables are split into regions and are served by the region servers. Regions are
vertically divided by column families into “Stores”. Stores are saved as files in HDFS. Shown
below is the architecture of HBase.
Note: The term ‘store’ is used for regions to explain the storage structure.

HBase has three major components: the client library, a master server, and region servers.
Region servers can be added or removed as per requirement.

Master Server
The master server -
• Assigns regions to the region servers and takes the help of ZooKeeper for this task.
• Handles load balancing of the regions across region servers. It unloads the busy
servers and shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations such as creation of
tables and column families.
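
The load-balancing duty can be illustrated with a simple count-based sketch. This is not HBase's actual balancer, whose policy is more involved; the function name `balance`, the server names, and the rule "even out region counts" are assumptions made for the example.

```python
def balance(assignments):
    """Move regions from the busiest server to the least busy one until
    their region counts differ by at most one."""
    # Copy so the caller's mapping is left untouched.
    assignments = {server: list(regions) for server, regions in assignments.items()}
    while True:
        busiest = max(assignments, key=lambda s: len(assignments[s]))
        idlest = min(assignments, key=lambda s: len(assignments[s]))
        if len(assignments[busiest]) - len(assignments[idlest]) <= 1:
            return assignments
        # Unload one region from the busy server onto the less occupied one.
        assignments[idlest].append(assignments[busiest].pop())


before = {"rs1": ["r1", "r2", "r3", "r4", "r5"], "rs2": ["r6"], "rs3": []}
after = balance(before)
print({server: len(regions) for server, regions in after.items()})
# {'rs1': 2, 'rs2': 2, 'rs3': 2}
```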

Regions
Regions are nothing but tables that are split up and spread across the region servers.
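
Because a table is sorted by row key, each region can be thought of as a contiguous key range, and finding the region for a row reduces to a search over the range start keys. The sketch below assumes four hypothetical regions and made-up server names; it shows the idea only and is not the HBase lookup mechanism.

```python
import bisect

# Each region covers [start key, next start key); "" means "from the beginning".
region_start_keys = ["", "g", "n", "t"]
region_servers = ["rs1", "rs2", "rs1", "rs3"]  # hypothetical server names


def locate_region(row_key):
    """Return (region index, server) for the region whose range holds row_key."""
    i = bisect.bisect_right(region_start_keys, row_key) - 1
    return i, region_servers[i]


print(locate_region("apple"))  # (0, 'rs1')
print(locate_region("g"))      # (1, 'rs2')
print(locate_region("zebra"))  # (3, 'rs3')
```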
Region Server
The region servers host regions and -
• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under them.
• Decide the size of each region by following the region size thresholds.
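
Threshold-based sizing can be sketched as a split that happens once a region grows past a limit. The example below counts rows and splits at the middle key purely for illustration; the function name `maybe_split` and the row-count threshold are assumptions, and a real system would more plausibly measure stored size.

```python
def maybe_split(region_keys, max_rows):
    """Split a sorted list of row keys into two halves once it exceeds max_rows."""
    if len(region_keys) <= max_rows:
        return [region_keys]
    mid = len(region_keys) // 2
    return [region_keys[:mid], region_keys[mid:]]


print(maybe_split(["a", "b", "c"], max_rows=4))            # [['a', 'b', 'c']]
print(maybe_split(["a", "b", "c", "d", "e"], max_rows=4))  # [['a', 'b'], ['c', 'd', 'e']]
```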
When we take a deeper look into the region server, it contains regions and stores as shown
below:

The store contains a memstore (memory store) and HFiles. The memstore works like a cache:
anything written to HBase is stored there first. Later, the memstore is flushed, and its
data is transferred to HFiles and saved as blocks.
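
The write and read path through a store can be sketched as an in-memory buffer that is periodically flushed to immutable sorted files. This is a toy model under stated assumptions: the class `ToyStore`, its entry-count flush threshold, and the tuple-based "files" are hypothetical and stand in for the real memstore and HFile formats.

```python
class ToyStore:
    """Toy store: an in-memory buffer plus immutable sorted files."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # assumed limit, counted in entries
        self.memstore = {}  # recent writes, held in memory
        self.hfiles = []    # flushed files, oldest first

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffer out as one sorted, immutable file and clear it."""
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        """Check the memstore first, then the files from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for k, v in hfile:
                if k == key:
                    return v
        return None


store = ToyStore(flush_threshold=2)
store.put("b", 1)
store.put("a", 2)   # second entry reaches the threshold and triggers a flush
store.put("a", 3)   # newer value stays in the memstore for now
print(store.hfiles)                    # [(('a', 2), ('b', 1))]
print(store.get("a"), store.get("b"))  # 3 1
```

Reads consult the memstore before the files so that the most recent write wins even when an older value for the same key has already been flushed.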

ZooKeeper
• ZooKeeper is an open-source project that provides services like maintaining
configuration information, naming, providing distributed synchronization, etc.
• ZooKeeper has ephemeral nodes representing different region servers. Master servers
use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures or network
partitions.
• Clients use ZooKeeper to locate region servers and then communicate with them.
• In pseudo-distributed and standalone modes, HBase itself manages ZooKeeper.
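
The role of ephemeral nodes in tracking live servers can be illustrated with a toy simulation. This does not use the ZooKeeper API: the class `ToyCoordinator`, its methods, the session ids, and the `/rs/` path prefix are all hypothetical. The only behavior it models is that a node disappears when the session that created it ends.

```python
class ToyCoordinator:
    """Toy stand-in for a coordination service with ephemeral nodes."""

    def __init__(self):
        self.ephemeral = {}  # node path -> owning session id

    def register(self, session_id, path):
        """Create an ephemeral node tied to the given session."""
        self.ephemeral[path] = session_id

    def expire_session(self, session_id):
        """Drop every node owned by a session that has died or timed out."""
        self.ephemeral = {p: s for p, s in self.ephemeral.items() if s != session_id}

    def live_servers(self, prefix="/rs/"):
        """List the server names that still have a node under the prefix."""
        return sorted(p[len(prefix):] for p in self.ephemeral if p.startswith(prefix))


zk = ToyCoordinator()
zk.register(1, "/rs/server-a")
zk.register(2, "/rs/server-b")
print(zk.live_servers())  # ['server-a', 'server-b']
zk.expire_session(1)      # server-a crashes or is cut off by a partition
print(zk.live_servers())  # ['server-b']
```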
