HBase
HBase is a data model designed to provide quick random access to huge amounts of data.
Since the 1970s, the RDBMS has been the standard solution for data storage and maintenance
problems. After the advent of big data, companies realized the benefit of processing it and
started opting for solutions like Hadoop.
What is HBase?
HBase is a distributed, column-oriented database built on top of the Hadoop Distributed File
System (HDFS). It is an open-source project and is horizontally scalable. It leverages the
fault tolerance provided by HDFS.
It is a part of the Hadoop ecosystem and provides random, real-time read/write access to data
stored in HDFS.
Data can be stored in HDFS either directly or through HBase. Data consumers read and access
the data in HDFS randomly using HBase, which sits on top of HDFS and provides read and
write access.
HDFS | HBase
HDFS is a distributed file system suitable for storing large files. | HBase is a database built on top of HDFS.
HDFS does not support fast individual record lookups. | HBase provides fast lookups for larger tables.
It provides high-latency processing. | It provides low-latency access to single rows from billions of records (random access).
It provides only sequential access to data. | HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
Column-oriented databases store data tables as sections of columns of data, rather than as
rows of data. In short, they have column families.
Row-oriented databases are designed for a small number of rows and columns, whereas
column-oriented databases are designed for huge tables.
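The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration of the storage idea, not of how HBase lays out bytes on disk; the sample records and the function names to_row_layout and to_column_layout are made up for the example.

```python
# Three records of a small table, used only for illustration.
rows = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Bob", "city": "Rome"},
    {"id": 3, "name": "Cy", "city": "Lima"},
]


def to_row_layout(records):
    """Row-oriented: all values of one record are stored together."""
    return [tuple(record.values()) for record in records]


def to_column_layout(records):
    """Column-oriented: all values of one column are stored together."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


row_layout = to_row_layout(rows)
column_layout = to_column_layout(rows)

# Reading one column touches a single list in the column layout,
# but every record in the row layout.
cities_from_columns = column_layout["city"]
cities_from_rows = [record[2] for record in row_layout]
```

Both reads return the same values; the column layout simply avoids walking over the fields that are not needed.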
HBase | RDBMS
HBase is schema-less; it has no concept of a fixed column schema and defines only column families. | An RDBMS is governed by its schema, which describes the whole structure of its tables.
It is built for wide tables and is horizontally scalable. | It is thin and built for small tables, and it is hard to scale.
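To make "defines only column families" concrete, here is a minimal in-memory sketch in Python. ToyTable is a hypothetical class written for this example and is not part of any HBase client library; it only shows that the families are fixed when the table is created, while the columns inside a family can differ from row to row.

```python
class ToyTable:
    """Hypothetical in-memory table: families are fixed, columns are not."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)
        self.rows = {}  # row key -> family -> column qualifier -> value

    def put(self, row_key, family, qualifier, value):
        # Only the column family is checked; any qualifier is accepted.
        if family not in self.families:
            raise ValueError(f"unknown column family: {family}")
        family_data = self.rows.setdefault(row_key, {}).setdefault(family, {})
        family_data[qualifier] = value

    def get(self, row_key, family, qualifier):
        # Missing rows, families, or columns all yield None.
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)


table = ToyTable("users", families=["info", "stats"])
table.put("row1", "info", "name", "Ann")
table.put("row2", "info", "email", "bob@example.com")  # a column row1 lacks
table.put("row1", "stats", "logins", 7)
```

Writing to a family that was not declared raises an error, whereas a new column inside an existing family needs no schema change.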
Applications of HBase
• It is used whenever there is a need for write-heavy applications.
• HBase is used whenever we need to provide fast random access to available data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
HBase History
Year Event
Oct 2007 The first usable HBase along with Hadoop 0.15.0 was released.
HBase has three major components: the client library, a master server, and region servers.
Region servers can be added or removed as per requirement.
Master server
The master server -
• Assigns regions to the region servers with the help of ZooKeeper.
• Handles load balancing of the regions across region servers. It unloads the busy
servers and shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations such as creation of
tables and column families.
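The load-balancing duty in the list above can be illustrated with a deliberately simple sketch. It assumes that load is measured only by the number of regions per server, which is a simplification; the function name balance and the server and region names are hypothetical, and the real balancer weighs more than region counts.

```python
def balance(assignment):
    """Move regions from the busiest server to the least busy one
    until region counts differ by at most one.

    `assignment` maps a server name to the list of regions it hosts.
    The input is not modified; a new mapping is returned.
    """
    servers = {name: list(regions) for name, regions in assignment.items()}
    if not servers:
        return servers
    while True:
        busiest = max(servers, key=lambda s: len(servers[s]))
        idlest = min(servers, key=lambda s: len(servers[s]))
        if len(servers[busiest]) - len(servers[idlest]) <= 1:
            return servers
        # Unload one region from the busy server onto the idle one.
        servers[idlest].append(servers[busiest].pop())


before = {"rs1": ["r1", "r2", "r3", "r4", "r5"], "rs2": ["r6"], "rs3": []}
after = balance(before)
```

With six regions and three servers, every server ends up hosting two regions, and no region is lost or duplicated along the way.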
Regions
Regions are the pieces into which tables are split up and spread across the region servers.
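As a rough sketch of how a split table can be addressed, the Python below assumes that each region covers a contiguous range of row keys, from its start key (inclusive) up to the next region's start key (exclusive). The start keys, server names, and the function locate_region are illustrative assumptions, not values taken from a real cluster.

```python
import bisect

# Hypothetical regions of one table, each identified by its start row key.
# The empty string marks the first region, which covers the lowest keys.
region_start_keys = ["", "g", "n", "t"]
region_servers = ["rs1", "rs2", "rs1", "rs3"]  # server hosting each region


def locate_region(row_key):
    """Return (region index, server) for the region whose range holds row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position owns the key.
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return index, region_servers[index]
```

Because the start keys are kept sorted, finding the region for a key is a binary search rather than a scan over all regions.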
Region server
The region servers host the regions, and each region is made up of one or more stores.
The store contains a memstore (memory store) and HFiles. The memstore works like a cache:
anything that is written to HBase is stored there first. Later, the data is transferred
and saved in HFiles as blocks, and the memstore is flushed.
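The write path described above can be sketched as a small Python class. ToyStore is hypothetical, and it flushes after a fixed number of entries purely to keep the example short; that threshold is an assumption of this sketch, not an HBase setting.

```python
class ToyStore:
    """Hypothetical store: one in-memory memstore plus immutable flushed files."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # assumed entry-count limit
        self.memstore = {}
        self.hfiles = []  # oldest first; each is a sorted list of (key, value)

    def put(self, key, value):
        # Every write lands in the memstore first.
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the memstore contents as a new sorted file, then empty it.
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        # Check the memstore first, then the files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for stored_key, value in hfile:
                if stored_key == key:
                    return value
        return None


store = ToyStore(flush_threshold=3)
for key, value in [("b", 1), ("a", 2), ("c", 3), ("a", 9)]:
    store.put(key, value)
```

After the third write the memstore is flushed into one sorted file; the fourth write for key "a" stays in the memstore and shadows the older flushed value when read.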
ZooKeeper
• ZooKeeper is an open-source project that provides services such as maintaining
configuration information, naming, and distributed synchronization.
• ZooKeeper has ephemeral nodes representing the different region servers. Master servers
use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures and network
partitions.
• Clients use ZooKeeper to locate region servers and then communicate with them.
• In pseudo-distributed and standalone modes, HBase itself takes care of ZooKeeper.
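The role of ephemeral nodes in discovering servers and tracking failures can be imitated with a small in-memory model. ToyCoordinator below is a hypothetical stand-in written for this example and does not use the ZooKeeper client API; the node paths and session names are illustrative only.

```python
class ToyCoordinator:
    """Hypothetical stand-in for a coordination service with ephemeral nodes."""

    def __init__(self):
        self.ephemeral = {}  # node path -> id of the session that created it

    def register(self, session_id, path):
        # A region server creates its node when it starts up.
        self.ephemeral[path] = session_id

    def expire_session(self, session_id):
        # Ephemeral nodes vanish when the session that created them ends.
        self.ephemeral = {
            path: owner
            for path, owner in self.ephemeral.items()
            if owner != session_id
        }

    def children(self, prefix):
        # What a master would list to see which servers are available.
        return sorted(p for p in self.ephemeral if p.startswith(prefix + "/"))


zk = ToyCoordinator()
zk.register("session-1", "/hbase/rs/server-a")
zk.register("session-2", "/hbase/rs/server-b")
live_before = zk.children("/hbase/rs")
zk.expire_session("session-2")  # server-b crashes or is partitioned away
live_after = zk.children("/hbase/rs")
```

Listing the children before and after the session expires shows server-b dropping out of the set of available servers without anyone deleting its node explicitly.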