UNIT 3
1. Introduction
A column-oriented database (also known as a column-family database) is a type of
NoSQL database that stores data in columns rather than rows. This design is optimized for
analytics, fast read performance, and scalability.
Key Features
Stores data by columns instead of rows.
Optimized for analytical queries (e.g., OLAP workloads).
Efficient compression and faster retrieval of specific columns.
Used in big data applications, data warehouses, and real-time analytics.
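The difference between the two layouts can be sketched in plain Python (an illustration, not HBase code): storing the same table row-wise and column-wise shows why an analytical query that touches one column reads less data in a columnar layout.

```python
# The same small table in two layouts.
rows = [
    {"id": 1, "name": "Alice", "salary": 50000},
    {"id": 2, "name": "Bob",   "salary": 60000},
    {"id": 3, "name": "Carol", "salary": 70000},
]

# Row-oriented: each record is stored together.
row_store = rows

# Column-oriented: each column is stored together.
col_store = {
    "id":     [r["id"] for r in rows],
    "name":   [r["name"] for r in rows],
    "salary": [r["salary"] for r in rows],
}

# An aggregate over one column scans only that column in the columnar
# layout, instead of loading every full record.
avg_salary = sum(col_store["salary"]) / len(col_store["salary"])
print(avg_salary)  # 60000.0
```

In the row layout the same average would require touching every field of every record, which is why OLAP workloads favor columnar storage.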
A table in HBase is split into regions, which are served by the region servers. Regions are vertically divided by column families into "stores," and stores are saved as files in HDFS. HBase runs on top of HDFS (Hadoop Distributed File System).
HBase architecture consists mainly of five components:
HMaster
HRegionserver
HRegions
Zookeeper
HDFS
Below is a detailed architecture of HBase with components:
HMaster
The client communicates bi-directionally with both HMaster and ZooKeeper, but for read and write operations it contacts the HRegion servers directly. HMaster assigns regions to region servers and, in turn, checks the health status of the region servers.
The architecture contains multiple region servers. Each region server holds an HLog, which stores all the write-ahead log files.
HBase Region Servers
When an HBase Region Server receives write and read requests from a client, it assigns the request to the specific region where the actual column family resides. The client can contact HRegion servers directly; no permission from HMaster is required for this communication. The client needs HMaster's help only for operations related to metadata and schema changes.
HMaster coordinates multiple HRegion servers; each region server, in turn, performs the following functions:
Hosting and managing regions
Splitting regions automatically
Handling read and writes requests
Communicating with the client directly
HBase Regions
HRegions are the basic building blocks of an HBase cluster; they hold the distributed portions of the tables and are composed of column families. Each region contains multiple stores, one for each column family, and each store consists of two main components: the MemStore and HFiles.
ZooKeeper
HBase ZooKeeper is a centralized monitoring service that maintains configuration information and provides distributed synchronization, i.e., coordination services between the nodes of the distributed applications running across the cluster. If a client wants to communicate with regions, it has to approach ZooKeeper first.
Both the master and the HBase slave nodes (region servers) register themselves with ZooKeeper. The client needs access to the ZooKeeper (ZK) quorum configuration to connect with the master and region servers.
When nodes in the HBase cluster fail, the ZooKeeper quorum raises error notifications, and recovery of the failed nodes is initiated.
HDFS
HDFS (Hadoop Distributed File System), as the name implies, provides a distributed environment for storage; it is a file system designed to run on commodity hardware. It stores each file in multiple blocks, and to maintain fault tolerance, the blocks are replicated across the Hadoop cluster.
HDFS provides a high degree of fault tolerance and runs on cheap commodity hardware. The cluster scales by adding nodes, with processing and storage spread across this inexpensive hardware, giving better results than a single existing machine.
By default, the data stored in each block is replicated to 3 nodes, so if any node goes down there is no loss of data and a proper backup-recovery mechanism is in place. HDFS interacts with the HBase components and stores a large amount of data in a distributed manner.
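The block-splitting and 3-way replication idea can be sketched in a few lines of Python (a hypothetical illustration, not real HDFS code; the block size, node names, and round-robin placement are assumptions for the example):

```python
BLOCK_SIZE = 4    # bytes per block (tiny, for illustration; HDFS uses 128MB)
REPLICATION = 3

nodes = ["node1", "node2", "node3", "node4", "node5"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int):
    """Assign each block to REPLICATION distinct nodes (simple round-robin)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
    return placement

blocks = split_into_blocks(b"hello hdfs world")
placement = place_replicas(len(blocks))
# Each block survives the loss of any single node because it has 3 copies.
```

Real HDFS placement is rack-aware rather than round-robin, but the principle is the same: no single node failure can lose a block.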
The HBase Data Model is a set of components consisting of Tables, Rows, Column families, Cells, Columns, and Versions. HBase tables contain column families and rows, with one element defined as the primary key. A column in an HBase table represents an attribute of the stored object.
The HBase Data Model consists of the following elements:
Set of tables
Each table with column families and rows
Each table must have an element defined as the primary key
The row key acts as the primary key in HBase
Any access to HBase tables uses this primary key
Each column present in HBase denotes an attribute of the corresponding object
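The elements above can be modeled with nested Python dicts (an illustration of the logical data model only, not HBase internals): a table maps row key → column family → column qualifier → a set of timestamped versions.

```python
import time

table = {}  # an HBase-style table: row key -> family -> qualifier -> versions

def put(table, row_key, family, qualifier, value, ts=None):
    """Write a cell; each write creates a new timestamped version."""
    ts = ts if ts is not None else int(time.time() * 1000)
    cell = (table.setdefault(row_key, {})
                 .setdefault(family, {})
                 .setdefault(qualifier, {}))
    cell[ts] = value

def get_latest(table, row_key, family, qualifier):
    """Read a cell; the version with the newest timestamp wins."""
    versions = table[row_key][family][qualifier]
    return versions[max(versions)]

put(table, "user1", "info", "name", "John", ts=1)
put(table, "user1", "info", "name", "Johnny", ts=2)  # newer version
print(get_latest(table, "user1", "info", "name"))    # Johnny
```

Note how the row key is the only index: every access path starts from it, which is exactly the "row key acts as the primary key" rule listed above.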
HBase Use Cases
The following are examples of HBase use cases, with an explanation of the solution HBase provides to various technical problems.
Problem Statement:
The Telecom industry faces the following technical challenges:
Storing billions of CDR (Call Detail Record) log records generated by the telecom domain
Providing real-time access to CDR logs and billing information of customers
Providing a cost-effective solution compared to traditional database systems
Solution:
HBase is used to store billions of rows of detailed call records. If 20TB of data is added per month to an existing RDBMS database, performance will deteriorate. To handle such a large amount of data, HBase is the best solution: it performs fast querying and displays records.

Problem Statement:
The Banking industry generates millions of records on a daily basis. In addition, the banking industry also needs an analytics solution that can detect fraud in money transactions.
Solution:
To store, process, and update vast volumes of data and perform analytics, an ideal solution is HBase integrated with several Hadoop ecosystem components.
HBase is a column-oriented database and data is stored in tables. The tables are sorted by row key. Each row key maps to the collection of column families that are present in the table.
The column families present in the schema are key-value pairs. Observed in detail, each column family has multiple columns, and the column values are stored on disk. Each cell of the table has its own metadata, such as a timestamp and other information.
Column-oriented database: the amount of data that can be stored in this model is very huge, in terms of petabytes.
Row-oriented database: it is designed for a small number of rows and columns.
Step 1) The client wants to write data; it first communicates with the Region server and then with the regions.
Step 2) The region contacts the MemStore associated with the column family for storing the data.
Step 3) The data is first stored in the MemStore, where it is sorted by row key, and after that it is flushed into an HFile. The MemStore resides in the Region server's main memory, while HFiles are written into HDFS.
Step 4) The client wants to read data from the regions.
Step 5) The client can access the MemStore directly and request the data there.
Step 6) Otherwise, the client approaches the HFiles to get the data; the data is fetched and returned to the client.
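The write and read path above can be sketched in Python (a minimal illustration of the idea, not HBase internals; the flush threshold and in-memory "HFiles" are assumptions for the example): writes land in a sorted in-memory MemStore, which is flushed as an immutable sorted file when it grows past a threshold, and reads check the MemStore before the flushed files.

```python
FLUSH_THRESHOLD = 3

memstore = {}   # row key -> value, held in memory
hfiles = []     # flushed, immutable lists of (row_key, value), sorted

def write(row_key, value):
    """Step 1-3: write into the MemStore; flush to an 'HFile' when full."""
    memstore[row_key] = value
    if len(memstore) >= FLUSH_THRESHOLD:
        flush()

def flush():
    """Sort by row key and emit an immutable file, then reset the MemStore."""
    global memstore
    hfiles.append(sorted(memstore.items()))
    memstore = {}

def read(row_key):
    """Step 4-6: check the MemStore first, then newest flushed files."""
    if row_key in memstore:
        return memstore[row_key]
    for hfile in reversed(hfiles):
        for k, v in hfile:
            if k == row_key:
                return v
    return None

write("r3", "c")
write("r1", "a")
write("r2", "b")   # third write triggers a flush of 3 sorted entries
write("r4", "d")   # still sitting in the MemStore
```

Because each flushed file is already sorted by row key, real HFiles can be binary-searched and merged efficiently; the linear scan here is only for brevity.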
The MemStore holds in-memory modifications to the store. The hierarchy of objects in HBase regions, from top to bottom, is as follows:
Table: the HBase table present in the HBase cluster
Region: the HRegions for the presented tables
Store: one store per column family, for each region of the table
MemStore: one MemStore for each store, for each region of the table; it sorts data before flushing into HFiles, and write and read performance increase because of this sorting
StoreFile: StoreFiles for each store, for each region of the table
Block: blocks present inside StoreFiles
HBase vs. HDFS:
HBase: low-latency operations; random reads and writes; accessed through shell commands, client API in Java, REST, Avro, or Thrift; both storage and processing can be performed.
HDFS: high-latency operations; write once, read many times; primarily accessed through MapReduce (MR) jobs; used only for storage.
Some typical IT industrial applications use HBase along with Hadoop. Applications include stock exchange data and online banking data operations, for which HBase is a well-suited processing solution.
Summary
HBase architecture components: HMaster, HRegion Server, HRegions, ZooKeeper,
HDFS
HMaster in HBase is the implementation of a Master server in HBase architecture.
When an HBase Region Server receives write and read requests from a client, it
assigns the request to the specific region where the actual column family resides.
HRegions are the basic building blocks of an HBase cluster; they hold the
distributed portions of the tables and are composed of column families.
HBase ZooKeeper is a centralized monitoring service that maintains configuration
information and provides distributed synchronization.
HDFS provides a high degree of fault tolerance and runs on cheap commodity
hardware.
HBase Data Model is a set of components that consists of Tables, Rows, Column
families, Cells, Columns, and Versions.
Column and Row-oriented storages differ in their storage mechanism.
2. Document Databases: Key Characteristics
Schema-less – No predefined schema, supports dynamic fields.
Hierarchical Data Model – Stores complex, nested structures in a single document.
Efficient Querying – Supports indexing, filtering, and full-text search.
Horizontal Scalability – Uses sharding and replication for distribution.
1. Documents
Core storage unit, typically in JSON, BSON, or XML format.
Stores key-value pairs, arrays, and nested objects.
Example (JSON Document in MongoDB):
{
  "_id": "12345",
  "name": "John Doe",
  "email": "[email protected]",
  "orders": [
    { "product": "Laptop", "price": 1000 },
    { "product": "Mouse", "price": 50 }
  ]
}
2. Collections
Group of related documents, similar to a table in relational databases.
Documents in a collection do not need to have the same structure.
3. Indexing Mechanism
Uses B-Tree and Hash Indexes for fast data retrieval.
Supports compound, geospatial, and text indexes.
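A hash index of the kind mentioned above can be sketched in plain Python (an illustration only, not a real database engine): the index maps a field's value to the ids of the documents holding that value, so equality lookups avoid scanning every document.

```python
docs = {
    "d1": {"name": "John", "city": "Oslo"},
    "d2": {"name": "Ann",  "city": "Oslo"},
    "d3": {"name": "Bob",  "city": "Bergen"},
}

def build_hash_index(docs, field):
    """Map each distinct value of `field` to the ids of matching documents."""
    index = {}
    for doc_id, doc in docs.items():
        index.setdefault(doc[field], []).append(doc_id)
    return index

city_index = build_hash_index(docs, "city")
print(city_index["Oslo"])  # ['d1', 'd2']
```

A B-Tree index stores the same mapping in sorted order, which additionally supports range queries (e.g., all values between two bounds) that a hash index cannot serve.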
4. Storage Engine
MongoDB (WiredTiger, MMAPv1) – Optimized for concurrent reads/writes.
CouchDB (Append-only B+ Tree) – Ensures durability using Multi-Version
Concurrency Control (MVCC).
5. Replication & Sharding
Replication – Copies data across multiple nodes for high availability.
Sharding – Distributes documents across servers based on a shard key.
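Hash-based shard-key routing can be sketched as follows (a simplified illustration; the shard count, the use of MD5, and the modulo placement are assumptions for the example, not any particular database's algorithm):

```python
import hashlib

NUM_SHARDS = 4

def shard_for(shard_key: str) -> int:
    """Deterministically map a shard key to one of NUM_SHARDS servers."""
    digest = hashlib.md5(shard_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

doc = {"_id": "12345", "name": "John Doe"}
print(shard_for(doc["_id"]))  # a stable shard number in 0..3
```

Because the mapping is deterministic, every router sends reads and writes for the same document to the same shard, while different keys spread evenly across servers.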
Key/Value stores are a type of NoSQL database that store data as a collection of
key-value pairs.
They provide fast, scalable, and efficient access to data, making them ideal for
caching, real-time applications, and session storage.
Memcached and Redis are two of the most widely used in-memory key/value
stores.
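The caching and session-storage use case can be sketched with a Memcached-style in-memory cache supporting expiry (an illustration only; real clients talk to a memcached or Redis server over the network, and the class and method names here are assumptions):

```python
import time

class SimpleCache:
    """A tiny in-memory key/value cache with optional per-key TTL."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires_at = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires_at)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]  # lazy eviction: expired entries die on read
            return None
        return value

cache = SimpleCache()
cache.set("session:42", {"user": "john"}, ttl=30)
print(cache.get("session:42"))  # {'user': 'john'}
```

Lazy eviction on read is one of the strategies real caches combine with periodic background expiry; it keeps writes cheap at the cost of stale entries lingering until touched.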
2. Memcached vs Redis: Overview
4. Redis Internals
a) Architecture
In-memory key-value store with support for complex data structures (Strings,
Lists, Sets, Hashes, Sorted Sets).
Supports persistence via RDB (snapshot) and AOF (append-only file).
Master-slave replication for high availability.
Pub/Sub messaging for real-time applications.
b) How Redis Works
1. Data is stored in RAM for fast access.
2. If persistence is enabled, Redis periodically saves snapshots to disk.
3. Redis supports automatic failover using Redis Sentinel.
4. Supports Lua scripting, transactions, and cluster mode for scaling.
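The AOF persistence idea from step 2 can be sketched in Python (an illustration of the principle, not Redis's actual file format; the `SET` line format here is an assumption): every write is appended to a log, and replaying the log rebuilds the in-memory state after a restart.

```python
import io

aof = io.StringIO()  # stands in for the append-only file on disk

def set_key(store, key, value):
    """Apply a write to memory and append it to the log."""
    store[key] = value
    aof.write(f"SET {key} {value}\n")

def replay(log_text):
    """Rebuild the in-memory store by re-applying every logged write."""
    store = {}
    for line in log_text.splitlines():
        cmd, key, value = line.split(" ", 2)
        if cmd == "SET":
            store[key] = value
    return store

store = {}
set_key(store, "counter", "1")
set_key(store, "counter", "2")
set_key(store, "name", "redis")

recovered = replay(aof.getvalue())  # simulate a restart followed by replay
print(recovered == store)  # True
```

This is why AOF gives stronger durability than RDB snapshots: the log captures every write, whereas a snapshot only captures state at the moment it was taken.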
c) Advantages of Redis Over Memcached
Persistent storage (RDB, AOF).
Rich data types (Lists, Sets, Hashes, etc.).
Replication & clustering for high availability.
Atomic operations & transactions support.
Use Cases
Example