Architecture of HBase

Last Updated : 8 Dec, 2025

HBase is a distributed, scalable NoSQL database built on top of Hadoop's HDFS. It is designed to store huge amounts of structured or semi-structured data and provide fast, random read/write access. To achieve this, its architecture relies on three main components: HMaster, Region Server, and ZooKeeper.

HMaster

The HMaster acts as the main coordinator of the HBase cluster.
Think of it as the manager that oversees how data is distributed and how the cluster functions.

Key Roles of HMaster

  • Assigns regions (data chunks) to Region Servers
  • Manages table operations like create, delete, and modify
  • Monitors the health of Region Servers
  • Balances load across servers
  • Handles failover when a server crashes

For high availability, a cluster can run one or more backup HMasters. Only one HMaster is active at a time, and a backup takes over if the active one fails.
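The assignment and failover roles listed above can be illustrated with a small toy model. This is only a sketch: the function names, the server names (`rs-a`, `rs-b`, `rs-c`) and the round-robin and least-loaded policies are assumptions made for illustration, not HBase's actual balancer.

```python
from collections import defaultdict

def assign_regions(regions, servers):
    """Spread regions over servers round-robin (toy stand-in for a balancer)."""
    assignment = defaultdict(list)
    for i, region in enumerate(regions):
        assignment[servers[i % len(servers)]].append(region)
    return dict(assignment)

def handle_server_failure(assignment, dead_server):
    """Reassign the dead server's regions to the least-loaded survivors.

    Assumes at least one server survives.
    """
    orphaned = assignment.pop(dead_server, [])
    for region in orphaned:
        target = min(assignment, key=lambda s: len(assignment[s]))
        assignment[target].append(region)
    return assignment

regions = ["r1", "r2", "r3", "r4", "r5", "r6"]
assignment = assign_regions(regions, ["rs-a", "rs-b", "rs-c"])
assignment = handle_server_failure(assignment, "rs-b")
print(assignment)
```

After the simulated crash of `rs-b`, its two regions are split between the two surviving servers, so every region is still served and the load stays even.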

Region Server

HBase tables are very large, so they are divided horizontally into smaller parts called Regions. A Region Server is responsible for managing these regions.

What Region Servers Do

  • Store and manage regions, each containing data for a specific row-key range
  • Handle read and write requests from clients
  • Store data by column family; within a region, each column family is kept in its own set of files
  • Run on top of HDFS DataNodes, making use of Hadoop’s storage

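A region is split in two once it grows past a configured maximum size (hbase.hregion.max.filesize, 10 GB by default in current releases; older releases used 256 MB), so new regions are created automatically as the table grows.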
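Because each region covers a contiguous row-key range, finding the region for a row is a search over sorted start keys. The following is a minimal sketch of that idea; the region boundaries and server names are made up, and it compares Python strings where HBase compares raw bytes.

```python
import bisect

# Hypothetical regions as (start_key, server). Each region covers
# [start_key, next region's start_key); "" means "from the first row".
REGIONS = [("", "rs-a"), ("g", "rs-b"), ("n", "rs-c"), ("t", "rs-a")]
START_KEYS = [start for start, _ in REGIONS]

def locate_region(row_key):
    """Return (start_key, server) of the region whose range contains row_key."""
    idx = bisect.bisect_right(START_KEYS, row_key) - 1
    return REGIONS[idx]

print(locate_region("apple"))
print(locate_region("mango"))
print(locate_region("zebra"))
```

Rows that sort next to each other land in the same region, which is why row-key design has such a large effect on how evenly load is spread.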

ZooKeeper

ZooKeeper works like a traffic controller for the HBase cluster.

ZooKeeper Responsibilities

  • Helps clients find which Region Server holds which data, by tracking where the hbase:meta catalog table is hosted
  • Monitors server failures and helps in quick recovery
  • Maintains cluster configuration
  • Provides distributed synchronization

Without ZooKeeper, coordination between HMaster, Region Servers, and clients would not be possible.
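Failure monitoring can be pictured as session tracking: each Region Server keeps a session alive, and when the session expires the server is treated as dead. The class below is a toy model of that idea only. `ToyCoordinator`, its methods and the timeout value are hypothetical, and it does not use the ZooKeeper API.

```python
class ToyCoordinator:
    """Tracks live servers by heartbeat, loosely like ephemeral znodes."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_heartbeat = {}  # server name -> time of last heartbeat

    def heartbeat(self, server, now):
        self.last_heartbeat[server] = now

    def expire_sessions(self, now):
        """Drop servers whose session timed out and return their names."""
        dead = [s for s, t in self.last_heartbeat.items()
                if now - t > self.session_timeout]
        for server in dead:
            del self.last_heartbeat[server]
        return dead

    def live_servers(self):
        return sorted(self.last_heartbeat)

zk = ToyCoordinator(session_timeout=30)
zk.heartbeat("rs-a", now=0)
zk.heartbeat("rs-b", now=0)
zk.heartbeat("rs-a", now=25)       # rs-a keeps its session alive
dead = zk.expire_sessions(now=40)  # rs-b has been silent for 40 time units
print(dead, zk.live_servers())
```

In a real cluster, the HMaster reacts to such an expiry by reassigning the dead server's regions.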

How HBase Works

How Data is Written in HBase (Write Path)

Flow: Client → Region Server → WAL → MemStore → HFile

When you write data to HBase, here’s what actually happens:

1. The client sends a write request

Just like sending a message to a server saying, “Please save this data.”

2. Region Server writes to WAL (Write Ahead Log)

WAL is like a safety notebook.

Before HBase stores data in memory, it writes a copy to WAL so that nothing gets lost if the server crashes.
Think of WAL as saving a draft before writing the final version.

3. Data goes into MemStore (memory buffer)

This is a temporary holding area in RAM.

MemStore collects recent writes, making the system very fast because writing to memory is much quicker than writing to disk.

4. When MemStore becomes full, data is flushed to disk

Once the MemStore reaches a configured size (hbase.hregion.memstore.flush.size, 128 MB by default), HBase writes its contents to disk as a new, sorted HFile in HDFS.

This is like moving items from your desk (fast access) into a file cabinet (permanent storage).
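Steps 1 to 4 can be sketched as a small in-memory model. This is a simplification under stated assumptions: `ToyRegionStore` is a hypothetical name, the flush threshold counts cells rather than bytes, and clearing the log after a flush stands in for the way HBase retires WAL files whose edits have all been flushed.

```python
class ToyRegionStore:
    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # cells, not bytes (assumption)
        self.wal = []       # append-only log of edits not yet flushed
        self.memstore = {}  # recent writes held in memory
        self.hfiles = []    # immutable sorted files, newest last

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # log first, for crash recovery
        self.memstore[row_key] = value     # then buffer in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the MemStore out as one sorted HFile and start fresh."""
        if not self.memstore:
            return
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.wal.clear()  # flushed edits no longer need to be replayed

store = ToyRegionStore(flush_threshold=3)
for key in ["b", "a", "c"]:
    store.put(key, key.upper())  # the third put triggers a flush
store.put("d", "D")              # stays in the WAL and MemStore for now
print(store.hfiles, store.memstore, store.wal)
```

Note that the flushed file is sorted by row key even though the writes arrived out of order; sorting happens in memory, so the disk only ever sees sequential writes.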

5. Compaction happens in the background

Over time, many small HFiles get created.
HBase merges these smaller files into larger ones, which:

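  • Reduces storage space, because major compactions drop deleted and expired cells
  • Speeds up read operations, because fewer files have to be checked per read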
  • Keeps data organized

This process is called compaction.
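As a rough illustration, compaction can be modelled as merging several sorted files into one while keeping only the newest value for each row. The sketch below assumes each HFile is a sorted list of (row key, value) pairs, ordered oldest file first; it ignores timestamps, versions and delete markers.

```python
def compact(hfiles):
    """Merge sorted HFiles (oldest first) into a single file; newer values win."""
    merged = {}
    for hfile in hfiles:  # oldest to newest, so later files overwrite
        for row_key, value in hfile:
            merged[row_key] = value
    return [sorted(merged.items())]

hfiles = [[("a", 1), ("c", 3)], [("a", 10), ("b", 2)]]
compacted = compact(hfiles)
print(compacted)
```

Two files become one, and the stale value for row `a` disappears, which is where both the space saving and the faster reads come from.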

How Data is Read in HBase (Read Path)

Flow: Client → Region Server → BlockCache → MemStore → HFile

When a client wants to read data, HBase tries to return the answer as fast as possible.

1. Client contacts ZooKeeper

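ZooKeeper tells the client where the hbase:meta catalog table is hosted, and the client reads hbase:meta to find which Region Server holds the data it needs.
The client caches this location, so later requests usually go straight to the right Region Server.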

2. Region Server checks BlockCache (fastest place)

BlockCache is an in-memory cache of recently read HFile blocks (similar to how your phone keeps recently used apps active).

If the requested data is here → instant answer.

3. If not, it checks MemStore

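MemStore may still hold recent writes that have not been flushed to HDFS yet. Because these are the newest versions of the data, the Region Server always takes them into account when building the result.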

4. Finally, it looks into HFile (stored in HDFS)

If the data is not found in cache or MemStore, the Region Server reads it from the actual HFiles stored in HDFS.

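This is the slowest option, but still efficient: HFiles are sorted and indexed, and Bloom filters (when enabled) let the Region Server skip files that cannot contain the requested row.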
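The lookup can be sketched as a function over three toy stores. Several things here are assumptions: the `read` function and its arguments are hypothetical, the cache holds individual rows rather than HFile blocks, and the MemStore is consulted before the cache so that a cached HFile value can never hide a newer unflushed write.

```python
def read(row_key, memstore, block_cache, hfiles):
    """Return (value, source) for row_key, or (None, "miss")."""
    if row_key in memstore:         # newest data, not yet flushed
        return memstore[row_key], "MemStore"
    if row_key in block_cache:      # recently read from an HFile
        return block_cache[row_key], "BlockCache"
    for hfile in reversed(hfiles):  # newest HFile first
        if row_key in hfile:
            block_cache[row_key] = hfile[row_key]  # cache for next time
            return hfile[row_key], "HFile"
    return None, "miss"

memstore = {"row3": "c"}
block_cache = {}
hfiles = [{"row1": "old"}, {"row1": "new", "row2": "b"}]  # oldest first

first = read("row1", memstore, block_cache, hfiles)
second = read("row1", memstore, block_cache, hfiles)
print(first, second)
```

The first read of `row1` has to go to the files, while the second is answered from the cache without touching them.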

Why are reads fast?

Because most of the time:

  • Recently used data is in BlockCache
  • Recently written data is in MemStore

So HBase often returns results without touching the disk.

Advantages of HBase

  • Handles massive datasets easily
  • Scales horizontally just by adding more machines
  • Cost-effective for storing terabytes to petabytes of data on commodity hardware
  • High availability due to replication and failover
  • Suitable for real-time read/write workloads

Disadvantages of HBase

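  • No native SQL support (NoSQL model); a SQL layer such as Apache Phoenix has to be added separately
  • No full ACID transactions; atomicity is guaranteed only for operations on a single row
  • Rows are sorted only by row key, and there are no built-in secondary indexes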
  • Requires careful memory management in large clusters
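The row-key limitation is easier to see with a small example. In the sketch below, the `user#date` key layout and the sample rows are invented for illustration. A range scan on the row key uses binary search, while a lookup by value has to examine every row.

```python
import bisect

ROWS = sorted([
    ("user1#2024-01-05", 120),
    ("user1#2024-02-11", 80),
    ("user2#2024-01-09", 300),
    ("user3#2024-03-01", 80),
])
KEYS = [key for key, _ in ROWS]

def scan(start_row, stop_row):
    """Rows with start_row <= key < stop_row, located by binary search."""
    lo = bisect.bisect_left(KEYS, start_row)
    hi = bisect.bisect_left(KEYS, stop_row)
    return ROWS[lo:hi]

def find_by_value(value):
    """There is no index on values, so every row must be examined."""
    return [(key, v) for key, v in ROWS if v == value]

user1_rows = scan("user1#", "user1$")  # "$" is the character right after "#"
print(user1_rows)
print(find_by_value(80))
```

This is why HBase schemas usually put the most common lookup criteria into the row key itself.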

HBase vs HDFS

Feature        | HBase                           | HDFS
---------------|---------------------------------|--------------------------------
Access Pattern | Low-latency reads/writes        | High-latency, batch processing
Data Access    | Random read/write               | Write once, read many
APIs           | Shell, Java, REST, Thrift, Avro | Mostly MapReduce
Use Case       | Real-time data                  | Large file storage & batch jobs

Key Features of HBase Architecture

Distributed & Scalable

HBase can grow across hundreds or thousands of machines, allowing it to store enormous datasets.

Column-oriented Storage

Data is grouped into column families that are stored separately on disk, so a read that needs only some families can skip the files of the others.

Tight Hadoop Integration

Built on HDFS and works seamlessly with MapReduce and other Hadoop tools.

Strong Consistency

By default, each region is served by exactly one Region Server at a time, so a read of a row always sees the latest successful write to that row.

Built-in Caching

Frequently accessed data is cached in memory for faster performance.

Data Compression

Reduces storage usage and speeds up data retrieval.

Flexible Schema

Columns can be added dynamically without redefining the entire table—ideal for evolving data.
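A nested dictionary gives a rough picture of this data model. The sketch assumes the usual HBase rule that column families are fixed when the table is defined while column qualifiers inside a family are free-form; `make_table`, `put` and the sample data are hypothetical and are not the HBase client API.

```python
def make_table(families):
    """A toy table: a fixed set of column families and initially no rows."""
    return {"families": set(families), "rows": {}}

def put(table, row_key, family, qualifier, value):
    """Store one cell; any qualifier is allowed, but the family must exist."""
    if family not in table["families"]:
        raise KeyError(f"unknown column family: {family}")
    row = table["rows"].setdefault(row_key, {})
    row.setdefault(family, {})[qualifier] = value

users = make_table(["info"])
put(users, "u1", "info", "name", "Asha")
put(users, "u2", "info", "name", "Ben")
put(users, "u2", "info", "email", "ben@example.com")  # new column, no schema change
print(users["rows"])
```

Row `u2` gains an `email` column that row `u1` does not have, and nothing about the table definition had to change.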

Real-world Use Case

HBase is popular for workloads that need low-latency reads and writes over very large datasets. For example, a bank could use HBase to record ATM transaction updates in real time, where fast and consistent data operations are crucial.
