HBase
Topics at a glance
⮚Hadoop and its data access limitations.
⮚Why HBase?
⮚HBase and its importance in the Hadoop framework.
⮚History and architecture of HBase.
⮚HBase components and their responsibilities.
⮚HBase data storage model.
⮚Advantages and disadvantages of HBase.
⮚Conclusion of the session
Why HBase?
As the internet evolved, the scope of web applications grew:
⮚ Huge volumes of structured and semi-structured data started being generated.
⮚ Semi-structured data includes emails, JSON, XML, and .csv files.
⮚ Large amounts of semi-structured data were created across the globe.
⮚ Storing and processing this data therefore became a major challenge.
Hadoop and its limitations
⮚Hadoop is built for batch processing, and data is accessed only in a sequential manner.
⮚Accessing data randomly therefore requires a different access method, provided by stores such as:
▪ HBase
▪ Cassandra
▪ CouchDB
▪ Dynamo
▪ MongoDB
HBase
• HBase is an open-source NoSQL database and part of the Hadoop framework.
• It is modeled on Google's Bigtable: HBase is an open-source implementation of the Bigtable design, not a renamed version of it.
• HBase is written primarily in Java and is used for real-time Big Data applications.
• HBase is a distributed, column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS).
• Tables are sorted by row key.
• The table schema defines only column families; each column family holds key-value pairs.
• It uses Write-Ahead Logs (WAL) for log-based storage.
• It supports fast random access and heavy write loads.
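The data model described above can be pictured as a nested, sorted map. The sketch below is a toy in-memory illustration only, not the HBase client API; the class name `ToyTable`, the family names, and the row keys are all made up for the example.

```python
class ToyTable:
    """Illustrative in-memory model of an HBase-style table (not the HBase API)."""

    def __init__(self, column_families):
        # Only column families are fixed up front; qualifiers are free-form.
        self.column_families = set(column_families)
        # row key -> family -> qualifier -> {timestamp: value}
        self.rows = {}

    def put(self, row_key, family, qualifier, value, timestamp):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        versions = (self.rows.setdefault(row_key, {})
                             .setdefault(family, {})
                             .setdefault(qualifier, {}))
        versions[timestamp] = value

    def get(self, row_key, family, qualifier):
        """Return the newest version of a cell, or None if it is absent."""
        versions = self.rows.get(row_key, {}).get(family, {}).get(qualifier)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self):
        """Return row keys in sorted order, the order a scan would visit them."""
        return sorted(self.rows)


table = ToyTable(column_families=["info", "activity"])
table.put("user#002", "info", "name", "Bea", timestamp=1)
table.put("user#001", "info", "name", "Al", timestamp=1)
table.put("user#001", "info", "name", "Alan", timestamp=2)
print(table.scan())
print(table.get("user#001", "info", "name"))
```

Writing to a family that was not declared raises an error, while new qualifiers inside a declared family need no schema change, which mirrors the "schema defines only column families" point.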
How is HBase different from other NoSQL models?
• HBase stores data as key/value pairs in a columnar model, in which columns are grouped together into column families.
• Running HBase on top of Hadoop increases the throughput and performance of a distributed cluster setup.
• It provides faster random read and write operations.
Features of HBase
• Horizontally scalable: capacity grows by adding servers to the cluster, and new columns can be added at any time.
• Automatic failover: data handling switches automatically to a standby system in the event of system compromise or failure.
• Integration with the MapReduce framework: HBase tables can serve as input and output for MapReduce jobs, and HBase is built on top of the Hadoop Distributed File System.
• It doesn't enforce relationships within your data.
• It is designed to run on a cluster of computers built from commodity hardware.
• HBase is built for low-latency operations.
History of HBase
• Nov 2006: Google released the Bigtable paper.
• Feb 2007: The initial HBase prototype was created as a Hadoop contribution.
• Oct 2007: The first usable HBase was released along with Hadoop 0.15.0.
• Jan 2008: HBase became a subproject of Hadoop.
• Oct 2008: HBase 0.18.1 was released.
• Jan 2009: HBase 0.19.0 was released.
• Sept 2009: HBase 0.20.0 was released.
• May 2010: HBase became an Apache top-level project.
HBase in the Hadoop Ecosystem
HBase Table – To store data
HBase: Keys and Column Families
Example: Storage Mechanism in HBase
[Figure labels: Region Server, HDFS]
HBase Regions and Region Servers
⮚When an HBase Region Server receives read and write requests from a client, it routes each request to the specific region where the relevant column family resides.
⮚Clients can contact HRegion servers directly; they do not need HMaster's permission to communicate with them.
⮚Clients need HMaster only for operations involving metadata and schema changes.
❖HMaster coordinates multiple HRegion servers. An HMaster runs in one of two roles:
1. Active HMaster
2. Inactive HMaster
❖ HBase META Table
• The META table is a special HBase catalog table that holds the locations of the regions in the HBase cluster.
• It keeps a list of all regions in the system.
• The structure of the .META. table is as follows:
• Key: region start key, region id
• Value: RegionServer
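As a rough illustration of how a catalog keyed by region start key can be used to find the server for a row, the sketch below does a binary search over sorted start keys. The catalog contents and server names are hypothetical, and the real META table holds more information than this toy list.

```python
import bisect

# Hypothetical catalog: (region start key, server name), sorted by start key.
# An empty start key marks the first region of the table.
META = [
    ("", "regionserver-1"),
    ("g", "regionserver-2"),
    ("p", "regionserver-3"),
]
START_KEYS = [start for start, _ in META]


def locate_region_server(row_key):
    """Return the server whose region has the greatest start key <= row_key."""
    index = bisect.bisect_right(START_KEYS, row_key) - 1
    return META[index][1]


print(locate_region_server("apple"))
print(locate_region_server("grape"))
print(locate_region_server("zebra"))
```

Because regions cover contiguous, sorted key ranges, one lookup in the sorted start keys is enough to route a request, which is why a client can go straight to the right server once it knows the region locations.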
HBase Table parameters
HDFS vs. HBase
HDFS: ideally suited for write-once, read-many-times use cases.
HBase: ideally suited for random writes and reads of data that is stored in HDFS.
HBase - Read
• A read against HBase must be reconciled between the HFiles, the MemStore, and the Block Cache.
• The Block Cache is designed to keep frequently accessed data from the HFiles in memory so as to avoid disk reads.
• The Block Cache is shared across a Region Server, while each column family has its own MemStore.
• Block: the smallest indexed unit of data and the smallest unit of data that can be read from disk. The default size is 64 KB.
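The reconciliation step can be sketched as follows. This is a simplified toy under stated assumptions: cells are `(timestamp, value)` tuples, plain dicts stand in for the MemStore, Block Cache, and HFiles, and the cache stores individual cells rather than 64 KB blocks. The function name `read_cell` is made up for the example.

```python
def read_cell(key, memstore, block_cache, hfiles):
    """Return (value, source) for the newest version of key, or (None, None).

    memstore    : dict key -> (timestamp, value), recent unflushed writes
    block_cache : dict key -> (timestamp, value), cached data from HFiles
    hfiles      : list of dicts key -> (timestamp, value), stand-ins for disk files
    """
    candidates = []
    if key in memstore:
        candidates.append((memstore[key], "memstore"))
    if key in block_cache:
        # Cache hit: no need to touch the on-disk files.
        candidates.append((block_cache[key], "block cache"))
    else:
        on_disk = [hfile[key] for hfile in hfiles if key in hfile]
        if on_disk:
            newest_on_disk = max(on_disk, key=lambda cell: cell[0])
            block_cache[key] = newest_on_disk  # remember it for later reads
            candidates.append((newest_on_disk, "hfile"))
    if not candidates:
        return None, None
    (_, value), source = max(candidates, key=lambda item: item[0][0])
    return value, source


memstore = {"row1": (3, "new")}
block_cache = {}
hfiles = [{"row1": (1, "old"), "row2": (1, "a")}, {"row2": (2, "b")}]

print(read_cell("row1", memstore, block_cache, hfiles))
print(read_cell("row2", memstore, block_cache, hfiles))
print(read_cell("row2", memstore, block_cache, hfiles))
```

The second read of `row2` is served from the cache, which is the disk read the Block Cache is meant to avoid.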
HBase - Write
When a write is made, by default it goes into two places:
⮚the write-ahead log (WAL), also called the HLog.
⮚the in-memory write buffer, the MemStore.
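A minimal sketch of this two-place write, with a flush to an immutable sorted file and a WAL replay after a crash, might look like the following. The class `ToyRegionStore` and its count-based `flush_threshold` are assumptions made for illustration; real flushes are triggered by memory size, not by a number of entries.

```python
class ToyRegionStore:
    """Toy write path: WAL first, then MemStore, flushed to sorted snapshots."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log of edits not yet flushed
        self.memstore = {}   # in-memory write buffer
        self.hfiles = []     # flushed, immutable, sorted snapshots
        self.flush_threshold = flush_threshold  # hypothetical, entry-count based

    def put(self, key, value):
        self.wal.append((key, value))  # 1. log the edit for durability
        self.memstore[key] = value     # 2. buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the MemStore out as one sorted snapshot and start afresh."""
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.wal = []  # flushed edits no longer need to be replayed

    @classmethod
    def recover(cls, wal, hfiles, flush_threshold=3):
        """Rebuild the MemStore after a crash by replaying the surviving WAL."""
        store = cls(flush_threshold)
        store.hfiles = list(hfiles)
        store.wal = list(wal)
        for key, value in wal:
            store.memstore[key] = value
        return store


store = ToyRegionStore(flush_threshold=3)
store.put("a", 1)
store.put("b", 2)

# Simulate a crash: the MemStore is lost, but the WAL and flushed files survive.
recovered = ToyRegionStore.recover(store.wal, store.hfiles)
print(recovered.memstore)

store.put("c", 3)  # the third entry reaches the threshold and triggers a flush
print(store.hfiles)
```

Logging before buffering is what makes the in-memory buffer safe to lose: anything in the MemStore can be rebuilt from the log.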
Advantages of HBase
❖ HBase is designed to store denormalized data.
❖ HBase supports automatic partitioning.
❖ Strong consistency model: once a write returns, all readers see the same value.
❖ Scales automatically
– When a region grows too large, it splits automatically.
– It uses HDFS to spread and replicate data.
❖ Built-in recovery: it uses the Write-Ahead Log for recovery.
❖ Integrated with Hadoop.
❖ HBase has a flexible schema: only column families are defined up front, not individual columns.
❖ HBase can perform random read and write operations.
❖ HBase provides data replication across clusters for higher availability.
❖ Random access: data is stored in indexed files on HDFS for faster lookups and searches.
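The automatic region split mentioned in the list above can be illustrated with a small sketch. It assumes a region is just a sorted list of row keys and that the split point is the middle row; the function `split_region` and the row-count limit `max_rows` are hypothetical, since real splits are driven by region size.

```python
def split_region(rows, max_rows):
    """Split a region into two halves once it holds more than max_rows rows.

    Returns a list of regions, each a sorted list of row keys.
    """
    rows = sorted(rows)
    if len(rows) <= max_rows:
        return [rows]
    middle = len(rows) // 2
    return [rows[:middle], rows[middle:]]


print(split_region(["d", "a", "c", "b"], max_rows=3))
print(split_region(["a", "b"], max_rows=3))
```

Because rows are kept sorted, each half is still a contiguous key range, so the two new regions can be served by different servers without reshuffling individual rows.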
Disadvantages of HBase
• Single point of failure: in a cluster with a single HMaster and no standby, if the HMaster goes down, master-coordinated work such as region assignment and schema changes stops.
• No SQL support: HBase cannot run SQL-style functions and does not support SQL structure.
• It does not contain a query optimizer.
• It does not support multi-row transactions.
• Business continuity and reliability
– Write-Ahead Log replay is very slow.
– Crash recovery is slow and complex.
• Joins and normalization are very difficult to perform.
• Large binary data is very difficult to store.
Real Time Example of HBase-Facebook
User Account ID | Account Type (Personal/Business) | Type of Contents Posted | Posted for (Public/Private) | Time Stamp | Violating community standards | Last Login Activity Time of Account