HBase - Tutorial

Uploaded by

ucebittrichy2020

HBase - Overview

Since the 1970s, the RDBMS has been the standard solution for data storage and maintenance
problems. After the advent of big data, companies realized the benefit of processing it and
started opting for solutions like Hadoop.

Hadoop uses a distributed file system to store big data and MapReduce to process it. Hadoop
excels at storing and processing huge volumes of data in various formats, whether arbitrary,
semi-structured, or even unstructured.

Limitations of Hadoop

Hadoop can perform only batch processing, and data is accessed only sequentially. That means
the entire dataset must be scanned even for the simplest of jobs.

Processing a huge dataset produces another huge dataset, which must also be processed
sequentially. At this point, a new solution is needed to access any point of data in a single unit of
time (random access).

Hadoop Random Access Databases

HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases that store huge
amounts of data and allow it to be accessed randomly.

What is HBase?

HBase is a distributed, column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.

HBase has a data model similar to Google's Bigtable and is designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by the
Hadoop Distributed File System (HDFS).

It is a part of the Hadoop ecosystem that provides random, real-time read/write access to data in
HDFS.

Data can be stored in HDFS either directly or through HBase. Data consumers read and access
the data in HDFS randomly using HBase, which sits on top of HDFS and provides read and
write access.

HBase and HDFS

HDFS                                          HBase

HDFS is a distributed file system suitable    HBase is a database built on top of HDFS.
for storing large files.

HDFS does not support fast individual         HBase provides fast lookups for larger tables.
record lookups.

It provides high-latency batch processing.    It provides low-latency access to single rows
                                              from billions of records (random access).

It provides only sequential access to data.   HBase provides random access and stores its
                                              data in indexed HDFS files for faster lookups.

Storage Mechanism in HBase

HBase is a column-oriented database, and its tables are sorted by row key. The table schema
defines only column families, which hold key-value pairs. A table can have multiple column
families, and each column family can have any number of columns. Column values within a
family are stored contiguously on disk. Each cell value in the table has a timestamp. In short, in
HBase:

• A table is a collection of rows.
• A row is a collection of column families.
• A column family is a collection of columns.
• A column is a collection of key-value pairs.

Given below is an example schema of a table in HBase.

Rowid    Column Family     Column Family     Column Family     Column Family
         col1 col2 col3    col1 col2 col3    col1 col2 col3    col1 col2 col3
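
The hierarchy described above (table, row, column family, column, timestamped value) can be
pictured as nested maps. The following is a minimal Python sketch of that logical layout, not
HBase code: the sample contents, the small integer timestamps, and the get_cell helper are
illustrative assumptions.

```python
# Toy model of HBase's logical layout (illustration only, not the HBase API):
# table -> row key -> column family -> column qualifier -> {timestamp: value}
table = {
    "row1": {
        "personal": {
            "name": {1: "raju"},
            "city": {1: "hyderabad", 2: "delhi"},  # two versions of one cell
        },
        "professional": {
            "salary": {1: "50000"},
        },
    },
}


def get_cell(table, row, family, qualifier):
    """Return the value with the highest timestamp, or None if the cell is absent."""
    versions = table.get(row, {}).get(family, {}).get(qualifier, {})
    if not versions:
        return None
    return versions[max(versions)]


print(get_cell(table, "row1", "personal", "city"))       # newest version of the cell
print(get_cell(table, "row1", "professional", "bonus"))  # column that was never written
```

Because columns live inside the per-row maps rather than in the schema, a row can carry columns
that other rows lack; only the column families are fixed up front.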

Column Oriented and Row Oriented

Column-oriented databases store data tables as sections of columns of data rather than as rows
of data. In short, they organize data into column families.

Row-Oriented Database                         Column-Oriented Database

It is suitable for Online Transaction         It is suitable for Online Analytical
Processing (OLTP).                            Processing (OLAP).

Such databases are designed for a small       Column-oriented databases are designed for
number of rows and columns.                   huge tables.

[Figure: column families in a column-oriented database]


HBase and RDBMS

HBase                                         RDBMS

HBase is schema-less; it has no concept of    An RDBMS is governed by its schema, which
a fixed column schema and defines only        describes the whole structure of tables.
column families.

It is built for wide tables. HBase is         It is thin and built for small tables. It is
horizontally scalable.                        hard to scale.

HBase has no multi-row transactions.          An RDBMS is transactional.

It has denormalized data.                     It has normalized data.

It is good for semi-structured as well as     It is good for structured data.
structured data.

Features of HBase

• HBase is linearly scalable.
• It has automatic failover support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy-to-use Java API for clients.
• It provides data replication across clusters.

Where to Use HBase

• Apache HBase is used for random, real-time read/write access to big data.
• It hosts very large tables on top of clusters of commodity hardware.
• Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable
  runs on top of the Google File System, Apache HBase runs on top of Hadoop and HDFS.

Applications of HBase

• It is used whenever there is a need for write-heavy applications.
• HBase is used whenever fast random access to available data is needed.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

HBase History

Year         Event

Nov 2006     Google released the paper on Bigtable.
Feb 2007     The initial HBase prototype was created as a Hadoop contribution.
Oct 2007     The first usable HBase, along with Hadoop 0.15.0, was released.
Jan 2008     HBase became a subproject of Hadoop.
Oct 2008     HBase 0.18.1 was released.
Jan 2009     HBase 0.19.0 was released.
Sept 2009    HBase 0.20.0 was released.
May 2010     HBase became an Apache top-level project.

Architecture of HBase

HBase architecture has three main components: HMaster, Region Server, and ZooKeeper.

MasterServer

The master server:

• Assigns regions to the region servers, with the help of Apache ZooKeeper.
• Handles load balancing of the regions across region servers. It unloads busy servers
  and shifts regions to less occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations, such as the creation of
  tables and column families.

Regions

Regions are the pieces into which tables are split. Each region holds a contiguous range of a
table's rows, and the regions are spread across the region servers.
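
Because each region covers a contiguous, sorted range of row keys, finding the region for a row
is a range lookup. The Python sketch below illustrates only that idea; the split points, the
server names, and the locate_region helper are hypothetical and do not reflect HBase's actual
client code.

```python
from bisect import bisect_right

# Hypothetical split points: each region starts at one of these row keys.
# The first region starts at "" (the empty key sorts before every other key).
region_start_keys = ["", "g", "n", "t"]
region_servers = ["rs1", "rs2", "rs3", "rs1"]  # assumed region-to-server assignment


def locate_region(row_key):
    """Return (region index, server) for the region whose key range holds row_key."""
    # bisect_right counts the start keys <= row_key; the last of those
    # is the start key of the region that contains the row.
    index = bisect_right(region_start_keys, row_key) - 1
    return index, region_servers[index]


print(locate_region("apple"))
print(locate_region("n"))
print(locate_region("zebra"))
```

One server can host several regions, which is why "rs1" appears twice in the assumed assignment.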

Region server

The region servers host regions and:

• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under them.
• Decide the size of each region by following the region size thresholds.

Looking deeper into a region server, it contains regions, and each region contains stores.

The store contains a memstore and HFiles. The memstore works like a write cache: anything
written to HBase is stored there initially. Later, the memstore is flushed, and its data is
saved in HFiles as blocks.
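
The write path described above (buffer in the memstore, then flush to HFiles) can be modeled in
a few lines. The following Python sketch is a simplification under stated assumptions: the Store
class, its tiny flush_threshold, and the tuple-based "files" are illustrative and are not how
HBase is implemented.

```python
class Store:
    """Toy store: an in-memory buffer plus immutable, sorted 'files'."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # assumed tiny limit for the demo
        self.memstore = {}   # recent writes, held in memory
        self.hfiles = []     # flushed snapshots, oldest first

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out as a sorted, read-only snapshot and clear it.
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        # Newest data first: the memstore, then files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for stored_key, value in hfile:
                if stored_key == key:
                    return value
        return None


store = Store()
for i, city in enumerate(["hyderabad", "chennai", "delhi"], start=1):
    store.put(f"row{i}", city)   # the third put reaches the threshold and flushes
store.put("row1", "mumbai")      # the newer value stays in the memstore
print(len(store.hfiles), store.memstore, store.get("row1"), store.get("row2"))
```

Reads consult the memstore before the flushed files, so the most recent write for a key is the
one returned even though an older value still exists in a file.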

ZooKeeper

• ZooKeeper is an open-source project that provides services such as maintaining configuration
  information, naming, and distributed synchronization.
• ZooKeeper has ephemeral nodes representing the different region servers. Master servers use
  these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server failures and network
  partitions.
• Clients use ZooKeeper to locate region servers and then communicate with them directly.
• In pseudo-distributed and standalone modes, HBase itself manages ZooKeeper.

HBase - General Commands


The general commands in HBase are status, version, table_help, and whoami. This chapter
explains these commands.

status

This command returns the status of the system including the details of the servers running on the
system. Its syntax is as follows:

hbase(main):009:0> status

If you execute this command, it returns the following output.

hbase(main):009:0> status
3 servers, 0 dead, 1.3333 average load

version

This command returns the version of HBase used in your system. Its syntax is as follows:

hbase(main):010:0> version

If you execute this command, it returns the following output.

hbase(main):010:0> version
0.98.8-hadoop2, r6cfc8d064754251365e070a10a82eb169956d5fe, Fri Nov 14 18:26:29 PST 2014

table_help

This command explains what the table-reference commands are and how to use them. Given below
is the syntax to use this command.

hbase(main):002:0> table_help

When you use this command, it shows help topics for table-related commands. Given below is
the partial output of this command.

hbase(main):002:0> table_help
Help for table-reference commands.
You can either create a table via 'create' and then manipulate the table
via commands like 'put', 'get', etc.
See the standard help information for how to use each of these commands.
However, as of 0.96, you can also get a reference to a table, on which
you can invoke commands.
For instance, you can get create a table and keep around a reference to
it via:
hbase> t = create 't', 'cf'…...

whoami

This command returns the user details of HBase. If you execute this command, it returns the
current HBase user as shown below.

hbase(main):008:0> whoami
hadoop (auth:SIMPLE)
groups: hadoop

Some of the Commands:


Creating a Table using HBase Shell

You can create a table using the create command. Here you must specify the table name and the
column family name. The syntax to create a table in the HBase shell is shown below.

create '<table name>', '<column family>'

Example

Given below is a sample schema of a table named emp. It has two column families: “personal
data” and “professional data”.

Row key personal data professional data


You can create this table in HBase shell as shown below.

hbase(main):002:0> create 'emp', 'personal data', 'professional data'

And it will give you the following output.

0 row(s) in 1.1300 seconds

=> Hbase::Table - emp

Verification

You can verify whether the table is created using the list command as shown below. Here you
can observe the created emp table.

hbase(main):002:0> list
TABLE
emp
1 row(s) in 0.0340 seconds

Listing a Table using HBase Shell

list is the command used to list all the tables in HBase. Given below is the syntax of the
list command.

hbase(main):001:0> list

When you run this command at the HBase prompt, it displays the list of all the
tables in HBase as shown below.

hbase(main):001:0> list
TABLE
emp

Dropping a Table using HBase Shell

Using the drop command, you can delete a table. Before dropping a table, you have to disable it.

hbase(main):018:0> disable 'emp'
0 row(s) in 1.4580 seconds

hbase(main):019:0> drop 'emp'
0 row(s) in 0.3060 seconds

Verify whether the table is deleted using the exists command.

hbase(main):020:0> exists 'emp'
Table emp does not exist
0 row(s) in 0.0730 seconds

drop_all

This command is used to drop the tables matching the regex given in the command. Its syntax
is as follows:

hbase> drop_all 't.*'

Note: Before dropping a table, you must disable it.

Example

Assume there are tables named raja, rajani, rajendra, rajesh, and raju.

hbase(main):017:0> list
TABLE
raja
rajani
rajendra
rajesh
raju
5 row(s) in 0.0270 seconds

All these tables start with the letters raj. First, let us disable all these tables using the
disable_all command as shown below.

hbase(main):002:0> disable_all 'raj.*'
raja
rajani
rajendra
rajesh
raju
Disable the above 5 tables (y/n)?
y
5 tables successfully disabled

Now you can delete all of them using the drop_all command as given below.

hbase(main):018:0> drop_all 'raj.*'
raja
rajani
rajendra
rajesh
raju
Drop the above 5 tables (y/n)?
y
5 tables successfully dropped
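
Since disable_all and drop_all act on every table the pattern selects, it can help to preview
the selection first. The Python sketch below approximates it under two assumptions: that the
pattern must match the whole table name, and that the pattern is simple enough to behave the
same in Python's re module as in the Java regular expressions HBase uses. The extra table name
emp and the matching_tables helper are added here purely for illustration.

```python
import re

# Table names from the example, plus one assumed name that should not match.
tables = ["raja", "rajani", "rajendra", "rajesh", "raju", "emp"]


def matching_tables(pattern, names):
    """Return the names that the pattern matches in full (assumed semantics)."""
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


print(matching_tables("raj.*", tables))  # the five raj... tables
print(matching_tables("raj", tables))    # no table is named exactly "raj"
```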

Inserting Data using HBase Shell

This chapter demonstrates how to create data in an HBase table. To create data in an HBase
table, the following commands and methods are used:

• the put command,
• the add() method of the Put class, and
• the put() method of the HTable class.

As an example, we are going to populate the emp table with three rows; the complete contents
appear in the scan output later in this section.

Using the put command, you can insert rows into a table. Its syntax is as follows:

put '<table name>', 'row1', '<colfamily:colname>', '<value>'

Inserting the First Row

Let us insert the first row values into the emp table as shown below.

hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.6600 seconds
hbase(main):006:0> put 'emp','1','personal data:city','hyderabad'
0 row(s) in 0.0410 seconds
hbase(main):007:0> put 'emp','1','professional data:designation','manager'
0 row(s) in 0.0240 seconds
hbase(main):008:0> put 'emp','1','professional data:salary','50000'
0 row(s) in 0.0240 seconds

Insert the remaining rows using the put command in the same way. After inserting the whole table,
a scan gives the following output.

hbase(main):022:0> scan 'emp'
ROW   COLUMN+CELL
1     column=personal data:city, timestamp=1417524216501, value=hyderabad
1     column=personal data:name, timestamp=1417524185058, value=raju
1     column=professional data:designation, timestamp=1417524232601, value=manager
1     column=professional data:salary, timestamp=1417524244109, value=50000
2     column=personal data:city, timestamp=1417524574905, value=chennai
2     column=personal data:name, timestamp=1417524556125, value=ravi
2     column=professional data:designation, timestamp=1417524592204, value=sr:engg
2     column=professional data:salary, timestamp=1417524604221, value=30000
3     column=personal data:city, timestamp=1417524681780, value=delhi
3     column=personal data:name, timestamp=1417524672067, value=rajesh
3     column=professional data:designation, timestamp=1417524693187, value=jr:engg
3     column=professional data:salary, timestamp=1417524702514, value=25000
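
A scan returns rows sorted by row key, and HBase compares row keys as bytes rather than as
numbers. With the single-digit keys used here the two orders coincide, but they diverge once
keys reach two digits. A short Python sketch of the effect, using made-up keys and an assumed
fixed width of 4 for the padded variant:

```python
# Row keys are byte strings; sorting them is lexicographic, not numeric.
row_keys = [b"1", b"2", b"10", b"3"]
print(sorted(row_keys))

# Zero-padding to a fixed width (assumed here to be 4) keeps
# numeric order and byte order aligned.
padded = [key.zfill(4) for key in row_keys]
print(sorted(padded))
```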

Updating Data using HBase Shell

You can update an existing cell value using the put command. To do so, follow the same
syntax and supply the new value as shown below.

put '<table name>', '<row>', '<column family:column name>', '<new value>'

The new value supersedes the existing one: HBase stores it as a newer version of the cell, and
reads return the latest version.
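
An update does not overwrite the old value in place: each put adds a new timestamped version of
the cell, and a read returns the newest one. How many older versions are kept depends on the
column family's configuration. The Python model below is only a sketch of that behaviour; the
Cell class is hypothetical, max_versions stands in for the column family setting, and the toy
discards old versions immediately, which is a simplification.

```python
class Cell:
    """Toy versioned cell: keeps the newest max_versions values."""

    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        self.versions = []  # (timestamp, value) pairs, newest first

    def put(self, timestamp, value):
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda pair: pair[0], reverse=True)
        del self.versions[self.max_versions:]  # discard versions beyond the limit

    def get(self):
        """Return the value of the newest version, or None for an empty cell."""
        return self.versions[0][1] if self.versions else None


city = Cell(max_versions=2)
city.put(1418275907, "Hyderabad")
city.put(1418274645907, "Delhi")   # larger timestamp, so this is the newest version
print(city.get())
print(city.versions)
```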

Example

Suppose there is a table in HBase called emp with the following data.

hbase(main):003:0> scan 'emp'
ROW COLUMN + CELL
row1 column = personal:name, timestamp = 1418051555, value = raju
row1 column = personal:city, timestamp = 1418275907, value = Hyderabad
row1 column = professional:designation, timestamp = 14180555, value = manager
row1 column = professional:salary, timestamp = 1418035791555, value = 50000
1 row(s) in 0.0100 seconds

The following command will update the city value of the employee named ‘Raju’ to Delhi.

hbase(main):002:0> put 'emp','row1','personal:city','Delhi'
0 row(s) in 0.0400 seconds

The updated table looks as follows where you can observe the city of Raju has been changed to
‘Delhi’.

hbase(main):003:0> scan 'emp'
ROW COLUMN + CELL
row1 column = personal:name, timestamp = 1418035791555, value = raju
row1 column = personal:city, timestamp = 1418274645907, value = Delhi
row1 column = professional:designation, timestamp = 141857555, value = manager
row1 column = professional:salary, timestamp = 1418039555, value = 50000
1 row(s) in 0.0100 seconds

Deleting a Specific Cell in a Table

Using the delete command, you can delete a specific cell in a table. The syntax of the delete
command is as follows:

delete '<table name>', '<row>', '<column name>', '<time stamp>'

Example

Here is an example of deleting a specific cell. Here we are deleting the city.

hbase(main):006:0> delete 'emp', '1', 'personal data:city', 1417521848375
0 row(s) in 0.0060 seconds

Deleting All Cells in a Row

Using the deleteall command, you can delete all the cells in a row. Given below is the syntax
of the deleteall command.

deleteall '<table name>', '<row>'

Example

Here is an example of the deleteall command, where we are deleting all the cells of row 1 of the
emp table.

hbase(main):007:0> deleteall 'emp','1'
0 row(s) in 0.0240 seconds

Verify the table using the scan command. A snapshot of the table after deleting the row is given
below.

hbase(main):022:0> scan 'emp'
ROW   COLUMN + CELL
2     column = personal data:city, timestamp = 1417524574905, value = chennai
2     column = personal data:name, timestamp = 1417524556125, value = ravi
2     column = professional data:designation, timestamp = 1417524204, value = sr:engg
2     column = professional data:salary, timestamp = 1417524604221, value = 30000
3     column = personal data:city, timestamp = 1417524681780, value = delhi
3     column = personal data:name, timestamp = 1417524672067, value = rajesh
3     column = professional data:designation, timestamp = 1417523187, value = jr:engg
3     column = professional data:salary, timestamp = 1417524702514, value = 25000
