Chap4_BigDataStorageAndManagement
Yijie Zhang
New Jersey Institute of Technology
Reminder -- Apache Hadoop
The Apache Hadoop® project develops open-source software for reliable, scalable,
distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than relying on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so
delivering a highly-available service on top of a cluster of computers, each of which may
be prone to failures.
http://hadoop.apache.org
Reminder -- Hadoop-related Apache Projects
http://hortonworks.com/hadoop/hdfs/
• Namenode: This is the daemon that runs on the master. The NameNode stores metadata such as file names, the
number of blocks, the number of replicas, the locations of blocks, block IDs, etc. This metadata is kept in
memory on the master for faster retrieval, and a copy is kept on the local disk for persistence. The NameNode
should therefore be provisioned with enough memory for the cluster's needs.
• Datanode: This is the daemon that runs on the slaves. These are the actual worker nodes that store the data.
Data Storage Operations on HDFS
HDFS blocks
• A file is divided into blocks (default: 64 MB in Hadoop 1, 128 MB in Hadoop 2) and
replicated in multiple places (default replication factor: 3, which can be changed as
required by editing the configuration file hdfs-site.xml)
• Dividing files into blocks is normal for a native file system (e.g., the default block size
in Linux is 4 KB); what sets HDFS apart is the scale.
• Hadoop was designed to operate at the petabyte scale.
• Every data block stored in HDFS has its own metadata and needs to be tracked by a
central server.
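The block size and replication factor mentioned above are set in hdfs-site.xml; a minimal sketch (the property names are the real HDFS keys, the values are illustrative):

```xml
<!-- hdfs-site.xml: illustrative values -->
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- 128 MB -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```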
Why replicate? How to set the replication number?
• Reliability
• Performance
• When HDFS stores the replicas of the original blocks across the Hadoop cluster, it tries
to ensure that the block replicas are stored at different failure points.
• Rack-aware replica placement to improve data reliability, availability, and network
bandwidth utilization, in a hierarchical (multi-rack, multi-node) architecture
• NameNode places replicas of a block on multiple racks for improved fault tolerance.
• tries to place replicas of a block on more than one rack, so that if a complete
rack goes down, the data will still be available from other racks.
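As a rough sketch of the default placement policy (first replica on the writer's node, second on a node in a different rack, third on a different node in that same remote rack), assuming a toy two-rack cluster map; node and rack names are made up:

```python
def place_replicas(writer, cluster):
    """Toy version of HDFS's default 3-replica placement.
    cluster: dict mapping rack name -> list of node names."""
    rack_of = {n: r for r, nodes in cluster.items() for n in nodes}
    first = writer                                    # replica 1: the writer's local node
    remote_rack = next(r for r in cluster if r != rack_of[writer])
    second = cluster[remote_rack][0]                  # replica 2: a node on a different rack
    third = next(n for n in cluster[remote_rack]
                 if n != second)                      # replica 3: same remote rack, different node
    return [first, second, third]

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas("n1", cluster))  # ['n1', 'n3', 'n4']
```

Note how the three replicas end up on two racks: losing either rack still leaves at least one copy.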
HDFS is a User-Space-Level file system
Interaction between HDFS components
HDFS Federation
• Before Hadoop 2.0
• The NameNode was a single point of failure and a scalability limitation.
• Hadoop clusters could hardly scale beyond 3,000 or 4,000 nodes.
• In Hadoop 2.x
• Multiple NameNodes can be used (HDFS High Availability feature – one is
in an Active state, the other one is in a Standby state).
http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
High Availability of the NameNodes
• Active NameNode
• Standby NameNode – keeps the state of the block locations and block metadata in
memory and takes over the HDFS checkpointing responsibilities.
• JournalNode – if a failure occurs, the Standby Node reads all completed journal entries to
ensure the new Active NameNode is fully consistent with the state of the cluster.
• Zookeeper – provides coordination and configuration services for distributed systems.
Data Compression in HDFS
HDFS Commands
• For HDFS, the schema name is hdfs, and for the local file system, the schema name is file.
• A file or directory in HDFS can be specified in a fully qualified way, such as:
hdfs://namenodehost/parent/child or hdfs://namenodehost
• The HDFS file system shell is similar to the Linux file commands, with the following
general syntax: hadoop fs -<cmd> or hdfs dfs -<cmd>
Some useful commands for HDFS
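A few representative shell commands, following the syntax above (the paths and file names are hypothetical examples):

```
hdfs dfs -mkdir /user/alice              # create a directory
hdfs dfs -put data.csv /user/alice       # copy a local file into HDFS
hdfs dfs -ls /user/alice                 # list directory contents
hdfs dfs -cat /user/alice/data.csv       # print a file's contents
hdfs dfs -get /user/alice/data.csv .     # copy a file back to the local file system
hdfs dfs -setrep 2 /user/alice/data.csv  # change a file's replication factor
hdfs dfs -rm /user/alice/data.csv        # delete a file
```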
Big Data Management
• File-based Data Management (Data Encoding Format)
• JSON
• XML
• CSV
• Hierarchical Data Format (HDF4/5)
• Network Common Data Form (netCDF)
• System-based Data Management (NoSQL Database)
• Key-Value Store (the most primitive building block of the others)
• Document Store (MongoDB)
• Tabular Store (HBase)
• Object Database
• Graph Database
• Property graphs
• Resource Description Framework (RDF) graphs
Commonly Used Data Encoding Formats
JSON (JavaScript Object Notation, .json)
• An open-standard, language-independent data format
• Uses text to transmit data objects: attribute–value pairs and array data types
• Used for asynchronous browser–server communication
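A quick round trip through Python's standard json module illustrates the attribute–value pairs and arrays above (the record itself is a made-up example):

```python
import json

# A hypothetical sensor record: attribute-value pairs plus an array
record = {"sensor": "t-101", "readings": [21.5, 21.7, 22.0], "ok": True}

text = json.dumps(record)     # serialize to a JSON string for transmission
parsed = json.loads(text)     # parse it back on the receiving side
print(parsed["readings"][1])  # 21.7
```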
Commonly Used Data Encoding Formats
XML (Extensible Markup Language, .xml)
Commonly Used Data Encoding Formats
CSV (Comma-Separated Values, .csv)
• A delimited data format
• Fields/columns are separated by the comma character
• Records/rows are terminated by newlines
• All records have the same number of fields in the same order
• Any field may be quoted
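The quoting rule matters when a field contains the delimiter itself; a small sketch with Python's standard csv module (the data is made up):

```python
import csv
import io

# A small CSV with a header row; the quoted field contains a comma
data = 'name,city,age\n"Doe, Jane",Newark,34\nJohn,Jersey City,29\n'

rows = list(csv.reader(io.StringIO(data)))
print(rows[1])  # ['Doe, Jane', 'Newark', '34'] -- quoting protects the embedded comma
```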
Commonly Used Data Encoding Formats
Hierarchical Data Format (HDF4/5, .hdf)
• A set of file formats (HDF4, HDF5) designed to store and organize large amounts of data
• HDF5 simplifies the file structure to include only two major types
• Datasets, which are multidimensional arrays of a homogeneous type
• Groups, which are container structures that can hold datasets and other groups
Commonly Used Data Encoding Formats
Network Common Data Form (netCDF, .nc)
• A set of self-describing, machine-independent data formats that support the creation,
access, and sharing of array-oriented scientific data
• Starting with version 4.0, the netCDF API allows the use of the HDF5 data format
• An extension of netCDF for parallel computing called Parallel-NetCDF (or PnetCDF) has
been developed by Argonne National Laboratory and Northwestern University
System-based NoSQL Databases: Key-Value Store
• The most primitive and simplest of the NoSQL databases
• Use a one-way mapping from the key to the value to store a basic data item
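A Python dict is a convenient toy model of that one-way mapping: the store can look up by key, but cannot efficiently query by value (the keys here are made-up examples):

```python
# Toy key-value store: one-way mapping from key to value
store = {}

store["user:42:name"] = "Alice"    # put
store["user:42:email"] = "a@x.org"
print(store.get("user:42:name"))   # get -> Alice
del store["user:42:email"]         # delete
print(store.get("user:42:email"))  # missing key -> None
```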
NoSQL: Document Store
NoSQL: Graph Database
Graph Models
• Labeled-Property Graphs
• Represented by a set of nodes, relationships, properties, and labels
• Both nodes of data and their relationships are named and can store properties represented by key/value pairs
• RDF (Resource Description Framework: Triplestore) Graphs
Apache TinkerPop is a graph computing framework for both
graph databases (OLTP: Online Transactional Processing) and
graph analytic systems (OLAP: Online Analytical Processing).
How is a graph stored?
• Linked list (adjacency list)
• Adjacency matrix
TinkerPop-enabled systems include:
• Amazon Neptune - Fast, reliable graph database built for the cloud.
• ArangoDB - OLTP provider for ArangoDB.
• Bitsy - A small, fast, embeddable, durable in-memory graph database.
• Blazegraph - RDF graph database with OLTP support.
• CosmosDB - Microsoft's distributed OLTP graph database.
• ChronoGraph - A versioned graph database.
• DSEGraph - DataStax graph database with OLTP and OLAP support.
• GRAKN.AI - Distributed OLTP/OLAP knowledge graph system.
• Hadoop (Spark) - OLAP graph processor using Spark.
• HGraphDB - OLTP graph database running on Apache HBase.
• Huawei Graph Engine Service - Fully-managed, distributed, at-scale graph query/analysis service that provides a visualized interactive analytics platform.
• IBM Graph - OLTP graph database as a service.
• JanusGraph - Distributed OLTP and OLAP graph database with BerkeleyDB, Apache Cassandra and Apache HBase support.
• JanusGraph (Amazon) - The Amazon DynamoDB storage backend for JanusGraph.
• Neo4j - OLTP graph database (embedded and high availability).
• neo4j-gremlin-bolt - OLTP graph database (using the Bolt Protocol).
• OrientDB - OLTP graph database.
• Apache S2Graph - OLTP graph database running on Apache HBase.
• Sqlg - OLTP implementation on SQL databases.
• Stardog - RDF graph database with OLTP and OLAP support.
• TinkerGraph - In-memory OLTP and OLAP reference implementation.
• Titan - Distributed OLTP and OLAP graph database with BerkeleyDB, Apache Cassandra and Apache HBase support.
• Titan (Amazon) - The Amazon DynamoDB storage backend for Titan.
• Titan (Tupl) - The Tupl storage backend for Titan.
• Unipop - OLTP Elasticsearch and JDBC backed graph.
NoSQL: Graph Database Use Cases
Knowledge Graphs
Example query:
I’m interested in The Mona Lisa.
Help me find other artworks
• by Leonardo da Vinci, or
• located in The Louvre.
[Figure: a knowledge graph linking entities such as Larry Page, Steve Jobs, Charles Flint, Google, IBM, Android, Linux, and OpenGL through labeled relationships (home, industry, kernel, died, etc.) with literal property values (dates, version numbers, places, employee counts)]
Example query:
What is the backlog of the product
purchased by User A?
Retrieving multi-step relationships is a "graph traversal" problem. (Cited: "Graph Databases", O'Reilly, 2013)
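The multi-step query above can be sketched as a walk over an adjacency-list graph; the node and relationship names below are hypothetical:

```python
# Hypothetical property graph as adjacency lists: node -> [(relationship, node)]
graph = {
    "UserA":    [("purchased", "Product1")],
    "Product1": [("has_backlog", "Backlog1")],
    "Backlog1": [],
}

def traverse(start, path):
    """Follow a list of relationship names from a start node."""
    node = start
    for rel in path:
        # step to the neighbor reached via this relationship
        node = next(n for r, n in graph[node] if r == rel)
    return node

print(traverse("UserA", ["purchased", "has_backlog"]))  # Backlog1
```

Each hop is a constant-time neighbor lookup, which is why graph databases handle multi-step relationship queries well.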
Preliminary datastore comparison for Recommendation & Visualization
NoSQL Tabular Database: HBase
• HBase is modeled after Google’s BigTable and written in Java, and is developed on top
of HDFS
Source: George, Lars. HBase: The Definitive Guide. O'Reilly Media, 2011.
[Figure: timeline from the BigTable paper through Hadoop's contribution to the HBase 0.92 release]
When to use HBase?
• Not suitable for every problem
– Compared to RDBMSs, it has VERY simple and limited APIs
• Good for large amounts of data
– 100s of millions or billions of rows
– If the data is too small, all the records will end up on a single node, leaving the rest of the cluster idle
• Must have enough hardware!!
– At the minimum 5 nodes
• There are multiple management daemon processes: Namenode, HBaseMaster, Zookeeper, etc....
• HDFS won't do well on anything under 5 nodes anyway; particularly with a block replication of 3
• HBase is memory and CPU intensive
• Carefully evaluate HBase for mixed workloads
– Client request (interactive, time-sensitive) vs. Batch processing (MapReduce)
• SLAs on client requests would need evaluation
– HBase has intermittent but large I/O access
• May affect response latency!
• Two well-known use cases
– Lots and lots of data (already mentioned)
– Large numbers of clients/requests (which usually generate a lot of data)
• Great for single random selects and range scans by key
• Great for variable schema
– Rows may drastically differ
– If your schema has many columns and most of them are null
When NOT to use HBase?
• Bad for traditional RDBMS retrieval
– Transactional applications
– Relational analytics
• 'group by', 'join', and 'where column like', etc....
• Currently bad for text-based search access
– There is work being done in this arena
• HBasene: https://github.com/akkumar/hbasene/wiki
• HBASE-3529: 100% integration of HBase and Lucene based on HBase's coprocessors
– Some projects provide solutions that use HBase
• Lily=HBase+Solr http://www.lilyproject.org
HBase Data Model
• Data is stored in Tables
• Tables contain rows
– Rows are referenced by a unique (Primary) key
• Key is an array of bytes – good news
• Anything can be a key: string, long and your own serialized data structures
• Rows made of columns
• Data is stored in cells
– Identified by “row x column-family:column”
– Cell’s content is also an array of bytes
HBase Families
• Columns are grouped into families
– Labeled as "family:column"
• Example: "user:first_name"
– A way to organize your data
– Various features are applied to families
• Compression
• In-memory option
• Stored together - in a file called HFile/StoreFile
• Family definitions are static
– Created with table, should be rarely added and changed
– Limited to a small number of families
• unlike columns, of which you can have millions
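The "row x column-family:column" addressing above can be mimicked with a dict keyed by (row, "family:column") tuples; the rows and columns below are made-up examples:

```python
# Toy model of HBase cells: (row_key, "family:column") -> value, all bytes
table = {}

table[(b"user1", "user:first_name")] = b"Ada"
table[(b"user1", "user:last_name")] = b"Lovelace"
table[(b"user2", "user:first_name")] = b"Alan"  # sparse: user2 stores no last_name cell

print(table[(b"user1", "user:first_name")])  # b'Ada'
```

Because only present cells are stored, rows may differ drastically in their columns, which is the "variable schema" point made earlier.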
HBase Distributed Architecture
• Table is made of regions
• Region – a range of rows stored together
– Single shard, used for scaling
– Dynamically split as they become too big and merged if too small
• Region Server – serves one or more regions
– A region is served by only 1 Region Server
• Master Server – daemon responsible for managing the HBase cluster, i.e., the Region
Servers
• HBase stores its data into HDFS
– Relies on HDFS’s high availability and fault-tolerance features
• HBase utilizes Zookeeper for distributed coordination
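Key-range sharding can be sketched with a sorted list of region start keys and a binary search; the boundaries below reuse the row-distribution example on the next slide:

```python
import bisect

# Each region covers [start_key, next_start_key); "" stands for the start of the table
region_starts = ["", "B12", "F35", "O91", "Z31"]  # sorted region start keys
regions = ["r1", "r2", "r3", "r4", "r5"]

def region_for(row_key):
    """Find the region whose key range contains row_key."""
    i = bisect.bisect_right(region_starts, row_key) - 1
    return regions[i]

print(region_for("A1"))   # r1
print(region_for("F34"))  # r2
print(region_for("Z55"))  # r5
```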
[Figure: ZooKeeper and the Master coordinate the Region Servers; each Region Server serves a set of regions (/hbase/region1 ... /hbase/regionx) and keeps a memstore]
Row Distribution Between Region Servers
[Figure: a logical view with all rows in a table (A1 ... Z55), split into sorted key ranges (null->B11, B12->F34, F35->O90, O91->Z30, Z31->null); each range is a region, and the regions are spread across three Region Servers]
HBase Architecture
HBase Deployment on HDFS
[Figure: deployment with a three-node ZooKeeper quorum and the master daemons: the HBase Master alongside the HDFS NameNode and Secondary NameNode]
Resources
• Home Page – http://hbase.apache.org
• Mailing Lists – http://hbase.apache.org/mail-lists.html (subscribe to the User list)
• Wiki – http://wiki.apache.org/hadoop/Hbase
• Videos and Presentations – http://hbase.apache.org/book.html#other.info
Books
• HBase: The Definitive Guide by Lars George (September 20, 2011)
• Apache HBase Reference Guide – comes packaged with HBase; http://hbase.apache.org/book/book.html
• Hadoop: The Definitive Guide by Tom White (May 22, 2012) – includes a chapter on HBase
Characteristics of Data in HBase
• Sparse data
• Multiple versions of data for each cell
HDFS lacks random read and write access; this is where HBase comes into the picture. It is
a distributed, scalable big data store, modeled after Google's BigTable, that stores data
as key/value pairs.
HBase Example
Create HBase table in Java
HBase Table Mapper and Reducer
Ingesting Data into HDFS/HBase – Apache Flume
Flume Features:
• Ingest log data from multiple web servers into a centralized store (HDFS, HBase) efficiently
• Import huge volumes of event data produced by social networking sites like Facebook and Twitter, and
e-commerce websites like Amazon and Flipkart, along with the log files
• Support a large set of source and destination types
• Support multi-hop flows, fan-in/fan-out flows, contextual routing, etc.
• Can be scaled horizontally
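A minimal Flume agent definition in its properties format might look like this: one source tailing a log file, a memory channel, and an HDFS sink (the agent/component names, log path, and HDFS path are made-up examples):

```properties
# flume.conf -- one source, one channel, one sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/access.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenodehost/flume/events
a1.sinks.k1.channel = c1
```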
Questions?