Big Data – Introduction to Hadoop
Hadoop
Lecture 2
Big Data Challenges
• Need to store lots of data – move from single shared storage to distributed storage
• Need faster processing – move from serial processing to parallel processing
• Need to deal with unstructured data too
Hadoop
• Hadoop is a framework for storing data on large
clusters of commodity hardware - everyday hardware
that is affordable and easily available - and running
applications against that data
• A cluster is a group of interconnected computers,
known as nodes, that work together on the same
problem
• Using networks of affordable compute resources to acquire business insight is what Hadoop is about
Hadoop is a solution to the Big Data
problem
• Hadoop is a technology to store massive datasets on a
cluster of cheap commodity machines in a distributed
manner
• It provides Big Data analytics through a distributed
computing framework
• Hadoop is an open source software platform used for
processing large volumes of data at massive scale. It also
handles unstructured data
• Hadoop architecture has 2 core components: storage,
known as the Hadoop Distributed File System (HDFS), and
computation, using MapReduce (MR).
Hadoop Ecosystem
Hadoop ecosystem – Apache Hive
• Open source system for querying and analysing large
datasets stored in Hadoop (HDFS) files.
• Hive has 3 main functions: Data summarisation, Query,
and Analysis
• Uses an SQL-type language called HiveQL (HQL). Hive
automatically translates SQL-like queries into
MapReduce jobs (parallel processing) which
execute on Hadoop
• HQL is a declarative language and operates on the server side
• Developed by Facebook
Hadoop ecosystem - Apache Pig
• Apache Pig is a high-level procedural language platform
for analysing and querying huge datasets that are stored in
HDFS
• Pig uses PigLatin language.
• It operates on the client side
• Developed by Yahoo
• Instead of writing Java code to implement MapReduce
you can choose either the Pig Latin or HiveQL language to
construct MapReduce programs.
– The benefit of coding in Pig and Hive is that it takes much fewer lines of
code, which reduces the overall development and testing time
Hadoop ecosystem – Apache HBase
• Hadoop performs batch processing, and data is accessed
sequentially. One has to search the entire dataset even for
the simplest of jobs – very slow!
• HBase provides random, real-time read/write access to data in
HDFS
• It is a distributed database that was designed to store billions
of rows and millions of columns.
• HBase is a scalable, distributed NoSQL database that is
built on top of HDFS.
• It originates from Google Bigtable
Hadoop Ecosystem
• Apache Sqoop imports data from external sources
into the Hadoop ecosystem, and also exports
data from Hadoop to other external sources.
Sqoop works with RDBMSs like Oracle and MySQL
• Apache Flume efficiently collects, aggregates and
moves large amounts of data from its origin
into HDFS. It is a fault-tolerant and
reliable mechanism, used for streaming logs and
event-based data
Hadoop Architecture
• Hadoop Common: Java libraries and utilities required by
other Hadoop modules. These libraries provide filesystem
and OS-level abstractions and contain the necessary Java files
and scripts required to start Hadoop
• Hadoop YARN: This is a framework for job scheduling and
cluster resource management
• Hadoop Distributed File System (HDFS): A distributed file
system that provides high throughput access to application
data
• Hadoop MapReduce: This is a YARN-based system for parallel
processing of large data sets
Hadoop Architecture
Hadoop - a reliable shared data storage and data analysis system
[Diagram: Hadoop = HDFS (storage) + MapReduce/YARN (compute) – responsible for massively parallel processing]
Hadoop
• A Hadoop cluster is a form of compute cluster used mainly for
computational purposes. It consists of many computers
(compute nodes) that can share computational
workloads and take advantage of a very large aggregate
bandwidth across the cluster
• Hadoop clusters typically consist of a few master nodes
(NameNodes), which control the storage and processing systems
in Hadoop, and many slave nodes (DataNodes), which store all
the cluster data and are also where the data gets processed
• Hadoop runs on a group of networked computers
• In Hadoop terminology each computer is a node
• The whole group is referred to as a cluster
Key to Hadoop’s Power
• Process data in parallel across thousands of
‘commodity’ hardware nodes
– Self-healing
• Failure handled by software
[Diagram: analysis cycle on a researcher's laptop (R, Matlab, SAS, etc): Acquire → Clean data → Model → Visualize → Measure/Evaluate]
Hadoop Architecture comprises three
major layers
• They are:-
– Data Storage: HDFS (Hadoop Distributed File
System)
– Resource management: YARN
– Compute: MapReduce
Why is Hadoop popular?
• Apache Hadoop is a platform for data storage as well as
processing. It is scalable (add more nodes on the fly) and
fault-tolerant (even if a node crashes, its data is processed by
another node).
• The following make it a powerful platform:
– Flexibility to store and mine any type of data, whether it is
unstructured, semi-structured or structured. It is not schema-on-write
and is not bound by a single schema
– Excels at processing complex data. It horizontally scales out
and divides workloads across many nodes
– Scales economically, as it can be deployed on commodity hardware.
Also, its open source nature guards against vendor lock-in
Hadoop Features
• 1. Reliability
If any node goes down it will not disable the whole cluster – another
node will take the place of the failed node. The Hadoop cluster will
continue functioning as if nothing has happened – it has a built-in fault
tolerance feature
• 2. Scalable
Keep adding commodity servers to scale out horizontally. If you are
installing Hadoop on the cloud it is even simpler, as you can easily
procure more hardware and expand the Hadoop cluster within minutes
• 3. Economical
Hadoop gets deployed on commodity hardware, which is cheap.
Moreover, Hadoop is open source software, so there is no licence
cost
Hadoop Features (Cont)
• 4. Distributed Processing
Any job submitted by a client gets divided into a number
of subtasks, and these subtasks are independent of each other.
They execute in parallel, giving high throughput.
• 5. Distributed Storage
Splits each file into a number of blocks. These blocks get
distributed across the cluster of machines.
• 6. Fault Tolerance
Replicates every block of a file several times depending on the
replication factor (3 by default). If any node goes down then
the data on that node is recovered, as a copy of the data
is available on other nodes due to replication
Hadoop Flavours
• Apache – Vanilla flavour, as the actual code is
residing in Apache repositories
• Cloudera – It is the most popular in the industry.
• Hortonworks – Popular distribution in the
industry (Azure used it)
• MapR – It has rewritten HDFS and its HDFS is
faster as compared to others.
• IBM – Its proprietary distribution is known as BigInsights.
Example 1 Hadoop usage: Log Analysis
• Logs that record data about the web pages that people visit and in
which order they visit them
• Example: a typical web based browsing and buying experience:
– You surf a website looking for an item to buy
– You click to read descriptions of a product that you like
– You add an item to your shopping cart and proceed to the checkout
– After seeing the cost of shipping you decide that the item is not
worth it and close the browser
– Every click you have made and then stopped has the potential to
offer valuable insight
• Ecommerce businesses need to understand the factors behind abandoned
shopping carts. When you perform deeper analysis on the clickstream
data, patterns emerge
Log Analysis - Continued
• Questions we may ask :
– Are certain products abandoned more than others?
– How much revenue can be recaptured if you
decrease cart abandonment by 10 percent?
• Hadoop has the ability to analyse log data, which can be
very large. Tools like Pig and Hive
allow basic log analysis to be done
• What would you do without Hadoop? Probably ignore
such log analysis!
Example 2 Hadoop usage: Risk Modeling
• More data allows you to “connect the dots” and produce better risk
prediction models
• Hadoop allows the data sets used in risk models to include underutilised
sources, or sources that were previously never utilised, e.g. social media,
email and instant messaging data.
• Hadoop is ideal for building and stress-testing risk models. These
operations are often computationally intensive and impractical to run
against a traditional Data Warehouse because:
– The warehouse probably is not optimised for the queries issued by the risk
model – Hadoop is not bound by the data models used in Data Warehouses
– A large ad hoc batch job such as a risk model would increase the load on the
warehouse, degrading its performance – Hadoop can assume this workload,
freeing up the warehouse for its regular reports
– Advanced risk models are likely to need unstructured data as input, and Hadoop can
handle that
Hadoop is a powerful tool (but not a
solution)
• Hadoop is not the complete solution. Although
risk management leverages the strengths of
Hadoop, Hadoop by itself does not solve these
issues
• Developers need to write code with an
understanding of the problem so that they utilise
Hadoop's strengths to solve the business problem.
– For example, it does not help in picking up unusual
patterns – it just allows large data to be
processed concurrently
What Hadoop can do
• The differentiator that Hadoop brings is that
you can now do the same things you could do
with a traditional database, but on a much
larger scale, and get better results.
• It allows you to manage petabytes of data,
which was previously unheard of in the database
world
[Slide: RDBMS vs Hadoop analogy – from Macy's]
Class work 1
• In the analogy in the previous slide, what else
would you add/change for RDBMS?
• What would you change for Hadoop?
Hadoop Distributed Filesystem (HDFS) scales out well
• Large HDFS deployments run on many tens of
thousands of machines with storage capacity of many
hundreds of Petabytes
• Viable because of the low cost of data storage and access on
HDFS using commodity hardware and open source
software
• The emphasis is on high throughput of data access
rather than low latency of data access
– Latency is the time required to perform an action, measured in
units of time, e.g. seconds
– Throughput is the number of such actions executed per unit time
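The throughput-over-latency trade-off can be made concrete with back-of-the-envelope arithmetic (the seek time and transfer rate below are assumed, illustrative figures, not HDFS benchmarks):

```python
# Illustrative numbers (assumptions, not measured HDFS figures):
seek_latency_s = 0.010      # one 10 ms disk seek per block read
transfer_rate_mb_s = 100    # sustained sequential transfer rate

def read_time_s(total_mb, block_mb):
    # total time = one seek per block + sequential transfer of all the data
    seeks = total_mb / block_mb
    return seeks * seek_latency_s + total_mb / transfer_rate_mb_s

# Reading 1 GB in 4 KB blocks vs 128 MB blocks:
small = read_time_s(1024, 4 / 1024)   # 4 KB blocks -> a seek per 4 KB
large = read_time_s(1024, 128)        # 128 MB blocks -> only 8 seeks
print(round(small, 1), round(large, 2))  # 2631.7 10.32
```

Large sequential reads pay the seek latency rarely, so almost all the time goes into useful data transfer – high throughput at the cost of per-action latency.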
HDFS Architecture
• HDFS has one NameNode and a series of DataNodes.
Application data is stored on the DataNodes and file
system metadata is stored on the NameNode.
• HDFS replicates the file content on multiple DataNodes
based on the replication factor to ensure
reliability of data.
• Hadoop brings the power of distributed processing to
where the data is stored. It allows massively
parallel processing in a distributed manner
HDFS – Distributed file system in more
detail
• HDFS provides a distributed architecture for extremely
large scale storage and can easily be extended by scaling
out as opposed to scale up
• Compared to storing files on your PC, Hadoop is
designed to work optimally with a small number of very
large files, e.g. 500MB
• HDFS has a Write Once, Read Often model of data access.
That means the contents of individual files cannot be
modified, but you can append new data to the end of a file.
• HDFS uses sequential access pattern – access data in
sequence thereby seek once (unlike random access)
HDFS –write once read often
• HDFS files you can do:
– Create a new file
– Append content to the end of a file
– Delete a file
– Rename a file
– Modify file attributes like owner
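The write-once, read-often contract can be sketched as a toy Python wrapper over a local file that exposes only the operations listed above (purely illustrative; this is not the HDFS API, and the class name is made up):

```python
import os

class WormFile:
    """Toy model of HDFS's write-once, read-often file semantics."""
    def __init__(self, path):
        self.path = path
        open(path, "x").close()          # create: fails if the file already exists

    def append(self, data):
        with open(self.path, "a") as f:  # append-only: no seek-and-overwrite
            f.write(data)

    def read(self):
        with open(self.path) as f:
            return f.read()

    def rename(self, new_path):
        os.rename(self.path, new_path)
        self.path = new_path

    def delete(self):
        os.remove(self.path)

f = WormFile("demo.txt")
f.append("hello ")
f.append("world")
print(f.read())   # hello world
f.delete()
```

Note there is deliberately no method to overwrite bytes in the middle of the file – that operation simply does not exist in the model.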
Hadoop Distributed File System (HDFS)
• Designed for large-scale data processing
– Allows storage of large data files > 100 TB in one file
• Runs on top of native file system on each node
(Linux or Windows)
• Data is stored in units known as blocks
– 128MB by default; can be larger
• Large datasets are stored in many blocks over
many nodes
– Enables dataset to be bigger than individual disks in
Hadoop cluster
• Blocks are replicated to multiple nodes
– Allows for node failure without loss of data
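The block mechanics above can be sketched in a few lines of Python (a toy model of how HDFS carves a file into fixed-size blocks; not real HDFS code):

```python
BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size in Hadoop 2.x

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (offset, length) of each block a file of file_size bytes needs."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be short
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file needs three blocks: 128 MB + 128 MB + 44 MB
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks), [b[1] // (1024 * 1024) for b in blocks])  # 3 [128, 128, 44]
```

Each of those blocks can then be placed on a different DataNode and replicated independently.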
HDFS default block
size 128MB
HDFS – has a Master/Slave Architecture
A file in an HDFS namespace is split into several blocks and those blocks
are stored in a set of DataNodes. The NameNode determines the
mapping of blocks to the DataNodes.
The DataNodes take care of read and write operations with the file
system. DataNodes manage the storage attached to the nodes
on which they run, serving client read and write requests. They also
take care of block creation, deletion and replication based on
instructions given by the NameNode.
HDFS
• When you store a file in HDFS, the system breaks it down
into a set of individual blocks and stores these blocks in
various DataNodes (slave nodes) in the Hadoop cluster
• HDFS does not know what is stored inside the file, so raw
files are not split in accordance with rules that we humans
would understand
• The concept of storing a file as a collection of blocks is
entirely consistent with how file systems normally work but
not the scale.
– A typical block size in a file system under Linux is 4KB, whereas a
typical block size in Hadoop is 128MB (in Hadoop 2.X) - value is
configurable & can be customised
HDFS Large Block Size – why?
• Hadoop was designed to store data at the petabyte scale with
no scale out bottlenecks
• The large block size helps to store big data because:
– Each data block stored in HDFS has its own metadata and needs to be
tracked by a central server so that applications needing to access a
specific file can be directed to wherever all the file’s blocks are stored.
If block sizes were in the KB range, even modest volumes of data in the
terabytes would overwhelm the metadata server, with too many
blocks to track.
– HDFS is designed to enable high throughput so that the parallel
processing of this large data can happen. Having too large a block size
(tens of GB) may make parallel processing slow and inefficient.
• With fewer blocks, the file will be stored on fewer nodes, which reduces the degree
of parallel processing
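The first point is easy to check with arithmetic – a sketch of the block counts the NameNode would have to track for 1 TB of data at the two block sizes mentioned above:

```python
TB = 1024 ** 4  # one terabyte in bytes

def block_count(data_bytes, block_bytes):
    # Number of block records the NameNode must track (ceiling division)
    return -(-data_bytes // block_bytes)

print(block_count(TB, 4 * 1024))            # 4 KB blocks:  268435456 entries
print(block_count(TB, 128 * 1024 * 1024))   # 128 MB blocks: 8192 entries
```

At 4 KB blocks, a single terabyte already means over a quarter of a billion metadata entries; at 128 MB it is a few thousand, which a single NameNode can comfortably hold in memory.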
HDFS – Replicating data blocks
• HDFS assumes that every disk drive and every
slave node is unreliable
• Care must be taken in choosing where the
three copies of the data blocks are stored
• Data blocks are striped across the Hadoop
cluster – they are evenly distributed between
the DataNodes (slave nodes) so a copy of the
block will still be available regardless of disk,
node, or rack failures
HDFS
• The NameNode (or master) is a high-end machine
whereas the DataNodes (slaves) are inexpensive
computers
• The Big Data files get divided into a number
of blocks. Hadoop stores these blocks in a
distributed fashion on the cluster of DataNodes
(slave nodes).
• The metadata is stored on the NameNode (master).
HDFS has 2 daemon processes running:
• NameNode: The NameNode performs the following
functions:
– The NameNode daemon runs on the master machine.
– It is responsible for maintaining, monitoring and
managing the DataNodes
– It records the metadata of files, like the location of
blocks, file size, permissions, hierarchy, etc.
– The NameNode captures all changes to the metadata, like
deletion, creation and renaming of files, in edit logs.
– It regularly receives heartbeats and block reports from
the DataNodes.
HDFS Daemon processes
• DataNode: The various functions of DataNode
are as follows:
– DataNode runs on the slave machine.
– It stores the actual data
– It serves the read/write request from the user
– DataNode does the ground work of creating,
replicating and deleting the blocks on the command
of NameNode.
– Every 3 seconds it sends a heartbeat to the
NameNode reporting the health of HDFS
Disk failures are inevitable
• If a disk fails the cluster continues functioning, but performance
degrades because processing resources are lost. The system is
still online and all data is still available.
• If a disk fails, the NameNode (the central metadata server for
HDFS) finds the file blocks that were stored on it, thereby
detects that those blocks are under-replicated, and creates a new copy
of each failed block
• When the failed disk comes back online there are now four
copies of those file blocks – they are over-replicated. As with
under-replicated blocks, the NameNode will detect this and will
order one copy of every affected block to be deleted
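That re-replication logic can be sketched as a toy model of the NameNode's decision (the function, block IDs and node names are all made up for illustration; this is not real HDFS code):

```python
REPLICATION_FACTOR = 3

def replication_actions(block_locations, live_nodes):
    """For each block, decide whether the NameNode should copy or delete replicas.

    block_locations: dict of block id -> list of node names holding a replica.
    """
    actions = {}
    for block, nodes in block_locations.items():
        live = [n for n in nodes if n in live_nodes]   # replicas still reachable
        if len(live) < REPLICATION_FACTOR:
            actions[block] = ("copy", REPLICATION_FACTOR - len(live))
        elif len(live) > REPLICATION_FACTOR:
            actions[block] = ("delete", len(live) - REPLICATION_FACTOR)
    return actions

locations = {"blk_1": ["n1", "n2", "n3"],          # healthy
             "blk_2": ["n1", "n2", "n4"],          # n4 is down -> under-replicated
             "blk_3": ["n1", "n2", "n3", "n5"]}    # extra copy -> over-replicated
print(replication_actions(locations, live_nodes={"n1", "n2", "n3", "n5"}))
# {'blk_2': ('copy', 1), 'blk_3': ('delete', 1)}
```

The real NameNode learns which replicas are live from the DataNodes' heartbeats and block reports, then issues exactly these kinds of copy and delete commands.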
HDFS is based on a share-nothing principle
• These instructions use the Cloudera QuickStart VM which runs inside the VirtualBox virtual environment,
which needs to be installed first.
• The Cloudera QuickStart VM is a single-node cluster making it easy to quickly get hands on experience with
Hadoop, Hive and Pig.
• Note that the Cloudera QuickStart VM archive is a large file in excess of 5GB, so it might make sense to
download it overnight (you might need to disable your computer's sleep function).
Video
• https://www.youtube.com/watch?v=DCaiZq3aBSc&t=582s