
Big Data – Introduction to Hadoop
Lecture 2
Big Data Challenges
• Need to store lots of data – move from single shared storage to distributed storage
• Need faster processing – move from serial processing to parallel processing
• Need to deal with unstructured data too
Hadoop
• Hadoop is a framework for storing data on large
clusters of commodity hardware - everyday hardware
that is affordable and easily available - and running
applications against that data
• A cluster is a group of interconnected computers,
known as nodes, that work together on the same
problem
• Using networks of affordable compute resources to acquire business insight is what Hadoop is about
Hadoop is a solution to the Big Data
problem
• Hadoop is technology to store massive datasets on a
cluster of cheap commodity machines in a distributed
manner
• It provides Big Data analytics through distributed
computing framework
• Hadoop is an open source software platform used for processing large volumes of data at massive scale. It also handles unstructured data
• Hadoop architecture has 2 core components: storage, known as the Hadoop Distributed File System (HDFS), and computation, using MapReduce (MR)
Hadoop Ecosystem
Hadoop ecosystem – Apache Hive
• Open source system for querying and analysing large
datasets stored in Hadoop (HDFS) files.
• Hive has 3 main functions: Data summarisation, Query,
and Analysis
• Uses an SQL-type language called HiveQL (HQL). Hive automatically translates these SQL-like queries into MapReduce jobs (parallel processing) which execute on Hadoop
• HQL is a declarative language and operates on the server side
• Developed by Facebook
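
To make this concrete, here is a minimal sketch of running a HiveQL query from Java through Hive's JDBC interface (HiveServer2). The connection URL, credentials and the page_views table are assumptions made only for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver, shipped with Hive; HiveServer2 assumed on localhost:10000
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // HiveQL looks like SQL; 'page_views' is a hypothetical table
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}

Behind the scenes Hive compiles the GROUP BY into one or more MapReduce jobs that run on the cluster.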
Hadoop ecosystem - Apache Pig
• Apache Pig is a high-level procedural language platform for analysing and querying huge datasets stored in HDFS
• Pig uses PigLatin language.
• It operates on the client side
• Developed by Yahoo
• Instead of writing Java code to implement MapReduce you can choose either Pig Latin or HiveQL to construct MapReduce programs
– The benefit of coding in Pig or Hive is that it takes far fewer lines of code and reduces the overall development and testing time
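
As an illustration, a similar aggregation can be written in Pig Latin; the sketch below embeds a few Pig Latin statements in Java via the PigServer API (the HDFS paths and the field layout of the input are hypothetical).

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Run the Pig Latin against the cluster rather than in local mode
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        // Hypothetical tab-separated log file in HDFS
        pig.registerQuery("logs = LOAD '/data/access_logs' USING PigStorage('\\t') "
                + "AS (user:chararray, url:chararray);");
        pig.registerQuery("grouped = GROUP logs BY url;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(logs);");
        // STORE triggers the underlying MapReduce job(s)
        pig.store("counts", "/data/url_counts");
    }
}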
Hadoop ecosystem – Apache HBase
• Hadoop performs batch processing and data is accessed sequentially. One has to scan the entire dataset even for the simplest of jobs – very slow!
• HBase provides random, real-time read/write access to data in HDFS
• It is a distributed database designed to store billions of rows and millions of columns
• HBase is a scalable, distributed NoSQL database built on top of HDFS
• HBase provides real-time access to read or write data in HDFS
• Its origin is Google's Bigtable
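
A minimal sketch of this random read/write access using the standard HBase Java client; the user_profiles table, the info column family and the row key are hypothetical and assumed to exist already.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profiles"))) {
            // Random write: one row keyed by user id
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("London"));
            table.put(put);

            // Random read of the same row, with no scan of the whole dataset
            Result row = table.get(new Get(Bytes.toBytes("user42")));
            System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
        }
    }
}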
Hadoop Ecosystem
• Apache Sqoop imports data from external sources into the Hadoop ecosystem, and also exports data from Hadoop back out to external destinations. Sqoop works with RDBMSs such as Oracle and MySQL
• Apache Flume efficiently collects, aggregates and moves large amounts of data from its origin into HDFS. It is a fault-tolerant and reliable mechanism, used for streaming logs and event-based data
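
Sqoop is normally driven from the command line; the sketch below simply launches a typical sqoop import from Java, assuming Sqoop is installed on the machine and using a hypothetical MySQL connection string, table and target directory.

import java.util.Arrays;
import java.util.List;

public class SqoopImportLauncher {
    public static void main(String[] args) throws Exception {
        // A typical 'sqoop import': copy the MySQL table 'orders' into HDFS
        List<String> cmd = Arrays.asList(
            "sqoop", "import",
            "--connect", "jdbc:mysql://dbhost:3306/sales",
            "--username", "etl", "--password-file", "/user/etl/.pw",
            "--table", "orders",
            "--target-dir", "/data/orders");
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        System.exit(p.waitFor());
    }
}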
Hadoop Architecture
• Hadoop Common: the Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop
• Hadoop YARN: This is a framework for job scheduling and
cluster resource management
• Hadoop Distributed File System (HDFS): A distributed file
system that provides high throughput access to application
data
• Hadoop MapReduce: a YARN-based system for parallel processing of large data sets
Hadoop Architecture
Hadoop - a reliable shared data storage and data analysis system
• Hadoop consists of 2 main components:
– a distributed file system - the Hadoop Distributed File System (HDFS)
– a distributed processing framework - MapReduce (supported by a component called YARN)
• An application running on Hadoop gets its work divided among the nodes (machines) in the cluster, while HDFS stores the data that will be processed
• A Hadoop cluster can span 1000s of machines, where HDFS stores the data and MapReduce jobs do their processing near the data, which reduces I/O
Hadoop components:
– Storage: HDFS – responsible for storing data in blocks
– Compute: MapReduce/YARN – responsible for massively parallel processing
Hadoop
• A Hadoop cluster is a form of compute cluster used mainly for computational purposes. It consists of many computers (compute nodes) that share computational workloads and take advantage of a very large aggregate bandwidth across the cluster
• Hadoop clusters typically consist of a few master nodes (NameNodes) which control the storage and processing systems in Hadoop, and many slave nodes (DataNodes) which store all the cluster data and are also where the data gets processed
• Hadoop runs on a group of networked computers
• In Hadoop terminology each computer is a node
• The whole group is referred to as a cluster
Key to Hadoop’s Power
• Process data in parallel across thousands of ‘commodity’ hardware nodes
– Self-healing: failure is handled by software
• Designed for one write, multiple reads
– Optimised for minimum seek on hard drives
– Cannot update data once written
• Data and data processing tightly integrated
– Designed to work together
What can Hadoop do?
• Explore full datasets
• Mine larger datasets
– Too large for conventional memory based processing
• Large scale data preparation
– Cleaning, pre-processing
• Accelerate data driven innovation
– RDBMS uses ‘schema on write’
• ER diagrams, 3rd Normal form…
– Hadoop uses ‘schema on read’
• Dump the data on the HDFS (Data Lakes but not Data
Swamps!)
Explore Large Datasets Directly with Hadoop
[Diagram: an iterative analysis cycle – Acquire, Clean data, Model, Measure/Evaluate, Visualize – run from a researcher laptop (R, Matlab, SAS, etc.) against the full dataset stored on Hadoop]
Hadoop Architecture comprises three
major layers
• They are:
– Data storage: HDFS (Hadoop Distributed File System)
– Resource management: YARN
– Compute: MapReduce
Why is Hadoop popular?
• Apache Hadoop is a platform for data storage as well as processing. It is scalable (add more nodes on the fly) and fault-tolerant (even if a node crashes, its data is processed by another node).
• The following make it a powerful platform:
– Flexibility to store and mine any type of data, whether it is unstructured, semi-structured or structured. It is not schema-on-write and is not bound by a single schema
– Excels at processing complex data. It scales out horizontally and divides workloads across many nodes
– Scales economically as it can be deployed on commodity hardware. Also, its open source nature guards against vendor lock-in
Hadoop Features
• 1. Reliability
If any node goes down it will not disable the whole cluster - another node will take the place of the failed node. The Hadoop cluster will continue functioning as if nothing has happened – it has a built-in fault tolerance feature
• 2. Scalable
Keep adding commodity servers to scale out horizontally. If you are installing Hadoop on the cloud it is even simpler, as you can easily procure more hardware and expand the Hadoop cluster within minutes
• 3. Economical
Hadoop gets deployed on commodity hardware, which is cheap. Moreover, Hadoop is open source software so there is no licence cost
Hadoop Features (Cont)
• 4. Distributed Processing
Any job submitted by the client gets divided into a number of subtasks and these subtasks are independent of each other. They execute in parallel, giving high throughput.
• 5. Distributed Storage
Splits each file into a number of blocks. These blocks get distributed over the cluster of machines.
• 6. Fault Tolerance
Replicates every block of a file multiple times depending on the replication factor (3 by default). If any node goes down then the data on that node can be recovered, as a copy of the data is available on other nodes due to replication
Hadoop Flavours
• Apache – the vanilla flavour, as the actual code resides in the Apache repositories
• Cloudera – the most popular in the industry
• Hortonworks – a popular distribution in the industry (Azure used it)
• MapR – has rewritten HDFS and its HDFS is faster compared to the others
• IBM – its proprietary distribution is known as BigInsights
Example 1 Hadoop usage: Log Analysis
• Logs that record data about the web pages that people visit and in
which order they visit them
• Example: a typical web based browsing and buying experience:
– You surf a website looking for an item to buy
– You click to read descriptions of a product that you like
– You add an item to your shopping cart and proceed to the checkout
– After seeing the cost of shipping you decide that the item is not
worth it and close the browser
– Every click you have made and then stopped has the potential to
offer valuable insight
• Ecommerce businesses need to understand the factors behind abandoned shopping carts. When you perform deeper analysis on the clickstream data, patterns emerge
Log Analysis - Continued
• Questions we may ask :
– Are certain products abandoned more than others?
– How much revenue can be recaptured if you
decrease cart abandonment by 10 percent?
• Hadoop has the ability to analyse log data even when the data is very large. Tools like Pig and Hive allow basic log analysis to be done
• What would you do without Hadoop? Probably ignore such log analysis!
Example 2 Hadoop usage: Risk Modeling
• More data allows you to “connect the dots” and produce better risk
prediction models
• Hadoop allows the data sets used in risk models to include underutilised or previously unused sources, e.g. social media, email, instant messaging, etc.
• Hadoop is ideal for building and stress-testing risk models. These operations are often computationally intensive and impractical to run against a traditional Data Warehouse because:
– The warehouse probably is not optimised for the queries issued by the risk model - Hadoop is not bound by the data models used in Data Warehouses
– A large ad hoc batch job such as a risk model would increase the load on the warehouse, degrading its performance - Hadoop can assume this workload, freeing up the warehouse for its regular reports
– Advanced risk models are likely to need unstructured data as input and Hadoop can handle that
Hadoop is a powerful tool (but not a
solution)
• Hadoop is not the complete solution. Although risk management can leverage the strengths of Hadoop, Hadoop by itself does not solve these issues
• Developers need to write code with an understanding of the problem so that they utilise Hadoop’s strengths to solve the business problem
– For example, it does not help in picking up unusual patterns – it just allows large data to be processed concurrently
What Hadoop can do
• The differentiator that Hadoop brings is that you can now do the same things you could do with a traditional database, but at a much larger scale, and get better results
• It allows you to manage petabytes of data, which was previously unheard of in the database world
From Macy’s
Class work 1
• In the analogy in the previous slide, what else
would you add/change for RDBMS?
• What would you change for Hadoop?
Hadoop Distributed Filesystem (HDFS) scales out well
• Large HDFS deployments run on many tens of
thousands of machines with storage capacity of many
hundreds of Petabytes
• Viable because of the low cost of data storage and access on HDFS using commodity hardware and open source software
• The emphasis is on high throughput of data access
rather than low latency of data access
– Latency is the time required to perform an action, measured in units of time, e.g. seconds
– Throughput is the number of such actions executed per unit of time
HDFS Architecture
• HDFS has one NameNode and a series of DataNodes. Application data is stored on the DataNodes and file system metadata is stored on the NameNode.
• HDFS replicates the file content on multiple DataNodes based on the replication factor to ensure reliability of data.
• Hadoop brings the power of distributed processing to where the data is stored. It allows massive parallel processing in a distributed manner
HDFS – Distributed file system in more
detail
• HDFS provides a distributed architecture for extremely
large scale storage and can easily be extended by scaling
out as opposed to scale up
• Compared to storing files on your PC, Hadoop is designed to work optimally with a small number of very large files, e.g. 500MB or more
• HDFS has a Write Once, Read Often model of data access. That means the contents of individual files cannot be modified – you can only append new data to the end of the file.
• HDFS uses a sequential access pattern – data is accessed in sequence so the disk seeks only once (unlike random access)
HDFS – write once, read often
• With HDFS files you can (see the sketch after this list):
– Create a new file
– Append content to the end of a file
– Delete a file
– Rename a file
– Modify file attributes like owner
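
A minimal sketch of these operations using the HDFS Java FileSystem API; the /data paths are hypothetical and the cluster configuration (core-site.xml/hdfs-site.xml) is assumed to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/events.log"); // hypothetical path

        // Create a new file and write some content
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("first record\n");
        }

        // Append to the end of the existing file; content already written cannot be changed
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("second record\n");
        }

        // Rename and delete are metadata operations handled by the NameNode
        fs.rename(file, new Path("/data/events-archived.log"));
        fs.delete(new Path("/data/events-archived.log"), false); // false = not recursive

        fs.close();
    }
}

Note that append is the only way to add data to an existing file; rewriting a block in place is not supported.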
Hadoop Distributed File System (HDFS)
• Designed for large-scale data processing
– Allows storage of large data files > 100 TB in one file
• Runs on top of native file system on each node
(Linux or Windows)
• Data is stored in units known as blocks
– 128MB (default) can be larger
• Large datasets are stored in many blocks over
many nodes
– Enables dataset to be bigger than individual disks in
Hadoop cluster
• Blocks are replicated to multiple nodes
– Allows for node failure without loss of data
HDFS default block size: 128MB
HDFS – has a Master/Slave Architecture
• HDFS uses a master/slave architecture where the master consists of a single NameNode that manages the file system metadata, and one or more slave DataNodes store the actual data.
• A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of DataNodes. The NameNode determines the mapping of blocks to DataNodes.
• The DataNodes take care of read and write operations with the file system. DataNodes manage the storage attached to the nodes on which they run, serving client read and write requests. They also take care of block creation, deletion and replication based on instructions given by the NameNode.
HDFS
• When you store a file in HDFS, the system breaks it down
into a set of individual blocks and stores these blocks in
various DataNodes (slave nodes) in the Hadoop cluster
• HDFS does not know what is stored inside the file, so raw
files are not split in accordance with rules that we humans
would understand
• The concept of storing a file as a collection of blocks is entirely consistent with how file systems normally work; what differs is the scale.
– A typical block size in a file system under Linux is 4KB, whereas a typical block size in Hadoop is 128MB (in Hadoop 2.x) - the value is configurable and can be customised
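
As a sketch of how the block size and replication factor can be customised per client or per file through the Java API; the 256MB value and the file path are purely illustrative, and cluster-wide defaults normally live in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side overrides of the cluster defaults (values only illustrative)
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);   // 256 MB blocks
        conf.setInt("dfs.replication", 3);                   // default replication factor

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/big-file.dat");          // hypothetical path

        // create() also has an overload taking buffer size, replication and block size directly
        try (FSDataOutputStream out = fs.create(
                file, true, 4096, (short) 3, 256L * 1024 * 1024)) {
            out.writeBytes("payload\n");
        }
        System.out.println("Block size used: "
                + fs.getFileStatus(file).getBlockSize() / (1024 * 1024) + " MB");
        fs.close();
    }
}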
HDFS Large Block Size – why?
• Hadoop was designed to store data at the petabyte scale with
no scale out bottlenecks
• The large block size helps when storing big data because:
– Each data block stored in HDFS has its own metadata and needs to be tracked by a central server so that applications needing to access a specific file can be directed to wherever all the file’s blocks are stored. If the block size were in the KB range, even modest volumes of data in the terabytes would be too much for the metadata server, with too many blocks to track.
– HDFS is designed to enable high throughput so that parallel processing of this large data can happen. Having too large a block size (tens of GB) may make parallel processing slow and inefficient.
• With fewer blocks, the file is stored on fewer nodes and this reduces the degree of parallel processing
HDFS – Replicating data blocks
• HDFS assumes that every disk drive and every
slave node is unreliable
• Care must be taken in choosing where the
three copies of the data blocks are stored
• Data blocks are striped across the Hadoop cluster - they are evenly distributed among the DataNodes (slave nodes) so a copy of the block will still be available regardless of disk, node, or rack failures
HDFS
• The NameNode (or master) is a high-end machine whereas the DataNodes (slaves) are inexpensive computers
• Big Data files get divided into a number of blocks. Hadoop stores these blocks in a distributed fashion on the cluster of DataNodes (slave nodes).
• On the NameNode (master), the metadata is stored.
HDFS has 2 daemon processes running:
• NameNode: the NameNode performs the following functions:
– The NameNode daemon runs on the master machine.
– It is responsible for maintaining, monitoring and managing the DataNodes
– It records the metadata of the files, like the location of blocks, file size, permissions, hierarchy etc.
– The NameNode captures all changes to the metadata, such as deletion, creation and renaming of files, in edit logs.
– It regularly receives heartbeats and block reports from the DataNodes.
HDFS Daemon processes
• DataNode: The various functions of DataNode
are as follows:
– DataNode runs on the slave machine.
– It stores the actual data
– It serves the read/write request from the user
– DataNode does the ground work of creating,
replicating and deleting the blocks on the command
of NameNode.
– Every 3 seconds it sends a heartbeat to the NameNode reporting the health of HDFS
Disk failures are inevitable
• If a disk fails the cluster continues functioning but performance degrades because processing resources are lost. The system is still online and all data is still available.
• If a disk fails, the NameNode (the central metadata server for HDFS) finds out which file blocks were stored on it, detects that those blocks are under-replicated, and creates new copies of the failed blocks
• When the failed disk comes back online HDFS again checks that there are three copies of all the file blocks. But now there are 4 copies and the blocks are over-replicated - as with under-replicated blocks, the NameNode will detect this and will order one copy of each over-replicated block to be deleted
HDFS is based on the shared-nothing principle
• Unlike the shared-disk approach, e.g. Network Attached Storage (NAS) and Storage Area Network (SAN) architectures
• Shared-disk storage is implemented by a centralised storage system with custom hardware and special network infrastructure
• The shared-nothing approach does not require any special (expensive) hardware, only commodity computers connected by a conventional datacentre network
A basic Analogy for parallel processing
• Suppose you have to build a house. One person working alone is expected to complete it in a year
• If we introduce 12 people to do that job, it will be constructed faster. Can they complete it in a month or two?
• They will work in parallel and the task will get completed faster
• Hadoop gives the ability to perform distributed storage
and distributed processing of very large data sets on
computer clusters built from commodity hardware
What is MapReduce?
• Built on the concept of divide and conquer by using
distributed computing and parallel processing
• Faster to break massive tasks into smaller chunks,
allocate them to multiple servers and process them in
parallel
• Programming logic is placed as close to the data as
possible. This technique works with both structured and
unstructured data
• With the distributed file system (HDFS) the files are already divided into bite-sized pieces. MapReduce is what you use to process all the pieces.
MapReduce overview
• In the Map phase of MapReduce, records from
a large data source are divided up and
processed across as many servers as possible
in parallel to produce intermediate values
• After Map processing is completed the intermediate results are collected and combined, or reduced, into a final value
MapReduce advantages
• Parallel Processing: by dividing the job among multiple nodes, each node works on a part of the job simultaneously, and therefore the time taken to process the data is reduced significantly
• Data Locality: instead of moving data to the processing unit, we move the processing unit to the data
– In the traditional system, we used to bring the data to the processing unit and process it. As data grew very large, bringing this huge amount of data to the processing unit caused the following issues:
• Moving huge data to processing is costly and deteriorates the network
performance.
• Processing takes time as the data is processed by a single unit which becomes
the bottleneck.
Map Reduce
• MapReduce processes a sequence of operations on distributed data sets. The
data consists of key-value pairs, and the computations have only two phases: a
map phase and a reduce phase. User-defined MapReduce jobs run on the
compute nodes in the cluster.
• MapReduce consists of the following steps (see the word-count sketch after this list):
– 1 During the Map phase, input data is split into a large number of fragments, each of
which is assigned to a map task.
– 2 These map tasks are distributed across the cluster.
– 3 Each map task processes the key-value pairs from its assigned fragment and
produces a set of intermediate key-value pairs.
– 4 The intermediate data set is sorted by key, and the sorted data is partitioned into a
number of fragments that matches the number of reduce tasks.
– 5 During the Reduce phase, each reduce task processes the data fragment that was
assigned to it and produces an output key-value pair.
– 6 These reduce tasks are also distributed across the cluster and write their output to
HDFS when finished.
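
The classic word-count program illustrates all of these steps; below is a minimal sketch using the standard org.apache.hadoop.mapreduce API, with the input and output HDFS paths taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input line is split into words, emitting (word, 1) pairs
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: intermediate pairs arrive grouped and sorted by key (the word)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it is submitted with: hadoop jar wordcount.jar WordCount /input /output (the jar name and paths are illustrative); YARN then schedules the map and reduce tasks on nodes close to the input blocks.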
YARN (short for Yet Another Resource Negotiator)
• YARN provides generic scheduling and
resource management services so that you
can run more than just Map Reduce
applications on your Hadoop cluster. The
JobTracker/TaskTracker architecture could only
run MapReduce.
Class Work 2
• What are the advantages of Hadoop?
• What are the Disadvantages of Hadoop?
Advantages of Hadoop
• Varied Data Sources
– Supports a variety of structured and unstructured data, e.g. email, social media etc. It can accept data in text files, XML files, images, CSV files etc.
• Cost effective
– Uses commodity hardware to store data, which is cheap
• Performance
– With its distributed processing and storage architecture it processes huge amounts of data with high throughput. It divides the input data file into a number of blocks and stores data in these blocks over several nodes. It also divides the task that the user submits into various sub-tasks and executes them in parallel
Advantages of Hadoop (Cont)
• Highly Available
– In Hadoop 2.x, the HDFS architecture has a single active NameNode and a single standby NameNode, so if a NameNode goes down we have the standby NameNode to count on. Hadoop 3.0 supports multiple standby NameNodes, making the system even more highly available as it can continue working even if two or more NameNodes crash
• Low Network Traffic
– Each job submitted by the user is split into a number of independent sub-tasks and these sub-tasks are assigned to the data nodes, thereby moving a small amount of code to the data rather than moving huge data to the code, which leads to low network traffic
• High Throughput
– Throughput means work done per unit time. A given job gets divided into small jobs which work on chunks of data in parallel, thereby giving high throughput.
Hadoop Advantages (cont)
• Open Source
– Hadoop is open source technology i.e. its source code is freely available.
• Scalable
– Hadoop works on the principle of horizontal scalability (scale out rather than scale up)
• Fault Tolerant
– In Hadoop 3.0 fault tolerance is provided by erasure coding. For example, 6 data blocks produce 3 parity blocks using the erasure coding technique, so HDFS stores a total of 9 blocks. In the event of a node failure the affected data blocks can be recovered by using these parity blocks and the remaining data blocks. (For comparison, 3x replication of the same 6 blocks would store 18 block copies, so erasure coding halves the total storage while still tolerating multiple failures.)
• Emerging Big Data technologies such as Spark and Flink are compatible with Hadoop. Their processing engines work over Hadoop as a backend - Hadoop is the data storage platform for them.
Disadvantages of Hadoop
• Hadoop is not good with Small Files
– Hadoop is suitable for a small number of large files, not for a large number of small files. A small file is simply a file which is significantly smaller than Hadoop’s block size (128MB). A large number of small files overloads the NameNode, as it stores the namespace for the system, and makes it difficult for Hadoop to function.
• Processing Overhead
– Data is read from disk and written to disk, which makes read/write operations very expensive when dealing with Big Data (e.g. petabytes of data). Hadoop cannot do in-memory calculations and so incurs processing overhead
Disadvantages of Hadoop
• Hadoop supports only Batch Processing
– Hadoop is a batch processing system, which is inefficient for stream processing and real-time requirements with low latency. It only works on data which we collect and store in a file in advance, before processing.
• Security
– For security, Hadoop uses Kerberos authentication, which is hard to manage, and it does not have encryption at the storage and network levels
• Vulnerable
– Hadoop is written in Java, which can be exploited by cyber criminals, making it vulnerable to security breaches.
Installing Hadoop on your PC
• Hadoop is designed to be deployed on a large cluster of networked computers with master nodes and slave nodes. However, you can run Hadoop on a single computer to help learn the basics of Hadoop
• Hadoop has two deployment modes: pseudo-distributed mode and fully distributed mode
• To download on your PC: https://www.cloudera.com/downloads/quickstart_vms/5-13.html
• Useful tips: https://community.cloudera.com/t5/Community-Articles/How-to-setup-Cloudera-Quickstart-Virtual-Machine/ta-p/35056
Installing Hadoop On Your Personal Machine

• These instructions use the Cloudera QuickStart VM which runs inside the VirtualBox virtual environment,
which needs to be installed first.

•  Download and install VirtualBox on your machine: https://www.virtualbox.org/wiki/Downloads


•  Download the Cloudera Quickstart VM (it is free but you have to register). Select VirtualBox as the
platform https://www.cloudera.com/downloads/quickstart_vms/5-12.html
•  Uncompress the VM archive (7-Zip is recommended)
•  Start VirtualBox and click Import Appliance in the File dropdown menu. Click the folder icon beside the
location field. Browse to the uncompressed archive folder, select the .ovf file, and click the Open button.
Click the Continue button. Click the Import button.
•  Your virtual machine should now appear in the left column. Select it and click on Start to launch it.
•  To verify that the VM is running and you can access it, open a browser to the URL: http://localhost:8088.
You should see the resource manager UI. The VM uses port forwarding for the common Hadoop ports, so
when the VM is running, those ports on localhost will redirect to the VM.

• The Cloudera QuickStart VM is a single-node cluster making it easy to quickly get hands on experience with
Hadoop, Hive and Pig.

• Note that the Cloudera Quickstart VM archive is a large file in excess of 5GB, so it might make sense to download it overnight (you might need to disable your computer's sleep function).
Video
• https://www.youtube.com/watch?v=DCaiZq3aBSc&t=582s
