HADOOP


Big Data

Challenges & Solutions


Powered By CardinalTS
Author - Vivek Garg
Facts About Big Data
Walmart handles more than 1 million customer transactions every hour.

Facebook handles 40 billion photos from its user base.

Every minute, up to 300 hours of video are uploaded to YouTube alone.

By 2020, we will have over 6.1 billion smartphone users globally.
WHAT COMES UNDER BIG DATA?
Big data involves the data produced by different devices and applications. Given
below are some of the fields that come under the umbrella of Big Data.

❖ Black Box Data

❖ Social Media Data

❖ Stock Exchange Data

❖ Search Engine Data and so on...


Traditional Vs Big Data
V Attributes of Big Data
Big Data is characterized by three "V" attributes:

Velocity, Volume, and Variety


Big Data Challenges -

The major challenges associated with big data are as follows:


Capturing data
Storage
Searching
Sharing
Transfer
Analysis
Presentation
Big Data Solutions -

Traditional Approach - In this approach, an enterprise has a single computer to store and
process big data. Data is stored in an RDBMS like Oracle Database, MS SQL Server, or DB2,
and sophisticated software is written to interact with the database, process the required
data, and present it to the users for analysis purposes.
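As a rough illustration of this traditional setup (not taken from the slides; the connection URL, credentials, and the transactions table are hypothetical), a reporting program might query the RDBMS directly over JDBC:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TraditionalReport {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection URL and table; any RDBMS (Oracle, MS SQL Server, DB2) works the same way.
        String url = "jdbc:oracle:thin:@//dbhost:1521/SALES";
        try (Connection conn = DriverManager.getConnection(url, "report_user", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT store_id, SUM(amount) AS total FROM transactions GROUP BY store_id")) {
            while (rs.next()) {
                System.out.println(rs.getString("store_id") + " -> " + rs.getDouble("total"));
            }
        }
    }
}

This works well until the volume or velocity of the data exceeds what a single database server can store and process, which is the problem the next approach addresses.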

Google’s Solution - Google solved this problem using an algorithm called MapReduce.
This algorithm divides the task into small parts and assigns those parts to many computers
connected over the network, and collects the results to form the final result dataset.
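The canonical example of this split-and-collect pattern is the Hadoop word count job (it reappears later in the "MapReduce Flow Chart - Word Count Job" slide). A minimal mapper and reducer, closely following the standard Hadoop MapReduce tutorial example, might look like this:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: runs on many nodes in parallel, each over one split of the input.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);          // emit (word, 1)
            }
        }
    }

    // Reduce phase: collects all counts for the same word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);            // emit (word, total)
        }
    }
}

The map tasks run in parallel on many nodes, each emitting (word, 1) pairs for its slice of the input; the framework groups the pairs by word, and the reduce tasks sum them into the final result dataset. A driver that submits this job is sketched after the MapReduce flow-chart slides.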
Hadoop - Introduction

Apache™ Hadoop® is an open source software project that enables distributed processing of
large structured, semi-structured, and unstructured data sets across clusters of commodity
servers. It is designed to scale up from a single server to thousands of machines, with a
very high degree of fault tolerance. (IBM)

Hadoop is an open-source software framework for storing data and running applications on
clusters of commodity hardware. It provides massive storage for any kind of data, enormous
processing power, and the ability to handle virtually limitless concurrent tasks or jobs. (SAS)
Hadoop: Very High-Level Overview
Core Hadoop Concepts
★Applications are written in high-level code

★Nodes talk to each other as little as possible

★Data is distributed in advance

-Bring the computation to the data

★Data is replicated for increased availability and reliability

★Hadoop is scalable and fault-tolerant


Scalability
Adding nodes adds capacity proportionally.

Increasing load results in a graceful decline in performance

- not a failure of the system


Fault Tolerance
Node failure is inevitable

What happens?

– System continues to function

– Master re-assigns tasks to a different node

– Data replication = no loss of data

– Nodes which recover rejoin the cluster automatically


Hadoop Architecture
Hadoop Architecture -

The Hadoop framework includes the following four modules:

Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These
libraries provide filesystem and OS-level abstractions and contain the necessary Java files and
scripts required to start Hadoop.
Hadoop YARN (Yet Another Resource Negotiator): This is a framework for job scheduling and cluster
resource management.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput
access to application data.
Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.
The following diagram depicts these four components of the Hadoop framework.
Hadoop Operation Modes
Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of the
three supported modes:

Local/Standalone Mode : After downloading Hadoop to your system, it is configured in
standalone mode by default and runs as a single Java process.

Pseudo-Distributed Mode : A distributed simulation on a single machine. Each Hadoop
daemon, such as HDFS, YARN, and MapReduce, runs as a separate Java process. This
mode is useful for development.

Fully Distributed Mode : This mode is fully distributed, with a minimum of two or more
machines forming a cluster. We will cover this mode in detail in the coming chapters.
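The pseudo-distributed settings normally go into core-site.xml and hdfs-site.xml. As a minimal sketch (the hdfs://localhost:9000 address is the usual single-node default from the Hadoop setup guide, not something specified in these slides), the same two properties can also be set programmatically on a client Configuration:

import org.apache.hadoop.conf.Configuration;

public class PseudoDistributedConf {
    // Returns a client Configuration equivalent to the usual single-node setup.
    public static Configuration create() {
        Configuration conf = new Configuration();
        // Normally in core-site.xml: the address of the single local NameNode.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        // Normally in hdfs-site.xml: one replica per block is enough on one machine.
        conf.setInt("dfs.replication", 1);
        return conf;
    }
}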
Hadoop Core Components
What is HDFS
A specially designed file system for storing huge data sets on clusters of commodity
hardware, with a streaming access pattern.
HDFS Architecture
Hadoop Distributed File System -

Hadoop can work directly with any mountable distributed file system such as Local FS, HFTP FS, S3
FS, and others, but the most common file system used by Hadoop is the Hadoop Distributed File
System (HDFS).

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a
distributed file system that is designed to run on large clusters (thousands of computers) of small
computer machines in a reliable, fault-tolerant manner.

HDFS uses a master/slave architecture in which the master consists of a single NameNode that
manages the file system metadata, and one or more slave DataNodes store the actual data.
Name Node -
❏ It is the master of the system

❏ Maintains and manages the blocks that are present on the DataNodes

❏ The NameNode determines the mapping of blocks to the DataNodes.

Data Node -
❏ Slaves deployed on each machine; they provide the actual storage

❏ Responsible for serving read and write requests from the clients

❏ The DataNodes take care of read and write operations on the file system. They also handle
block creation, deletion, and replication based on instructions given by the NameNode.
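A minimal client sketch (the path and file contents are hypothetical) shows how this division of labour looks from the application side: the client obtains metadata and block locations from the NameNode via the FileSystem API, while the actual bytes are streamed to and from DataNodes.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        // Write: the NameNode allocates blocks, the DataNodes store them.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations, the client reads from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}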
MapReduce
Job Tracker & Task Tracker
Job Tracker
Job Tracker Contd..
Job Tracker Contd..
Job Tracker Contd...
HDFS Client Creates New File
Rack Awareness
Anatomy of File Read
Anatomy of File Write
MapReduce Flow Chart - Word Count Job
MapReduce Flow Chart
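The flow-chart slides above walk through a word count job end to end. A minimal driver for the mapper and reducer sketched earlier (input and output paths are assumed to be passed on the command line) could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);    // local pre-aggregation on the map side
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Submitted with the hadoop jar command, YARN schedules the map and reduce tasks across the cluster, and the word counts land in the output directory given as the second argument.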
The Hadoop Ecosystem
The Hadoop ecosystem includes other tools to address particular needs.
Hive - A data warehouse infrastructure built on top of Hadoop, providing data
summarization, query, and analysis (a JDBC query sketch follows at the end of this list).

HBase - An open-source, non-relational, distributed database modeled after
Google's Bigtable. Written in Java, it runs on top of HDFS.

Impala - A high-performance SQL engine for vast amounts of data; Impala runs on
Hadoop clusters.

Pig - A high-level platform for creating MapReduce programs using a language called
Pig Latin. It does much the same job as SQL: instead of writing a detailed program to
retrieve data, you write a few simple, SQL-like statements.
Oozie - A workflow scheduler system to manage Hadoop jobs. It is a Java-based web
application responsible for scheduling and running these jobs.

Flume - A distributed service for collecting, aggregating, and moving large amounts
of log data. It has a simple and flexible architecture based on streaming data flows and
uses a simple, extensible data model that allows for online analytic applications.

Sqoop - A tool for transferring bulk data between Apache Hadoop and structured
datastores such as relational databases.
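As referenced in the Hive entry above, Hive (and, with a different driver, Impala) can be queried from Java over JDBC. A rough sketch, assuming a HiveServer2 instance on its default port 10000 and a hypothetical logs table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; port 10000 is the usual default.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT level, COUNT(*) FROM logs GROUP BY level")) {  // hypothetical table
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}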
Testing Types under Big Data Testing
Big Data Testing
Testing a Big Data application is more a verification of its data processing than
testing the individual features of the software product. When it comes to
Big Data testing, performance and functional testing are key.

In Big Data testing, QA engineers verify the successful processing of terabytes of
data using a commodity cluster and other supportive components. It demands a
high level of testing skill, as the processing is very fast. Processing may be of
three types: batch, real-time, or interactive.
Data quality is also an important factor in Big Data testing. Before testing the application, it is
necessary to check the quality of the data, and this check should be considered a part of
database testing.

It involves checking various characteristics like conformity, accuracy, duplication,
consistency, validity, data completeness, etc.
Testing Steps in verifying Big Data Applications
Step 1: Data Staging Validation

The first step of Big Data testing, also referred to as the pre-Hadoop stage, involves
process validation.
Data from various sources like RDBMSs, weblogs, social media, etc. should be
validated to make sure that the correct data is pulled into the system
Compare the source data with the data pushed into the Hadoop system to make
sure they match
Verify that the right data is extracted and loaded into the correct HDFS location

Tools like Talend and Datameer can be used for data staging validation
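As a toy illustration of the comparison idea above (not a real validation tool; the connection URL, table, and HDFS path are hypothetical), a QA utility could compare the source row count against the number of records that landed in HDFS:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StagingCountCheck {
    public static void main(String[] args) throws Exception {
        // Row count in the source RDBMS (hypothetical database and table).
        long sourceCount;
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://dbhost:3306/sales", "qa", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM transactions")) {
            rs.next();
            sourceCount = rs.getLong(1);
        }

        // Record count of the file ingested into HDFS (one record per line assumed).
        Configuration conf = new Configuration();
        long hdfsCount = 0;
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                 fs.open(new Path("/staging/transactions.csv")), StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                hdfsCount++;
            }
        }

        System.out.println(sourceCount == hdfsCount
            ? "Staging check passed: " + sourceCount + " records"
            : "Mismatch: source=" + sourceCount + " hdfs=" + hdfsCount);
    }
}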
Step 2: "MapReduce" Validation

The second step is a validation of "MapReduce". In this stage, the tester verifies
the business logic on a single node and then validates it after running against
multiple nodes, ensuring that:
The MapReduce process works correctly
Data aggregation or segregation rules are implemented on the data
Key-value pairs are generated
The data is validated after the MapReduce process
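One common way to exercise this logic on known inputs before running against real nodes (an assumption here, not something the slides prescribe) was Apache MRUnit, now retired but still illustrative. A sketch driving the word-count mapper from the earlier example:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class TokenizerMapperTest {

    @Test
    public void emitsOneCountPerToken() throws Exception {
        MapDriver<Object, Text, Text, IntWritable> driver =
            MapDriver.newMapDriver(new WordCount.TokenizerMapper());

        driver.withInput(new LongWritable(0), new Text("big data big"))
              .withOutput(new Text("big"), new IntWritable(1))
              .withOutput(new Text("data"), new IntWritable(1))
              .withOutput(new Text("big"), new IntWritable(1))
              .runTest();   // fails if the actual (key, value) pairs differ
    }
}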
Step 3: Output Validation Phase

The final or third stage of Big Data testing is the output validation process. The
output data files are generated and ready to be moved to an EDW (Enterprise
Data Warehouse) or any other system based on the requirement.
Activities in the third stage include:

Checking that the transformation rules are correctly applied

Checking data integrity and the successful data load into the target system
Checking that there is no data corruption by comparing the target data with the
HDFS file system data
Big Data - Test Automation Flow
Challenges in Big Data Testing
Automation
Automation testing for Big Data requires someone with technical expertise. Also, automated tools are
not equipped to handle unexpected problems that arise during testing.

Virtualization
It is one of the integral phases of testing. Virtual machine latency creates timing problems in real-time
Big Data testing. Also, managing images in Big Data is a hassle.

Large Datasets
Need to verify more data, and need to do it faster
Need to automate the testing effort
Need to be able to test across different platforms
Introduction to Cloudera

Founded in 2008, Cloudera was the first, and is currently the leading, provider
and supporter of Apache Hadoop for the enterprise.

Cloudera is revolutionizing enterprise data management by offering the first
unified Platform for Big Data: the Enterprise Data Hub.

Cloudera offers enterprises one place to store, process, and analyze all their
data, empowering them to extend the value of existing investments while
enabling fundamental new ways to derive value from their data.
Why do customers choose Cloudera?
Cloudera was the first commercial provider of Hadoop-related software and
services and has the most customers with enterprise requirements, and the
most experience supporting them, in the industry.

Cloudera’s combined offering of differentiated software (open and closed
source), support, training, professional services, and indemnity brings
customers the greatest business value, in the shortest amount of time, at the
lowest TCO.
CDH (Cloudera’s Distribution, including Apache Hadoop)
100% open source, enterprise-ready distribution of Hadoop and related
projects

The most complete, tested, and widely-deployed distribution of Hadoop

Integrates all key Hadoop ecosystem projects

CDH includes many Hadoop Ecosystem components

Following are more details on some of the key components

What is Cloudera Manager?
❖ Cloudera Manager is a purpose-built application designed to make the
administration of Hadoop simple and straightforward

❖ Automates the installation of a Hadoop cluster

❖ Quickly adds and configures new services on a cluster

❖ Provides real time monitoring of cluster activity

❖ Produces reports of cluster usage

❖ Manages users and groups who have access to the cluster

❖ Integrates with your existing enterprise monitoring tools


Q&A
Thank You
