HADOOP
Traditional Approach - In this approach, an enterprise has a computer to store and
process big data. Data is stored in an RDBMS such as Oracle Database, MS SQL
Server, or DB2, and sophisticated software is written to interact with the database,
process the required data, and present it to the users for analysis.
Google’s Solution - Google solved this problem using an algorithm called MapReduce.
This algorithm divides the task into small parts, assigns those parts to many computers
connected over the network, and collects the results to form the final result dataset.
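As a toy illustration of this divide, process in parallel, and combine idea, here is a minimal sketch in plain Java (not Hadoop code; the sample lines are made up):

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class MiniMapReduce {
    public static void main(String[] args) {
        // The "data set": lines of text treated as small independent parts.
        List<String> lines = Arrays.asList("to be or not to be", "that is the question");

        // "Map" phase: each part is processed independently (here, in parallel),
        // emitting words; the "reduce" phase combines them into per-word counts.
        Map<String, Long> counts = lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        System.out.println(counts); // e.g. {to=2, be=2, or=1, not=1, that=1, is=1, the=1, question=1}
    }
}

Hadoop applies the same idea, but spreads the parts across many machines instead of threads on one machine.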
Hadoop - Introduction
Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These
libraries provide filesystem and OS-level abstractions and contain the necessary Java files and
scripts required to start Hadoop.
Hadoop YARN (Yet Another Resource Negotiator): This is a framework for job scheduling and cluster
resource management.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput
access to application data.
Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.
These four components make up the core of the Hadoop framework.
Hadoop Operation Modes
Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of the
three supported modes:
Local/Standalone Mode : By default, Hadoop runs in this mode as a single Java process on one
machine, using the local file system; it is mainly useful for debugging.
Pseudo-Distributed Mode : All Hadoop daemons run on a single machine, each in a separate Java
process, simulating a small cluster.
Fully Distributed Mode : This mode is fully distributed, with a minimum of two or more
machines forming a cluster. We will cover this mode in detail in the coming chapters.
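As a minimal sketch of how the chosen mode surfaces in client code, the snippet below sets fs.defaultFS programmatically; the hdfs://localhost:9000 address is only an example for a pseudo-distributed setup (it is normally configured in core-site.xml rather than in code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ModeProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Standalone (local) mode leaves fs.defaultFS as file:///;
        // pseudo- and fully distributed modes point it at a NameNode.
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // example address, adjust for your cluster
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default file system: " + fs.getUri());
    }
}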
Hadoop Core Components
What is HDFS
A specially designed file system for storing huge data sets on a cluster of commodity
hardware, with a streaming data access pattern
HDFS Architecture
Hadoop Distributed File System -
Hadoop can work directly with any mountable distributed file system such as Local FS, HFTP FS, S3
FS, and others, but the most common file system used by Hadoop is the Hadoop Distributed File
System (HDFS).
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a
distributed file system that is designed to run on large clusters (thousands of computers) of small
commodity machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture in which the master is a single NameNode that manages
the file system metadata, and the slaves are one or more DataNodes that store the actual data.
Name Node -
❏ It is the master of the system; it maintains the file system namespace and metadata and regulates access to files by clients
Data Node -
❏ Slaves, which are deployed on each machine and provide the actual storage
❏ Responsible for serving read and write requests from clients
❏ The DataNodes take care of read and write operations with the file system. They also take
care of block creation, deletion, and replication based on instructions given by the NameNode (see the sketch below).
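To make this split between NameNode metadata and DataNode storage concrete, here is a hedged sketch that asks HDFS for the block locations of a file; the path /user/demo/sample.txt is only a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/sample.txt"); // placeholder path

        // The NameNode answers this metadata query: which blocks make up the file
        // and on which DataNodes each replica lives.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}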
MapReduce
JobTracker & TaskTracker
❏ The JobTracker runs on the master node; it accepts MapReduce jobs from clients, schedules
their tasks on the TaskTrackers, monitors progress, and re-executes failed tasks.
❏ A TaskTracker runs on each slave node; it executes the map and reduce tasks assigned to it
and reports progress back to the JobTracker through periodic heartbeats.
HDFS Client Creates New File
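The slide above is normally backed by a sequence diagram; as a minimal code-level sketch, the snippet below creates a small file and reads it back through the standard FileSystem API (the path is a placeholder):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/hello.txt"); // placeholder path

        // Write: the client asks the NameNode to create the file, then streams
        // the data to the DataNodes chosen for each block.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client gets block locations from the NameNode and reads
        // the data directly from the DataNodes.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
    }
}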
Rack Awareness
Anatomy of File Read
Anatomy of File Write
MapReduce Flow Chart- Word Count Job
MapReduce Flow Chart
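Since the flow-chart diagrams are not reproduced here, a minimal sketch of the classic word count job written against the standard org.apache.hadoop.mapreduce API is shown instead (input and output paths are taken from the command line):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The map phase turns each input line into (word, 1) pairs, the framework shuffles and sorts them by key, and the reduce phase sums the counts per word.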
The Hadoop Ecosystem
The Hadoop ecosystem includes other tools to address particular needs.
Hive - A data warehouse infrastructure built on top of Hadoop, providing data
summarization, query, and analysis.
Impala - A high-performance SQL engine for vast amounts of data; Impala runs on
Hadoop clusters.
Pig - A high-level platform for creating MapReduce programs using a language called
Pig Latin. It plays a role similar to SQL: instead of writing a detailed MapReduce
program to retrieve data, you write a few simple Pig Latin statements.
Oozie - A workflow scheduler system to manage Hadoop jobs. It is a Java-based web
application responsible for triggering and coordinating workflow actions.
Flume - A distributed service for collecting, aggregating, and moving large amounts
of log data. It has a simple and flexible architecture based on streaming data flows,
and it uses a simple, extensible data model that allows for online analytic applications.
Sqoop - A tool to transfer bulk data between Apache Hadoop and structured
datastores such as relational databases.
Testing Types under Big Data Testing
Big Data Testing
Testing a Big Data application is more a verification of its data processing than
testing the individual features of the software product. When it comes to
Big Data testing, performance and functional testing are the keys.
Step 1: Data Staging Validation
The first step of Big Data testing, also referred to as the pre-Hadoop stage, involves
process validation.
Data from various sources like RDBMS, weblogs, social media, etc. should be
validated to make sure that correct data is pulled into the system
Comparing the source data with the data pushed into the Hadoop system to make
sure they match
Verifying that the right data is extracted and loaded into the correct HDFS location
Tools like Talend and Datameer can be used for data staging validation
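As a hedged sketch of one such staging check, the snippet below compares a record count in a source table against the line count of the file landed in HDFS; the JDBC URL, credentials, table name, and HDFS path are all placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StagingCountCheck {
    public static void main(String[] args) throws Exception {
        // Count rows in the source RDBMS table (placeholder URL, credentials, and table;
        // a matching JDBC driver must be on the classpath).
        long sourceCount;
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://dbhost/sales", "user", "pass");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM orders")) {
            rs.next();
            sourceCount = rs.getLong(1);
        }

        // Count lines in the file ingested into HDFS (placeholder path).
        long hdfsCount = 0;
        FileSystem fs = FileSystem.get(new Configuration());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/staging/orders.csv")), StandardCharsets.UTF_8))) {
            while (reader.readLine() != null) {
                hdfsCount++;
            }
        }

        System.out.println(sourceCount == hdfsCount
                ? "PASS: counts match (" + sourceCount + ")"
                : "FAIL: source=" + sourceCount + " hdfs=" + hdfsCount);
    }
}

Staging tools such as those named above perform richer checks (schema, nulls, duplicates); a count comparison is just a common first sanity test.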
Step 2: "MapReduce" Validation
The second step is validation of "MapReduce". In this stage, the tester verifies
the business logic on a single node and then validates it after running
against multiple nodes, ensuring that:
The MapReduce process works correctly
Data aggregation or segregation rules are implemented on the data
Key-value pairs are generated
The data is validated after the MapReduce process
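One common way to exercise the map and reduce logic in isolation before running against multiple nodes is a unit test. The hedged sketch below uses the (now retired) Apache MRUnit library against the WordCount mapper and reducer sketched earlier, so both the library choice and those class names are assumptions here:

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountTest {

    @Test
    public void mapperEmitsOnePerWord() throws Exception {
        // Verify the map logic on a single node: every word yields (word, 1).
        MapDriver.newMapDriver(new WordCount.TokenizerMapper())
                .withInput(new LongWritable(0), new Text("cat dog cat"))
                .withOutput(new Text("cat"), new IntWritable(1))
                .withOutput(new Text("dog"), new IntWritable(1))
                .withOutput(new Text("cat"), new IntWritable(1))
                .runTest();
    }

    @Test
    public void reducerSumsCounts() throws Exception {
        // Verify the aggregation rule: values for one key are summed.
        ReduceDriver.newReduceDriver(new WordCount.IntSumReducer())
                .withInput(new Text("cat"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("cat"), new IntWritable(2))
                .runTest();
    }
}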
Step 3: Output Validation Phase
The final or third stage of Big Data testing is the output validation process. The
output data files are generated and ready to be moved to an EDW (Enterprise
Data Warehouse) or any other system based on the requirement.
Activities in the third stage include:
Checking that the transformation rules are applied correctly
Checking the data integrity and successful data load into the target system
Checking that there is no data corruption by comparing the target data with the HDFS data
Challenges in Big Data Testing
Virtualization
It is one of the integral phases of testing. Virtual machine latency creates timing problems in real-time
Big Data testing, and managing images in Big Data is also a hassle.
Large Dataset
Need to verify more data and need to do it faster
Need to automate the testing effort
Need to be able to test across different platforms
Introduction to Cloudera
Founded in 2008, Cloudera was the first, and is currently the leading, provider
and supporter of Apache Hadoop for the enterprise.
Cloudera offers enterprises one place to store, process, and analyze all their
data, empowering them to extend the value of existing investments while
enabling fundamental new ways to derive value from their data.
Why do customers choose Cloudera?
Cloudera was the first commercial provider of Hadoop-related software and
services and has the most customers with enterprise requirements, and the
most experience supporting them, in the industry.