Bda A2
Hadoop
A Hadoop cluster architecture consists of a data centre, racks, and the nodes that actually execute the jobs: the data centre consists of racks, and each rack consists of nodes. A medium to large cluster uses a two- or three-level Hadoop cluster architecture built with rack-mounted servers. The servers within a rack are interconnected through 1 gigabit Ethernet (1 GigE). Each rack-level switch in a Hadoop cluster is connected to a cluster-level switch; the cluster-level switches are in turn connected to other cluster-level switches or uplink to other switching infrastructure.
If a node fails, the replicated copy of the data present on another node in the cluster can be used for analysis.
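For example, once data has been loaded into HDFS, the replica locations of a file can be inspected with fsck (the file path here is only illustrative):
$ hdfs fsck /user/hadoop/sample.txt -files -blocks -locations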
Execution:
1. Download Hadoop 3.3.6 from this website: https://hadoop.apache.org/release/3.3.6.html
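For instance, the release tarball can be fetched directly with wget (the exact mirror URL may vary):
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz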
2. Extract the tar file using the tar command:
$ tar -xzf hadoop-3.3.6.tar.gz
3. Rename the extracted directory:
$ mv hadoop-3.3.6 hadoop
4. Open the hadoop-env.sh file in the nano editor and edit the JAVA_HOME variable.
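For example, on a machine with OpenJDK 8 installed (the exact path depends on the JDK in use):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64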
7. Create a hadoop user
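On Ubuntu this can be done with, for example:
$ sudo adduser hadoop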
$ source ~/.bashrc
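This reloads the shell so the Hadoop environment variables take effect. A sketch of the entries typically exported in ~/.bashrc, assuming Hadoop lives under /usr/local/hadoop:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin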
15. Now, open the core-site.xml file and add the following configurations as shown:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
16. Now, open the hdfs-site.xml file and add the following configurations
as shown:
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/local/hadoop/data/nameNode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/local/hadoop/data/dataNode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
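The configured replication factor can be double-checked with:
$ hdfs getconf -confKey dfs.replication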
17. Now, open the mapred-site.xml file and add the following
configurations as shown:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-master:9001</value>
  </property>
</configuration>
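Note that mapred.job.tracker is a Hadoop 1.x property; on a 3.x release such as 3.3.6, MapReduce normally runs on YARN instead, configured roughly as follows:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>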
18. Now, on the master, open the workers file:
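It should list the hostnames of all worker nodes, one per line; for the single slave used in the next step, for example:
slave1
If the master should also run a DataNode, add master on its own line as well.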
19. Use the following command to copy the files to the slave node:
$ scp -r /usr/local/hadoop/* slave1:/usr/local/hadoop
20. Now, on slave1, open the yarn-site.xml file and add the following configurations as shown:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop-master</value>
  </property>
</configuration>
21. Now we need to format the HDFS file system. Run this command on the master node:
$ hdfs namenode -format
Conclusion:
In conclusion, this experiment helped me learn how to set up a multi-node
Hadoop cluster. By configuring different parts of Hadoop and
understanding how it handles errors, I gained valuable skills for managing
large amounts of data. Doing tasks like formatting the HDFS file system
also gave me hands-on experience in working with big data systems. This
experiment improved my understanding of Hadoop's structure and taught
me how to build and manage data processing setups effectively.