Big Data
Objective: Downloading and installing Hadoop. Understanding different Hadoop modes, startup scripts,
configuration files.
THEORY:
1. Stand Alone Mode: Hadoop is distributed software designed to run on a cluster of commodity machines. However, we can install it on a single node in stand-alone mode. In this mode, the Hadoop software runs as a single monolithic Java process. Stand-alone mode is useful for debugging: you can first test your MapReduce application on small data in this mode before executing it on a cluster with big data.
2. Pseudo-Distributed Mode: In this mode too, Hadoop is installed on a single node, but the various Hadoop daemons run on that machine as separate Java processes. Hence all the daemons, namely NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker, run on a single machine.
3. Fully Distributed Mode: In fully distributed mode, the daemons NameNode, JobTracker, and SecondaryNameNode (optional; it can be run on a separate node) run on the master node, while the DataNode and TaskTracker daemons run on the slave nodes.
Go to the Apache Hadoop website and download the latest stable release of Hadoop. Extract the downloaded
archive to a desired directory.
Set the environment variables for Hadoop in the hadoop-env.sh file located in the etc/hadoop directory. You
will need to set JAVA_HOME to the path of your Java installation and HADOOP_HOME to the path of your
Hadoop installation.
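For example, if Java is installed under /usr/lib/jvm/java-8-openjdk-amd64 and Hadoop was extracted to /usr/local/hadoop (both paths are only illustrative), the relevant lines would be:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin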
Configure the Hadoop cluster by editing the core-site.xml, hdfs-site.xml, and mapred-site.xml files located in
the etc/hadoop directory.
Standalone Mode: This mode is useful for development and testing purposes. In this mode, Hadoop runs on a
single node and does not require any Hadoop daemons to be running.
Pseudo-Distributed Mode: In this mode, Hadoop runs on a single machine, but it simulates a cluster by running
all the Hadoop daemons on the same machine. This mode is useful for development and testing purposes.
Fully-Distributed Mode: In this mode, Hadoop runs on a cluster of machines. Each machine in the cluster runs a
set of Hadoop daemons, and the cluster is managed by a single master node.
Startup Scripts:
Hadoop provides several scripts to start and stop the various Hadoop daemons. These scripts are located in the
sbin directory in the Hadoop installation directory. Some of the important scripts are:
start-dfs.sh: This script starts the Hadoop Distributed File System (HDFS) daemons.
start-yarn.sh: This script starts the Yet Another Resource Negotiator (YARN) daemons.
stop-dfs.sh: This script stops the HDFS daemons.
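For example, on a single-node setup the HDFS and YARN daemons can be started from the Hadoop installation directory and then verified with the JDK's jps tool (this assumes the NameNode has already been formatted with hdfs namenode -format):
sbin/start-dfs.sh
sbin/start-yarn.sh
jps
The corresponding stop-dfs.sh and stop-yarn.sh scripts shut the daemons down again.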
Hadoop uses XML-based configuration files to configure the various Hadoop daemons.
These files are located in the etc/hadoop directory in the Hadoop installation directory.
Some of the important configuration files are:
core-site.xml: This file contains configuration settings that are common to both HDFS and YARN, such as the
default file system URI and the Hadoop temporary directory.
hdfs-site.xml: This file contains configuration settings specific to the HDFS, such as the replication factor and
block size.
mapred-site.xml: This file contains configuration settings specific to the MapReduce framework, such as the
number of map and reduce tasks that can run concurrently.
These configuration files can be edited to change the configuration settings for the various Hadoop daemons.
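As an illustration, a minimal pseudo-distributed setup commonly uses the following properties in core-site.xml and hdfs-site.xml (the localhost:9000 address and the replication factor of 1 are typical single-node choices, not requirements):
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>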
Experiment-2
Objective: Implement the following file management tasks in Hadoop:
1. Adding files and directories
2. Retrieving files
3. Deleting files
Before you can run Hadoop programs on data stored in HDFS, you need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't automatically created for you, though, so let's create it with the mkdir command.
Commands:
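A typical sequence for creating the directory and adding a file to it (the file name example.txt is illustrative) would be:
hadoop fs -mkdir -p /user/$USER
hadoop fs -put example.txt /user/$USER/
hadoop fs -ls /user/$USER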
Retrieving files from HDFS involves reading data from the distributed file system and copying it to the local
file system. Hadoop provides several command-line tools to retrieve files from HDFS.
Commands:
hadoop fs -cat /user/data/sampletext.txt
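To copy a file from HDFS back to the local file system, hadoop fs -get (or -copyToLocal) can be used; the local destination name here is illustrative:
hadoop fs -get /user/data/sampletext.txt ./sampletext.txt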
Deleting files from Hadoop involves removing data from the distributed file system. Hadoop provides several command-line tools to delete files from HDFS, including hadoop fs -rm, hadoop fs -rmr, hadoop fs -expunge, and others.
Commands:
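Typical examples (paths are illustrative; in recent Hadoop releases hadoop fs -rm -r replaces the deprecated -rmr):
hadoop fs -rm /user/$USER/example.txt
hadoop fs -rm -r /user/$USER/olddir
hadoop fs -expunge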
Experiment-3
Objective: Implement matrix multiplication of two input matrices A and B with Hadoop MapReduce.
Map function:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // Assumed input record format: matrixName,row,col,value (e.g. "A,0,2,5").
    // m (rows of A) and p (columns of B) are assumed to be stored in the job configuration;
    // outputKey and outputValue are Text fields of the Mapper class.
    int m = context.getConfiguration().getInt("m", 0);
    int p = context.getConfiguration().getInt("p", 0);
    String[] line = value.toString().split(",");
    String matrix = line[0];
    int row = Integer.parseInt(line[1]);
    int col = Integer.parseInt(line[2]);
    int val = Integer.parseInt(line[3]);
    // Emit key-value pairs so that every result cell (i,j) receives the A and B elements it needs.
    if (matrix.equals("A")) {
        for (int j = 0; j < p; j++) {
            outputKey.set(row + "," + j);
            outputValue.set("A," + col + "," + val);
            context.write(outputKey, outputValue);
        }
    } else {
        for (int i = 0; i < m; i++) {
            outputKey.set(i + "," + col);
            outputValue.set("B," + row + "," + val);
            context.write(outputKey, outputValue);
        }
    }
}
Reduce function:
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    // n is the common dimension (columns of A = rows of B); outputValue is an IntWritable field of the Reducer class.
    int n = context.getConfiguration().getInt("n", 0);
    int[] vectorA = new int[n];
    int[] vectorB = new int[n];
    for (Text value : values) {
        String[] line = value.toString().split(",");
        String matrix = line[0];
        int row = Integer.parseInt(line[1]);
        int val = Integer.parseInt(line[2]);
        if (matrix.equals("A")) {
            vectorA[row] = val;
        } else {
            vectorB[row] = val;
        }
    }
    // Cell (i,j) of the product is the sum over k of A[i][k] * B[k][j].
    int sum = 0;
    for (int k = 0; k < n; k++) {
        sum += vectorA[k] * vectorB[k];
    }
    outputValue.set(sum);
    context.write(key, outputValue);
}
Driver program:
job.setJarByClass(MatrixMultiplication.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
// The mapper emits Text values, so the intermediate value class must be set explicitly.
job.setMapOutputValueClass(Text.class);
// Set the final output key-value classes
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
Experiment-4
Objective: Write a MapReduce program that mines weather data. Hint: Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
PROGRAM LOGIC:
Word Count is a simple program which counts the number of occurrences of each word in a given text input data set, and it fits the MapReduce programming model very well. The weather-mining job below follows the same pattern: the mapper extracts a temperature reading from each input record and the reducer aggregates the values collected for each reading. Our implementation consists of three main parts:
1. Mapper
2. Reducer
3. Main program
1. Write Mapper
void Map (key, value){
    for each min_temp x in value:
        output.collect(x, 1);
}
2. Write Reducer
void Reduce (min_temp, <list of value>){
    sum = 0;
    for each x in <list of value>:
        sum += x;
    final_output.collect(min_temp, sum);
}
3. Write Driver
The driver program configures and runs the MapReduce job. We use the main program to perform basic configurations such as:
Job Name: the name of this job
Executable (Jar) Class: the main executable class of the job
Mapper Class: the class which overrides the "map" function (here, Map)
Reducer Class: the class which overrides the "reduce" function (here, Reduce)
Output Key: the type of the output key (here, Text)
Output Value: the type of the output value (here, IntWritable)
File Input Path
File Output Path
INPUT/OUTPUT:
Set of Weather Data over the years:
Experiment-5
Objective: To run a basic Word Count program using the MapReduce paradigm to understand how it
works.
Map Function: The map function tokenizes each line of input text into words and emits a key-value pair for each
word with a value of one.
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // Tokenize the input line and emit (word, 1) for each token.
    // word is a Text field and ONE an IntWritable(1) field of the Mapper class.
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, ONE);
    }
}
Reduce function: The reduce function sums up the counts for each word and emits a final key-value pair with
the word and its total count.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
}
Driver program: The driver program sets up the job configuration, specifying the input and output paths, the mapper and reducer classes, and the output key-value types. It then runs the job and waits for its completion.
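A minimal driver sketch for this word count job (assuming the Map and Reduce classes above are nested in a WordCount class and that args[0] and args[1] hold the input and output paths) could look like this:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}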
Experiment-6
Objective: Implement K-means clustering using the MapReduce framework.
The following are the steps to implement K-means clustering using MapReduce:
1. Input Data Preparation: The input data should be split into multiple blocks and stored in a distributed file system such as the Hadoop Distributed File System (HDFS).
2. Initialization: Initialize the centroids by randomly selecting k points from the input data. These k points will be the initial centroids for the K-means algorithm.
3. Map Step: In the Map step, each data point is assigned to the nearest centroid. The distance metric used for measuring the distance between data points and centroids can be Euclidean distance or Manhattan distance.
4. Reduce Step: In the Reduce step, the mean of all the data points assigned to a centroid is calculated. This mean becomes the new centroid for that cluster.
5. Repeat Steps 3 and 4: The Map and Reduce steps are repeated until the centroids converge. Convergence can be checked by comparing the new centroids with the old centroids; if the new centroids are the same as the old centroids, the algorithm has converged.
6. Output Results: The final output of the algorithm is the set of k centroids.
Implementing K-means clustering using MapReduce can be computationally expensive due to the MapReduce
overhead, but it can be parallelized to process large datasets efficiently. The scalability of MapReduce makes it
an ideal framework for big data processing and analysis.
Program:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class KMeansMR implements Tool {
    private Configuration conf;

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new KMeansMR(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        if (args.length < 3) {
Experiment-7
Objective: Downloading, installing, and practicing Apache Hive.
1. First we downloaded the latest version of Apache Hive from the official website:
https://hive.apache.org/downloads.html
2. Then we have extracted the downloaded archive to a directory of our choice.
3. Then we set the environment variables HIVE_HOME and PATH to the location where we
have extracted Hive. For example, we have extracted it to /usr/local/hive, so we have added the
following lines to .bashrc or .bash_profile
file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
4. We Started the Hive metastore service by running the following command:
$HIVE_HOME/bin/hive --service metastore &
5. Start the Hive server by running the following command:
$HIVE_HOME/bin/hiveserver2 &
6. Now we have a running instance of Hive that we can connect to and use.
To practice using Hive, we have used the Hive command-line interface (CLI):
Create a table:
id INT,
name STRING,
age INT
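A complete HiveQL statement for such a table might look like the following (the table name students and the field delimiter are assumptions made for illustration):
CREATE TABLE students (
  id INT,
  name STRING,
  age INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';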
Experiment-8
Objective: Downloading, installing, and practicing Apache HBase with the Thrift API.
Download the latest version of Apache HBase from the official website and extract the downloaded archive to a directory of your choice.
Set the environment variable HBASE_HOME to the location where you extracted HBase. For example, if
you extracted it to /usr/local/hbase, you would add the following line to your .bashrc or .bash_profile file:
export HBASE_HOME=/usr/local/hbase
Start HBase by running the following command:
$HBASE_HOME/bin/start-hbase.sh
You should now have a running instance of HBase that you can connect to and use.
Download the latest version of Apache Thrift from the official website:
https://thrift.apache.org/download
Install the required dependencies for your platform. This will typically include a C++ compiler, Python, and
the development headers for your operating system.
./configure
make
make install
To practice using HBase and Thrift, you can try out some basic operations, such as the ones sketched below.
Create a table:
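For illustration, a table could be created and populated from the HBase shell as follows (the table name users and the column family info are assumptions):
create 'users', 'info'
put 'users', 'row1', 'info:name', 'Alice'
scan 'users'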
Query the data using the Thrift API. First, generate the Thrift bindings for your language of choice:
thrift -r -gen <language> /path/to/HBase.thrift
Experiment-9
Objective: Import and export data from various databases (MySQL, PostgreSQL, MongoDB, and Oracle).
Importing and exporting data from databases is an essential part of data management and analysis. Here are some examples of how to import and export data from various databases.
To import data from MySQL, you can use the MySQL command line client or a graphical client like MySQL
Workbench. Here is an example of how to import data using the command line client:
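Using the same placeholder names as the export example below (username, dbname, data.sql), the import command would typically be:
mysql -u username -p dbname < data.sql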
This command will import the data from the file data.sql into the database dbname.
To export data from MySQL, you can use the mysqldump command. Here is an example of how to
export data using the mysqldump command:
mysqldump -u username -p dbname > data.sql
This command will export the data from the database dbname into the file data.sql.
To import data from PostgreSQL, you can use the psql command line client. Here is an example of how to
import data using the psql client:
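A typical form, with username, dbname, and data.sql as placeholders, would be:
psql -U username -d dbname -f data.sql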
This command will import the data from the file data.sql into the database dbname.
To export data from PostgreSQL, you can use the pg_dump command. Here is an example of how to export
data using the pg_dump command:
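A typical form, with the same placeholder names, would be:
pg_dump -U username -d dbname -t tablename -f data.sql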
This command will export the data from the table tablename in the database dbname into the specified output file.
To import data into MongoDB, you can use the mongoimport command line tool. Here is an example of how to
import data using the mongoimport tool:
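A typical form, with dbname, collectionname, and data.json as placeholders, would be:
mongoimport --db dbname --collection collectionname --file data.json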
This command will import the data from the file data.json into the collection collectionname in the database
dbname.
To export data from MongoDB, you can use the mongoexport command line tool. Here is an example of
how to export data using the mongoexport tool:
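A typical form, with the same placeholders, would be:
mongoexport --db dbname --collection collectionname --out data.json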
This command will export the data from the collection collectionname in the database dbname into the specified output file.
To import data into Oracle, you can use the SQL*Loader utility. Here is an example of how to import data using SQL*Loader:
sqlldr username/password@dbname control=loader.ctl
This command will import the data specified in the control file loader.ctl into the database dbname. To export data from Oracle, you can use the Data Pump Export utility. Here is an example of how to export data using Data Pump Export:
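A typical command, matching the description below (the directory object DATA_PUMP_DIR is an assumption), would be:
expdp username/password@dbname tables=tablename directory=DATA_PUMP_DIR dumpfile=data.dmp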
This command will export the data from the table tablename in the database dbname into the file
data.dmp. The directory parameter specifies the directory where the file will be written.
Experiment-10
Objective: Write PIG commands: write Pig Latin scripts to sort, group, join, project, and filter your data.
Sorting data
To sort data in Pig Latin, you can use the ORDER BY clause followed by the name of the column you
want to sort by. For example, the following script sorts data by the age column in ascending order:
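A sketch, assuming a relation named data has already been loaded and contains an age column:
sorted_data = ORDER data BY age ASC;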
Grouping data
To group data in Pig Latin, you can use the GROUP BY clause followed by the name of the column you want
to group by. For example, the following script groups data by the gender column:
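A sketch, again assuming a loaded relation data with a gender column:
grouped_data = GROUP data BY gender;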
Joining data
To join data in Pig Latin, you can use the JOIN clause followed by the names of the relations you want to
join and the join condition. For example, the following script joins two relations (data1 and data2) on the
name column:
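A sketch, assuming two loaded relations data1 and data2 that both contain a name column:
joined_data = JOIN data1 BY name, data2 BY name;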
Projecting data
To project specific columns in Pig Latin, you can use the FOREACH clause followed by the name of the
columns you want to project. For example, the following script projects the name and age columns:
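A sketch, assuming data contains name and age columns:
projected_data = FOREACH data GENERATE name, age;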
Filtering data
To filter data in Pig Latin, you can use the FILTER clause followed by the condition you want to filter by.
For example, the following script filters data where the age column is greater than or equal to 18:
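A sketch, assuming data contains an age column:
filtered_data = FILTER data BY age >= 18;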
Run the Pig Latin scripts to find the maximum temperature for each and every year.
This script loads data from a file containing temperature readings for each day, including the year, month, day, and temperature. It then uses Pig Latin operations such as LOAD, FOREACH, GROUP, MAX, and STORE to compute the maximum temperature for each year.
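A sketch of such a script, assuming a comma-separated input file named temperature_data.txt with year, month, day, and temperature fields:
records = LOAD 'temperature_data.txt' USING PigStorage(',') AS (year:int, month:int, day:int, temp:double);
year_temp = FOREACH records GENERATE year, temp;
grouped = GROUP year_temp BY year;
max_temp = FOREACH grouped GENERATE group AS year, MAX(year_temp.temp) AS max_temperature;
STORE max_temp INTO 'max_temp_by_year' USING PigStorage(',');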