Big Data
Objective: Downloading and installing Hadoop. Understanding different Hadoop modes, startup scripts,
configuration files.
THEORY:
1. Stand Alone Mode: Hadoop is distributed software designed to run on a cluster of commodity machines. However, we can install it on a single node in stand-alone mode. In this mode, the Hadoop software runs as a single monolithic Java process. Stand-alone mode is useful for debugging: you can first test your MapReduce application on small data in this mode before executing it on a cluster with big data.
2. Pseudo-Distributed Mode: In this mode too, Hadoop is installed on a single node, but the various Hadoop daemons run on that machine as separate Java processes. Hence all the daemons, namely NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker, run on a single machine.
3. Fully Distributed Mode: In fully distributed mode, the daemons NameNode, JobTracker, and SecondaryNameNode (optional; it can be run on a separate node) run on the master node, while the DataNode and TaskTracker daemons run on the slave nodes.
Go to the Apache Hadoop website and download the latest stable release of Hadoop. Extract the downloaded
archive to a desired directory.
Set the environment variables for Hadoop in the hadoop-env.sh file located in the etc/hadoop directory. You
will need to set JAVA_HOME to the path of your Java installation and HADOOP_HOME to the path of your
Hadoop installation.
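For example, if Java is installed under /usr/lib/jvm/java-8-openjdk-amd64 and Hadoop was extracted to /usr/local/hadoop (both paths are only illustrative), the relevant lines would be:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin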
Configure the Hadoop cluster by editing the core-site.xml, hdfs-site.xml, and mapred-site.xml files located in
the etc/hadoop directory.
Standalone Mode: This mode is useful for development and testing purposes. In this mode, Hadoop runs on a
single node and does not require any Hadoop daemons to be running.
Pseudo-Distributed Mode: In this mode, Hadoop runs on a single machine, but it simulates a cluster by running
all the Hadoop daemons on the same machine. This mode is useful for development and testing purposes.
Fully-Distributed Mode: In this mode, Hadoop runs on a cluster of machines. Each machine in the cluster runs a
set of Hadoop daemons, and the cluster is managed by a single master node.
Startup Scripts:
Hadoop provides several scripts to start and stop the various Hadoop daemons. These scripts are located in the
sbin directory in the Hadoop installation directory. Some of the important scripts are:
start-dfs.sh: This script starts the Hadoop Distributed File System (HDFS) daemons.
start-yarn.sh: This script starts the Yet Another Resource Negotiator (YARN) daemons.
stop-dfs.sh: This script stops the HDFS daemons.
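For example, on a single-node setup the HDFS and YARN daemons can be started from the Hadoop installation directory and then verified with the JDK's jps tool (this assumes the NameNode has already been formatted with hdfs namenode -format):
sbin/start-dfs.sh
sbin/start-yarn.sh
jps
The corresponding stop-dfs.sh and stop-yarn.sh scripts shut the daemons down again.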
Hadoop uses XML-based configuration files to configure the various Hadoop daemons.
These files are located in the etc/hadoop directory in the Hadoop installation directory.
Some of the important configuration files are:
core-site.xml: This file contains configuration settings that are common to both HDFS and YARN, such as the
default file system URI and the Hadoop temporary directory.
hdfs-site.xml: This file contains configuration settings specific to the HDFS, such as the replication factor and
block size.
mapred-site.xml: This file contains configuration settings specific to the MapReduce framework, such as the
number of map and reduce tasks that can run concurrently.
These configuration files can be edited to change the configuration settings for the various Hadoop daemons.
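As an illustration, a minimal pseudo-distributed setup commonly uses the following properties in core-site.xml and hdfs-site.xml (the localhost:9000 address and the replication factor of 1 are typical single-node choices, not requirements):
core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>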
Experiment-2
Objective: Implement the following file management tasks in Hadoop:
1. Adding files and directories
2. Retrieving files
3. Deleting files
Before you can run Hadoop programs on data stored in HDFS, you need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't automatically created for you, though, so let's create it with the mkdir command.
Commands:
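A typical sequence for creating the directory and adding a file to it (the file name example.txt is illustrative) would be:
hadoop fs -mkdir -p /user/$USER
hadoop fs -put example.txt /user/$USER/
hadoop fs -ls /user/$USER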
Retrieving files from HDFS involves reading data from the distributed file system and copying it to the local
file system. Hadoop provides several command-line tools to retrieve files from HDFS.
Commands:
hadoop fs -cat /user/data/sampletext.txt
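To copy a file from HDFS back to the local file system, hadoop fs -get (or -copyToLocal) can be used; the local destination name here is illustrative:
hadoop fs -get /user/data/sampletext.txt ./sampletext.txt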
Deleting files from Hadoop involves removing data from the distributed file system. Hadoop provides several command-line tools to delete files from HDFS, including hadoop fs -rm, hadoop fs -rmr, hadoop fs -expunge, and others.
Commands:
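Typical examples (paths are illustrative; in recent Hadoop releases hadoop fs -rm -r replaces the deprecated -rmr):
hadoop fs -rm /user/$USER/example.txt
hadoop fs -rm -r /user/$USER/olddir
hadoop fs -expunge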
Experiment-3
Objective: Implement matrix multiplication of two input matrices A and B with Hadoop MapReduce.
Map function:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // Assumed input record format: matrixName,row,col,value (e.g. "A,0,2,5").
    // m (rows of A) and p (columns of B) are assumed to be stored in the job configuration;
    // outputKey and outputValue are Text fields of the Mapper class.
    int m = context.getConfiguration().getInt("m", 0);
    int p = context.getConfiguration().getInt("p", 0);
    String[] line = value.toString().split(",");
    String matrix = line[0];
    int row = Integer.parseInt(line[1]);
    int col = Integer.parseInt(line[2]);
    int val = Integer.parseInt(line[3]);
    // Emit key-value pairs so that every result cell (i,j) receives the A and B elements it needs.
    if (matrix.equals("A")) {
        for (int j = 0; j < p; j++) {
            outputKey.set(row + "," + j);
            outputValue.set("A," + col + "," + val);
            context.write(outputKey, outputValue);
        }
    } else {
        for (int i = 0; i < m; i++) {
            outputKey.set(i + "," + col);
            outputValue.set("B," + row + "," + val);
            context.write(outputKey, outputValue);
        }
    }
}
Reduce function:
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    // n is the common dimension (columns of A = rows of B); outputValue is an IntWritable field of the Reducer class.
    int n = context.getConfiguration().getInt("n", 0);
    int[] vectorA = new int[n];
    int[] vectorB = new int[n];
    for (Text value : values) {
        String[] line = value.toString().split(",");
        String matrix = line[0];
        int row = Integer.parseInt(line[1]);
        int val = Integer.parseInt(line[2]);
        if (matrix.equals("A")) {
            vectorA[row] = val;
        } else {
            vectorB[row] = val;
        }
    }
    // Cell (i,j) of the product is the sum over k of A[i][k] * B[k][j].
    int sum = 0;
    for (int k = 0; k < n; k++) {
        sum += vectorA[k] * vectorB[k];
    }
    outputValue.set(sum);
    context.write(key, outputValue);
}
Driver program:
job.setJarByClass(MatrixMultiplication.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
// The mapper emits Text values, so the intermediate value class must be set explicitly.
job.setMapOutputValueClass(Text.class);
// Set the final output key-value classes
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
Experiment-4
Objective: Write a MapReduce program that mines weather data. Hint: Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
PROGRAM LOGIC:
Word Count is a simple program which counts the number of occurrences of each word in a given text input data set, and it fits the MapReduce programming model very well. The weather-mining job below follows the same pattern: the mapper extracts a temperature reading from each input record and the reducer aggregates the values collected for each reading. Our implementation consists of three main parts:
1. Mapper
2. Reducer
3. Main program
1. Write Mapper
void Map (key, value){
    for each min_temp x in value:
        output.collect(x, 1);
}
2. Write Reducer
void Reduce (min_temp, <list of value>){
    sum = 0;
    for each x in <list of value>:
        sum += x;
    final_output.collect(min_temp, sum);
}
3. Write Driver
The driver program configures and runs the MapReduce job. We use the main program to perform basic configurations such as:
Job Name: the name of this job
Executable (Jar) Class: the main executable class of the job
Mapper Class: the class which overrides the "map" function (here, Map)
Reducer Class: the class which overrides the "reduce" function (here, Reduce)
Output Key: the type of the output key (here, Text)
Output Value: the type of the output value (here, IntWritable)
File Input Path
File Output Path
INPUT/OUTPUT:
Set of Weather Data over the years:
Experiment-5
Objective: To run a basic Word Count program using the MapReduce paradigm to understand how it
works.
Map Function: The map function tokenizes each line of input text into words and emits a key-value pair for each
word with a value of one.
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // Tokenize the input line and emit (word, 1) for each token.
    // word is a Text field and ONE an IntWritable(1) field of the Mapper class.
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, ONE);
    }
}
Reduce function: The reduce function sums up the counts for each word and emits a final key-value pair with
the word and its total count.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
}
Driver program: The driver program sets up the job configuration, specifying the input and output paths, the mapper and reducer classes, and the output key-value types. It then runs the job and waits for its completion.
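A minimal driver sketch for this word count job (assuming the Map and Reduce classes above are nested in a WordCount class and that args[0] and args[1] hold the input and output paths) could look like this:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}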
Experiment-6
Objective: Implement K-means clustering using the MapReduce framework.
The following are the steps to implement K-means clustering using MapReduce:
1. Input Data Preparation: The input data should be split into multiple blocks and stored in a distributed file system such as the Hadoop Distributed File System (HDFS).
2. Initialization: Initialize the centroids by randomly selecting k points from the input data. These k points will be the initial centroids for the K-means algorithm.
3. Map Step: In the Map step, each data point is assigned to the nearest centroid. The distance metric used for measuring the distance between data points and centroids can be Euclidean distance or Manhattan distance.
4. Reduce Step: In the Reduce step, the mean of all the data points assigned to a centroid is calculated. This mean becomes the new centroid for that cluster.
5. Repeat Steps 3 and 4: The Map and Reduce steps are repeated until the centroids converge. Convergence can be checked by comparing the new centroids with the old centroids; if the new centroids are the same as the old centroids, the algorithm has converged.
6. Output Results: The final output of the algorithm is the set of k centroids.
Implementing K-means clustering using MapReduce can be computationally expensive due to the MapReduce
overhead, but it can be parallelized to process large datasets efficiently. The scalability of MapReduce makes it
an ideal framework for big data processing and analysis.
Program:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class KMeansMR implements Tool {
    private Configuration conf;

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new KMeansMR(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        if (args.length < 3) {
Experiment-7
Objective: Downloading, installing, and practicing Apache Hive.
1. First we downloaded the latest version of Apache Hive from the official website:
https://hive.apache.org/downloads.html
2. Then we have extracted the downloaded archive to a directory of our choice.
3. Then we set the environment variables HIVE_HOME and PATH to the location where we
have extracted Hive. For example, we have extracted it to /usr/local/hive, so we have added the
following lines to .bashrc or .bash_profile
file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
4. We Started the Hive metastore service by running the following command:
$HIVE_HOME/bin/hive --service metastore &
5. Start the Hive server by running the following command:
$HIVE_HOME/bin/hiveserver2 &
6. Now we have a running instance of Hive that we can connect to and use.
To practice using Hive, we have used the Hive command-line interface (CLI):
Create a table:
id INT,
name STRING,
age INT
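A complete HiveQL statement for such a table might look like the following (the table name students and the field delimiter are assumptions made for illustration):
CREATE TABLE students (
  id INT,
  name STRING,
  age INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';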
Experiment-8
Objective: Downloading, installing, and practicing Apache HBase with the Thrift API.
Download the latest version of Apache HBase from the official website and extract the downloaded archive to a directory of your choice.
Set the environment variable HBASE_HOME to the location where you extracted HBase. For example, if
you extracted it to /usr/local/hbase, you would add the following line to your .bashrc or .bash_profile file:
export HBASE_HOME=/usr/local/hbase
Start HBase by running the following command:
$HBASE_HOME/bin/start-hbase.sh
You should now have a running instance of HBase that you can connect to and use.
Download the latest version of Apache Thrift from the official website:
https://thrift.apache.org/download
Install the required dependencies for your platform. This will typically include a C++ compiler, Python, and
the development headers for your operating system.
./configure
make
make install
To practice using HBase and Thrift, you can try out some basic operations, such as the ones sketched below.
Create a table:
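For illustration, a table could be created and populated from the HBase shell as follows (the table name users and the column family info are assumptions):
create 'users', 'info'
put 'users', 'row1', 'info:name', 'Alice'
scan 'users'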
Query the data using the Thrift API. First, generate the Thrift bindings for your language of choice:
thrift -r -gen <language> /path/to/HBase.thrift
Experiment-9
Objective: Import and export data from various databases (MySQL, PostgreSQL, MongoDB, and Oracle).
Importing and exporting data from databases is an essential part of data management and analysis. Here are some examples of how to import and export data from various databases.
To import data from MySQL, you can use the MySQL command line client or a graphical client like MySQL
Workbench. Here is an example of how to import data using the command line client:
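Using the same placeholder names as the export example below (username, dbname, data.sql), the import command would typically be:
mysql -u username -p dbname < data.sql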
This command will import the data from the file data.sql into the database dbname.
To export data from MySQL, you can use the mysqldump command. Here is an example of how to
export data using the mysqldump command:
mysqldump -u username -p dbname > data.sql
This command will export the data from the database dbname into the file data.sql.
To import data from PostgreSQL, you can use the psql command line client. Here is an example of how to
import data using the psql client:
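A typical form, with username, dbname, and data.sql as placeholders, would be:
psql -U username -d dbname -f data.sql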
This command will import the data from the file data.sql into the database dbname.
To export data from PostgreSQL, you can use the pg_dump command. Here is an example of how to export
data using the pg_dump command:
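A typical form, with the same placeholder names, would be:
pg_dump -U username -d dbname -t tablename -f data.sql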
This command will export the data from the table tablename in the database dbname into the specified output file.
To import data into MongoDB, you can use the mongoimport command line tool. Here is an example of how to
import data using the mongoimport tool:
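A typical form, with dbname, collectionname, and data.json as placeholders, would be:
mongoimport --db dbname --collection collectionname --file data.json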
This command will import the data from the file data.json into the collection collectionname in the database
dbname.
To export data from MongoDB, you can use the mongoexport command line tool. Here is an example of
how to export data using the mongoexport tool:
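A typical form, with the same placeholders, would be:
mongoexport --db dbname --collection collectionname --out data.json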
This command will export the data from the collection collectionname in the database dbname into the specified output file.
To import data into Oracle, you can use the SQL*Loader utility. Here is an example of how to import data using SQL*Loader:
sqlldr username/password@dbname control=loader.ctl
This command will import the data specified in the control file loader.ctl into the database dbname. To export data from Oracle, you can use the Data Pump Export utility. Here is an example of how to export data using Data Pump Export:
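A typical command, matching the description below (the directory object DATA_PUMP_DIR is an assumption), would be:
expdp username/password@dbname tables=tablename directory=DATA_PUMP_DIR dumpfile=data.dmp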
This command will export the data from the table tablename in the database dbname into the file
data.dmp. The directory parameter specifies the directory where the file will be written.
Experiment-10
Objective: Write PIG commands: write Pig Latin scripts to sort, group, join, project, and filter your data.
Sorting data
To sort data in Pig Latin, you can use the ORDER BY clause followed by the name of the column you
want to sort by. For example, the following script sorts data by the age column in ascending order:
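A sketch, assuming a relation named data has already been loaded and contains an age column:
sorted_data = ORDER data BY age ASC;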
Grouping data
To group data in Pig Latin, you can use the GROUP BY clause followed by the name of the column you want
to group by. For example, the following script groups data by the gender column:
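A sketch, again assuming a loaded relation data with a gender column:
grouped_data = GROUP data BY gender;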
Joining data
To join data in Pig Latin, you can use the JOIN clause followed by the names of the relations you want to
join and the join condition. For example, the following script joins two relations (data1 and data2) on the
name column:
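A sketch, assuming two loaded relations data1 and data2 that both contain a name column:
joined_data = JOIN data1 BY name, data2 BY name;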
Projecting data
To project specific columns in Pig Latin, you can use the FOREACH clause followed by the name of the
columns you want to project. For example, the following script projects the name and age columns:
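A sketch, assuming data contains name and age columns:
projected_data = FOREACH data GENERATE name, age;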
Filtering data
To filter data in Pig Latin, you can use the FILTER clause followed by the condition you want to filter by.
For example, the following script filters data where the age column is greater than or equal to 18:
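A sketch, assuming data contains an age column:
filtered_data = FILTER data BY age >= 18;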
Run the Pig Latin scripts to find the maximum temperature for each and every year.
This script loads data from a file containing temperature readings for each day, including the year, month, day, and temperature. It then uses Pig Latin operations such as LOAD, FOREACH, GROUP, MAX, and STORE to compute the maximum temperature for each year.
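A sketch of such a script, assuming a comma-separated input file named temperature_data.txt with year, month, day, and temperature fields:
records = LOAD 'temperature_data.txt' USING PigStorage(',') AS (year:int, month:int, day:int, temp:double);
year_temp = FOREACH records GENERATE year, temp;
grouped = GROUP year_temp BY year;
max_temp = FOREACH grouped GENERATE group AS year, MAX(year_temp.temp) AS max_temperature;
STORE max_temp INTO 'max_temp_by_year' USING PigStorage(',');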