Big Data


Experiment 1

Objective: Downloading and installing Hadoop. Understanding different Hadoop modes, startup scripts,
configuration files.

THEORY:

1. Stand-Alone Mode: Hadoop is distributed software designed to run on a cluster of commodity machines. However, it can also be installed on a single node in stand-alone mode, where Hadoop runs as a single monolithic Java process. This mode is useful for debugging: you can first test a MapReduce application on small data before executing it on a cluster with big data.

2. Pseudo-Distributed Mode: In this mode, too, Hadoop is installed on a single node, but its various daemons run as separate Java processes. Hence all the daemons, namely NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker, run on a single machine.

3. Fully Distributed Mode: In fully distributed mode, the NameNode, JobTracker, and SecondaryNameNode (optional, and often run on a separate node) daemons run on the master node, while the DataNode and TaskTracker daemons run on the slave nodes.

Downloading and Installing Hadoop:

To download and install Hadoop, follow these steps:

Go to the Apache Hadoop website and download the latest stable release of Hadoop. Extract the downloaded
archive to a desired directory.

Set the environment variables for Hadoop in the hadoop-env.sh file located in the etc/hadoop directory. You
will need to set JAVA_HOME to the path of your Java installation and HADOOP_HOME to the path of your
Hadoop installation.
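For example, assuming Hadoop was extracted to /usr/local/hadoop and OpenJDK 8 is installed under /usr/lib/jvm (adjust both paths to your own system), the entries would look like this:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin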

Configure the Hadoop cluster by editing the core-site.xml, hdfs-site.xml, and mapred-site.xml files located in
the etc/hadoop directory.
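As an illustration, a minimal core-site.xml for a single-node (pseudo-distributed) setup typically points the default file system at a local HDFS instance; the host and port below are common defaults, not requirements:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>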

Understanding Different Hadoop Modes:

There are three different modes in which Hadoop can run:

Standalone Mode: This mode is useful for development and testing purposes. In this mode, Hadoop runs on a
single node and does not require any Hadoop daemons to be running.

Pseudo-Distributed Mode: In this mode, Hadoop runs on a single machine, but it simulates a cluster by running
all the Hadoop daemons on the same machine. This mode is useful for development and testing purposes.

Fully-Distributed Mode: In this mode, Hadoop runs on a cluster of machines. Each machine in the cluster runs a
set of Hadoop daemons, and the cluster is managed by a single master node.

Startup Scripts:

Hadoop provides several scripts to start and stop the various Hadoop daemons. These scripts are located in the
sbin directory in the Hadoop installation directory. Some of the important scripts are:

start-dfs.sh: This script starts the Hadoop Distributed File System (HDFS) daemons.
start-yarn.sh: This script starts the Yet Another Resource Negotiator (YARN) daemons.
stop-dfs.sh: This script stops the HDFS daemons.

stop-yarn.sh: This script stops the YARN daemons.
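For example, on a pseudo-distributed setup you can start both sets of daemons and then check that they are running with the JDK's jps tool (assuming the sbin directory is on your PATH, as in the export lines above); the exact daemon list depends on your configuration:

start-dfs.sh
start-yarn.sh
jps    # should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager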


Configuration Files:

Hadoop uses XML-based configuration files to configure the various Hadoop daemons.
These files are located in the etc/hadoop directory in the Hadoop installation directory.
Some of the important configuration files are:
core-site.xml: This file contains configuration settings that are common to both HDFS and YARN, such as the
default file system URI and the Hadoop temporary directory.

hdfs-site.xml: This file contains configuration settings specific to the HDFS, such as the replication factor and
block size.

mapred-site.xml: This file contains configuration settings specific to the MapReduce framework, such as the
number of map and reduce tasks that can run concurrently.

These configuration files can be edited to change the configuration settings for the various Hadoop daemons.
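
As a sketch, a single-node hdfs-site.xml commonly lowers the replication factor to 1, and mapred-site.xml points MapReduce at YARN; the values below are typical examples rather than required settings:

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
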
Experiment-2

Objective: Implement the following file management tasks in Hadoop:

1. Adding files and directories

2. Retrieving files

3. Deleting files

1. Adding Files and Directories to HDFS-

Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first.

Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't created automatically, though, so let's create it with the mkdir command.

Commands:

1. Create a directory in HDFS at the given path(s).

Usage:

hadoop fs -mkdir <paths>

Example:

hadoop fs -mkdir /user/aditya/dir1 /user/aditya/dir2

2. List the contents of a directory.

Usage:

hadoop fs -ls <args>

Example:

hadoop fs -ls /user/aditya

3. Copy a file from HDFS to the local file system:

Usage:

hadoop fs -copyToLocal <hdfs source> <local destination>

Example:

hadoop fs -copyToLocal /user/aditya/myfile /path/to/local/destination

4. Copy a file from the local file system into HDFS:

Usage:

hadoop fs -put <local source> <hdfs destination>

Example:

hadoop fs -put /path/to/local/file /user/aditya/myfile
2. Retrieving Files from HDFS-

Retrieving files from HDFS involves reading data from the distributed file system and copying it to the local
file system. Hadoop provides several command-line tools to retrieve files from HDFS.

Commands:

1. Display the contents of a file on the console:

Usage:

hadoop fs -cat <file>

Example:

hadoop fs -cat /user/data/sampletext.txt

2. Copy files from HDFS to the local file system:

Usage:

hadoop fs -get <hdfs-source-path> <local-destination-path>

Example:

hadoop fs -get /user/myusername/myfile.txt /path/to/local/myfile.txt

3. Move files within HDFS:

Usage:

hadoop fs -mv <source> <destination>

Example:

hadoop fs -mv /user/myusername/source/myfile.txt /user/myusername/destination/myfile.txt

3. Deleting Files from HDFS-

Deleting files from HDFS involves removing data from the distributed file system. Hadoop provides several command-line tools for this, including hadoop fs -rm, hadoop fs -rmr, hadoop fs -expunge, and others.

Commands:

1. Delete a file from HDFS:

Usage:

hadoop fs -rm <HDFS file path>

Example:

hadoop fs -rm /user/myusername/myfile

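For reference, a recursive delete of a directory and an explicit emptying of the trash (the directory path here is just an example) look like this:

hadoop fs -rm -r /user/myusername/dir1
hadoop fs -expunge
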
Experiment-3

Objective: Implementation of Matrix Multiplication with Hadoop Map Reduce.


Map function:

public static class Map extends Mapper<LongWritable, Text, Text, Text> {
    private Text outputKey = new Text();
    private Text outputValue = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Parse input values: matrix name, row, column, value
        // (NUM_ROWS_A, NUM_COLS_A, NUM_ROWS_B and NUM_COLS_B are matrix-dimension constants of the enclosing class)
        String[] line = value.toString().split(",");
        String matrix = line[0];
        int row = Integer.parseInt(line[1]);
        int col = Integer.parseInt(line[2]);
        int val = Integer.parseInt(line[3]);

        // Emit intermediate key-value pairs
        if (matrix.equals("A")) {
            for (int k = 0; k < NUM_COLS_B; k++) {
                outputKey.set(row + "," + k);
                outputValue.set(matrix + "," + col + "," + val);
                context.write(outputKey, outputValue);
            }
        } else {
            for (int i = 0; i < NUM_ROWS_A; i++) {
                outputKey.set(i + "," + col);
                outputValue.set(matrix + "," + row + "," + val);
                context.write(outputKey, outputValue);
            }
        }
    }
}

Reduce function:

public static class Reduce extends Reducer<Text, Text, Text, IntWritable> {
    private IntWritable outputValue = new IntWritable();

    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Initialize variables
        int[] vectorA = new int[NUM_COLS_A];
        int[] vectorB = new int[NUM_ROWS_B];

        // Parse input values
        for (Text value : values) {
            String[] line = value.toString().split(",");
            String matrix = line[0];
            int row = Integer.parseInt(line[1]);
            int val = Integer.parseInt(line[2]);
            if (matrix.equals("A")) {
                vectorA[row] = val;
            } else {
                vectorB[row] = val;
            }
        }

        // Compute dot product and emit final key-value pair
        int sum = 0;
        for (int i = 0; i < NUM_COLS_A; i++) {
            sum += vectorA[i] * vectorB[i];
        }
        outputValue.set(sum);
        context.write(key, outputValue);
    }
}

Driver program:

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "matrix multiplication");
    job.setJarByClass(MatrixMultiplication.class);

    // Set input and output paths
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Set mapper and reducer classes
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    // Set map output and final output key-value classes
    // (the mapper emits Text values, while the reducer emits IntWritable values)
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Submit job and wait for completion
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Experiment-4

Objective: Write a MapReduce program that mines weather data. Hint: Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.

PROGRAM LOGIC:

The program follows the structure of Word Count, a simple program that counts the number of occurrences of each word in a given text input data set. Word Count fits the MapReduce programming model very well, making it a good template for the Hadoop Map/Reduce programming style. Our weather-mining implementation consists of three main parts:
1. Mapper

2. Reducer

3. Main program

Step 1: Write a Mapper

The Mapper extends org.apache.hadoop.mapreduce.Mapper and overrides the map function. It processes input pairs of <line_number, line_of_text> and outputs <temperature, count> for each temperature reading in the text.
Pseudo-code:
void Map (key, value){
    for each max_temp x in value:
        output.collect(x, 1);
}
void Map (key, value){
    for each min_temp x in value:
        output.collect(x, 1);
}

Step 2: Write a Reducer

A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the reducer sums up the occurrences of each temperature reading, producing pairs of the form <temperature, occurrence>.
Pseudo-code:
void Reduce (max_temp, <list of value>){
    for each x in <list of value>:
        sum += x;
    final_output.collect(max_temp, sum);
}
void Reduce (min_temp, <list of value>){
    for each x in <list of value>:
        sum += x;
    final_output.collect(min_temp, sum);
}
Step 3: Write a Driver
The driver program configures and runs the MapReduce job. We use the main program to perform basic configuration such as:
Job Name: the name of this job.
Executable (Jar) Class: the main executable class; here, WordCount.
Mapper Class: the class which overrides the "map" function; here, Map.
Reducer Class: the class which overrides the "reduce" function; here, Reduce.
Output Key: the type of the output key; here, Text.
Output Value: the type of the output value; here, IntWritable.
File Input Path
File Output Path
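As a concrete (hypothetical) Java sketch of the pseudo-code above, the mapper below emits a <temperature, 1> pair for each reading and the reducer sums the counts, assuming each input line is a simple "station,date,temperature" record; a real weather log would need its own parsing, and the imports and driver follow the same pattern as the Word Count program in Experiment 5.

public static class TempCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Assumed record layout: "station,date,temperature" -- adapt the field index to the actual log format
        String[] fields = value.toString().split(",");
        context.write(new Text(fields[2].trim()), ONE);
    }
}

public static class TempCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the occurrences of each temperature reading
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}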

INPUT/OUTPUT:
Set of Weather Data over the years:
Experiment : 5
Objective: To run a basic Word Count program using the MapReduce paradigm to understand how it
works.
Map Function: The map function tokenizes each line of input text into words and emits a key-value pair for each
word with a value of one.

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}

Reduce function: The reduce function sums up the counts for each word and emits a final key-value pair with
the word and its total count.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Driver program: The driver program sets up the job configuration, specifying the input and output paths, the mapper and reducer classes, and the output key-value types. It then runs the job and waits for its completion.

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Experiment : 6

Implementation of K-means clustering using Map Reduce


K-means clustering is an unsupervised machine learning algorithm that is widely used for data clustering and analysis. MapReduce is a programming model and software framework for processing large datasets in a distributed and parallel manner, and it can be used to implement the K-means clustering algorithm for big data.

The following are the steps to implement K-means clustering using MapReduce:

Input Data Preparation: The input data should be split into multiple blocks and stored in a distributed file
system such as Hadoop Distributed File System (HDFS).

Initialization: Initialize the centroids by randomly selecting k points from the input data. These k points will be
the initial centroids for the K-means algorithm.

Map Step: In the Map step, each data point is assigned to the nearest centroid. The distance metric used for
measuring the distance between data points and centroids can be Euclidean distance or Manhattan distance.

Reduce Step: In the Reduce step, the mean of all the data points assigned to a centroid is calculated. This mean
becomes the new centroid for that cluster.

Repeat Steps 3 and 4: The Map and Reduce steps are repeated until the centroids converge. The convergence
can be checked by comparing the new centroids with the old centroids. If the new centroids are the same as the
old centroids, the algorithm has converged.

Output Results: The final output of the algorithm is the set of k centroids.

Implementing K-means clustering using MapReduce can be computationally expensive due to the MapReduce
overhead, but it can be parallelized to process large datasets efficiently. The scalability of MapReduce makes it
an ideal framework for big data processing and analysis.

Program :

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class KMeansMR implements Tool {
    private Configuration conf;

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new KMeansMR(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("Usage: KMeansMR <input> <output> <k>");
            return 2;
        }
        Path inputPath = new Path(args[0]);
        Path outputPath = new Path(args[1]);
        int k = Integer.parseInt(args[2]);

        // Set the number of clusters
        conf.setInt("k", k);

        Job job = Job.getInstance(getConf());
        job.setJarByClass(KMeansMR.class);
        job.setMapperClass(KMeansMapper.class);
        job.setReducerClass(KMeansReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public Configuration getConf() {
        return conf;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    public static class KMeansMapper extends Mapper<Object, Text, IntWritable, Text> {
        private List<Double[]> centroids;

        protected void setup(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            int k = conf.getInt("k", 2);
            centroids = new ArrayList<>();
            // Read the current centroids from a file in HDFS
            Path centroidsPath = new Path("hdfs://path/to/centroids.txt");
            FileSystem fs = FileSystem.get(conf);
            FSDataInputStream in = fs.open(centroidsPath);
            BufferedReader reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                Double[] centroid = new Double[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    centroid[i] = Double.parseDouble(parts[i]);
                }
                centroids.add(centroid);
            }
            reader.close();
            in.close();
        }

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            Double[] point = new Double[parts.length];
            for (int i = 0; i < parts.length; i++) {
                point[i] = Double.parseDouble(parts[i]);
            }
            // Assign the point to the nearest centroid (Euclidean distance)
            int closest = 0;
            double closestDistance = Double.MAX_VALUE;
            for (int i = 0; i < centroids.size(); i++) {
                Double[] centroid = centroids.get(i);
                double distance = 0.0;
                for (int j = 0; j < centroid.length; j++) {
                    distance += Math.pow(centroid[j] - point[j], 2);
                }
                distance = Math.sqrt(distance);
                if (distance < closestDistance) {
                    closest = i;
                    closestDistance = distance;
                }
            }
            context.write(new IntWritable(closest), value);
        }
    }

    public static class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        public void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<Double[]> points = new ArrayList<>();
            for (Text value : values) {
                String[] parts = value.toString().split(",");
                Double[] point = new Double[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    point[i] = Double.parseDouble(parts[i]);
                }
                points.add(point);
            }
            // The new centroid is the mean of all points assigned to this cluster
            Double[] centroid = computeCentroid(points);
            context.write(key, new Text(arrayToString(centroid)));
        }

        private Double[] computeCentroid(List<Double[]> points) {
            int numDimensions = points.get(0).length;
            Double[] centroid = new Double[numDimensions];
            for (int i = 0; i < numDimensions; i++) {
                double sum = 0.0;
                for (int j = 0; j < points.size(); j++) {
                    sum += points.get(j)[i];
                }
                centroid[i] = sum / points.size();
            }
            return centroid;
        }

        private String arrayToString(Double[] array) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < array.length; i++) {
                sb.append(array[i]);
                if (i < array.length - 1) {
                    sb.append(",");
                }
            }
            return sb.toString();
        }
    }
}
Experiment : 7

Installation of Hive along with practice examples.

To install Apache Hive, we can follow these steps:

1. First, we downloaded the latest version of Apache Hive from the official website:
https://hive.apache.org/downloads.html
2. Then we extracted the downloaded archive to a directory of our choice.
3. Next, we set the environment variables HIVE_HOME and PATH to the location where we extracted Hive. For example, having extracted it to /usr/local/hive, we added the following lines to the .bashrc or .bash_profile file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
4. We started the Hive metastore service by running the following command:
$HIVE_HOME/bin/hive --service metastore &
5. We started the Hive server by running the following command:
$HIVE_HOME/bin/hiveserver2 &
6. Now an instance of Hive is running that we can connect to and use.
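With HiveServer2 running, you can also connect through Beeline (the JDBC client shipped with Hive); the URL below assumes the default port 10000 on the local machine:

$HIVE_HOME/bin/beeline -u jdbc:hive2://localhost:10000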

To practice using Hive, we have used the Hive command-line interface (CLI):

1. Create a table:

CREATE TABLE example_table (
  id INT,
  name STRING,
  age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

2. Load some data into the table:

LOAD DATA LOCAL INPATH '/path/to/data' INTO TABLE example_table;

3. Query the table:

SELECT * FROM example_table WHERE age > 20;

4. Create a new table based on the query results:

CREATE TABLE new_table AS
SELECT name, age FROM example_table WHERE age > 20;


Experiment : 8

Installation of HBase and Thrift, along with practice examples

To install Apache HBase, you can follow these steps:

Download the latest version of Apache HBase from the official website:
https://hbase.apache.org/downloads.html

Extract the downloaded archive to a directory of your choice.

Set the environment variable HBASE_HOME to the location where you extracted HBase. For example, if
you extracted it to /usr/local/hbase, you would add the following line to your .bashrc or .bash_profile file:
export HBASE_HOME=/usr/local/hbase

Start HBase by running the following command:

$HBASE_HOME/bin/start-hbase.sh

You should now have a running instance of HBase that you can connect to and use.

To install Apache Thrift, you can follow these steps:

Download the latest version of Apache Thrift from the official website:
https://thrift.apache.org/download

Extract the downloaded archive to a directory of your choice.

Install the required dependencies for your platform. This will typically include a C++ compiler, Python, and
the development headers for your operating system.

Build and install Thrift by running the following commands:

./configure
make
make install

You should now have Thrift installed and ready to use.

To practice using HBase and Thrift, you can try out some examples.

Start the HBase shell by running the following command:

$HBASE_HOME/bin/hbase shell

Create a table:

create 'example_table', 'cf'

Insert some data into the table:

put 'example_table', 'row1', 'cf:name', 'Alice'
put 'example_table', 'row1', 'cf:age', '30'
put 'example_table', 'row2', 'cf:name', 'Bob'
put 'example_table', 'row2', 'cf:age', '25'

Query the data using the Thrift API. First, generate the Thrift bindings for your language of choice:

thrift -r -gen <language> /path/to/HBase.thrift

For example, to generate Java bindings:

thrift -r -gen java /path/to/HBase.thrift
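Note that clients generated this way talk to HBase's Thrift server, which must be started separately (by default it listens on port 9090):

$HBASE_HOME/bin/hbase-daemon.sh start thrift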


Experiment 9

Practice importing and exporting data from various databases.

Importing and exporting data from databases is an essential part of data management and analysis. Here are
some examples of how to import and export data from various databases.

Importing and exporting data from MySQL

To import data from MySQL, you can use the MySQL command line client or a graphical client like MySQL
Workbench. Here is an example of how to import data using the command line client:

mysql -u username -p dbname < data.sql

This command will import the data from the file data.sql into the database dbname.

To export data from MySQL, you can use the mysqldump command. Here is an example of how to
export data using the mysqldump command:
mysqldump -u username -p dbname > data.sql

This command will export the data from the database dbname into the file data.sql.

Importing and exporting data from PostgreSQL

To import data from PostgreSQL, you can use the psql command line client. Here is an example of how to
import data using the psql client:

psql -U username -d dbname -f data.sql

This command will import the data from the file data.sql into the database dbname.

To export data from PostgreSQL, you can use the pg_dump command. Here is an example of how to export
data using the pg_dump command:

pg_dump -U username -d dbname -t tablename -f data.sql

This command will export the data from the table tablename in the database dbname into the file data.sql.

Importing and exporting data from MongoDB

To import data into MongoDB, you can use the mongoimport command line tool. Here is an example of how to
import data using the mongoimport tool:

mongoimport --host localhost --db dbname --collection collectionname --file data.json

This command will import the data from the file data.json into the collection collectionname in the database
dbname.

To export data from MongoDB, you can use the mongoexport command line tool. Here is an example of
how to export data using the mongoexport tool:

mongoexport --host localhost --db dbname --collection collectionname --out data.json

This command will export the data from the collection collectionname in the database dbname into the file data.json.

Importing and exporting data from Oracle

To import data into Oracle, you can use the SQL*Loader utility. Here is an example of how to import data using SQL*Loader:

sqlldr username/password@dbname control=loader.ctl

This command will import the data specified in the control file loader.ctl into the database dbname.

To export data from Oracle, you can use the Data Pump Export utility. Here is an example of how to export data using Data Pump Export:

expdp username/password@dbname tables=tablename directory=export_dir dumpfile=data.dmp

This command will export the data from the table tablename in the database dbname into the file
data.dmp. The directory parameter specifies the directory where the file will be written.
Experiment : 10

Write Pig commands: Write Pig Latin scripts to sort, group, join, project, and filter your data.

Sorting data

To sort data in Pig Latin, you can use the ORDER BY clause followed by the name of the column you
want to sort by. For example, the following script sorts data by the age column in ascending order:

data = LOAD 'input_data' AS (name:chararray, age:int, gender:chararray);
sorted_data = ORDER data BY age ASC;
STORE sorted_data INTO 'output_data';

Grouping data

To group data in Pig Latin, you can use the GROUP BY clause followed by the name of the column you want
to group by. For example, the following script groups data by the gender column:

data = LOAD 'input_data' AS (name:chararray, age:int, gender:chararray);
grouped_data = GROUP data BY gender;
STORE grouped_data INTO 'output_data';

Joining data

To join data in Pig Latin, you can use the JOIN clause followed by the names of the relations you want to
join and the join condition. For example, the following script joins two relations (data1 and data2) on the
name column:

data1 = LOAD 'input_data1' AS (name:chararray, age:int);

data2 = LOAD 'input_data2' AS (name:chararray, gender:chararray);

joined_data = JOIN data1 BY name, data2 BY name;

STORE joined_data INTO 'output_data';

Projecting data

To project specific columns in Pig Latin, you can use the FOREACH clause followed by the name of the
columns you want to project. For example, the following script projects the name and age columns:

data = LOAD 'input_data' AS (name:chararray, age:int, gender:chararray);
projected_data = FOREACH data GENERATE name, age;
STORE projected_data INTO 'output_data';

Filtering data

To filter data in Pig Latin, you can use the FILTER clause followed by the condition you want to filter by.
For example, the following script filters data where the age column is greater than or equal to 18:

data = LOAD 'input_data' AS (name:chararray, age:int, gender:chararray);
filtered_data = FILTER data BY age >= 18;
STORE filtered_data INTO 'output_data';
Pig Latin script that demonstrates multiple operations and functions.
-- Load data from input file
data = LOAD '/path/to/input/file' USING PigStorage(',') AS (id:int, name:chararray, age:int, gender:chararray, salary:float);
-- Filter data to include only male employees
male_data = FILTER data BY gender == 'Male';
-- Group male employees by age
grouped_data = GROUP male_data BY age;
-- Calculate average salary for each age group
avg_salary = FOREACH grouped_data GENERATE group AS age, AVG(male_data.salary) AS avg_salary;
-- Sort the results by average salary in descending order
sorted_data = ORDER avg_salary BY avg_salary DESC;
-- Limit the output to the top 10 age groups with highest average salary
top_10_data = LIMIT sorted_data 10;
-- Store the results to output file
STORE top_10_data INTO '/path/to/output/file';
This script loads data from an input file and performs the following operations:
* Filters the data to include only male employees.
* Groups the male employees by age.
* Calculates the average salary for each age group.
* Sorts the results by average salary in descending order.
* Limits the output to the top 10 age groups with the highest average salary.
* Stores the results in an output file.
This script combines multiple Pig Latin operations, including LOAD, FILTER, GROUP, FOREACH,
GENERATE, ORDER, LIMIT, and STORE, and uses built-in functions such as AVG() to perform calculations
on the data.
Experiment : 11

Run the Pig Latin scripts to find word count.

Word count from a single file:

-- Load data from input file

data = LOAD '/path/to/input/file' USING TextLoader();

-- Split each line into words

words = FOREACH data GENERATE FLATTEN(TOKENIZE($0)) AS word;

-- Group the words and count the occurrences
word_count = GROUP words BY word;

word_count = FOREACH word_count GENERATE group AS word, COUNT(words) AS count;

-- Store the results to output file

STORE word_count INTO '/path/to/output/file';


This script loads data from a text file and performs the following operations:

* Splits each line into words.


* Groups the words and counts the occurrences.
* Stores the results in an output file.

Word count from multiple files:

-- Load data from input files


data = LOAD '/path/to/input/files/*' USING TextLoader();
-- Split each line into words
words = FOREACH data GENERATE FLATTEN(TOKENIZE($0)) AS word;
-- Group the words and count the occurrences
word_count = GROUP words BY word;
word_count = FOREACH word_count GENERATE group AS word, COUNT(words) AS count;
-- Store the results to output file
STORE word_count INTO '/path/to/output/file';
This script loads data from multiple text files using the file glob pattern (*) and performs the same word count operations as the previous script. The only difference is the input source.
Experiment : 12

Run the Pig Latin Scripts to find a max temp for each and every year.

-- Load data from input file
data = LOAD '/path/to/input/file' USING PigStorage(',') AS (year:int, month:int, day:int, temperature:float);
-- Extract the year from the date
data = FOREACH data GENERATE year, temperature;
-- Group the data by year
grouped_data = GROUP data BY year;
-- Calculate the maximum temperature for each year
max_temp = FOREACH grouped_data GENERATE group AS year, MAX(data.temperature) AS max_temp;
-- Store the results to output file
STORE max_temp INTO '/path/to/output/file';

This script loads data from a file containing temperature readings for each day, including the year, month, day, and temperature. The script performs the following operations:

* Extracts the year from the date.


* Groups the data by year.
* Calculates the maximum temperature for each year.
* Stores the results in an output file.

This script uses Pig Latin operations such as LOAD, FOREACH, GROUP, MAX, and STORE to perform the
calculations.
