Map Reduce Workflow Colloquium
Colloquium Report on Map Reduce Workflow
Session 2018-2019
Submitted By
JATIN PARASHAR
Roll Number
1703914906
AFFILIATED TO
Dr. A.P.J. Abdul Kalam Technical University (APJAKTU), LUCKNOW
STUDENT'S DECLARATION
I hereby declare that the study done by me on the colloquium topic presented
in this report, entitled Map Reduce Workflow, is an authentic record of work
carried out under the supervision of Ms. JYOTI CHAUDHARY.
The matter embodied in this report has not been submitted by me for the award
of MASTER OF COMPUTER APPLICATIONS DEGREE.
This is to certify that the above statement made by the candidate is correct to
the best of my knowledge.
Name: Name:
Department: Department:
Designation:
Date:……………. Date:…………….
ACKNOWLEDGEMENT
I am heartily grateful to the members of IMR College for cooperating with me and
guiding me at each and every step throughout the making of this Colloquium. I take
it as a great privilege to avail this opportunity to express my deep gratitude to all
those who helped me and guided me during my training period.
TABLE OF CONTENTS
1. MapReduce Overview
2. What is BigData?
3. MapReduce Working
4. MapReduce-Example
5. Significance
6. MapReduce algorithm
7. Sorting
8. Searching
9. Indexing
ABSTRACT
Fig.1
Processing huge volumes of data on a single, centralized system creates a bottleneck.
Google solved this bottleneck issue using an algorithm called MapReduce.
MapReduce divides a task into small parts and assigns them to many computers.
Later, the results are collected at one place and integrated to form the result
dataset.
3. MapReduce Working
The MapReduce algorithm contains two important tasks, namely Map and
Reduce.
The Map task takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines
those data tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.
Let us now take a close look at each of the phases and try to understand their
significance.
Fig. 3
Input Phase − Here we have a Record Reader that translates each record
in an input file and sends the parsed data to the mapper in the form of key-
value pairs.
Map − Map is a user-defined function, which takes a series of key-value
pairs and processes each one of them to generate zero or more key-value
pairs.
Intermediate Keys − The key-value pairs generated by the mapper are
known as intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar
data from the map phase into identifiable sets. It takes the intermediate
keys from the mapper as input and applies a user-defined code to aggregate
the values in a small scope of one mapper. It is not a part of the main
MapReduce algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step.
It downloads the grouped key-value pairs onto the local machine, where
the Reducer is running. The individual key-value pairs are sorted by key
into a larger data list. The data list groups the equivalent keys together so
that their values can be iterated easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input
and runs a Reducer function on each one of them. Here, the data can be
aggregated, filtered, and combined in a number of ways, which can require a
wide range of processing. Once the execution is over, it passes zero or more
key-value pairs to the final step.
Output Phase − In the output phase, we have an output formatter that
translates the final key-value pairs from the Reducer function and writes them
onto a file using a record writer.
Fig. 4
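To make the phases above concrete, the following is a minimal word-count sketch written against the standard Hadoop Java API (org.apache.hadoop.mapreduce). The class names TokenizerMapper and SumReducer are illustrative choices, not code taken from this report.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: read one line of input and emit (word, 1) for every token in the line.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // intermediate key-value pair
        }
    }
}

// Reduce: receive (word, [1, 1, ...]) after Shuffle and Sort, emit (word, total).
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Each context.write() in the mapper produces an intermediate key-value pair; the Shuffle and Sort step then groups these pairs by key before they reach the reducer.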
4. MapReduce-Example
Fig.5
As shown in the illustration, to count the words in a large set of tweets, the
MapReduce algorithm performs the following actions −
Tokenize − Tokenizes the tweets into maps of tokens and writes them as
key-value pairs.
Filter − Filters unwanted words from the maps of tokens and writes the
filtered maps as key-value pairs.
Count − Generates a token counter per word.
Aggregate Counters − Prepares an aggregate of similar counter values
into small manageable units.
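The Tokenize and Filter steps above can be sketched as a single mapper. This is only an illustration under assumed details: the stop-word list, the lower-casing, and the class name TweetTokenFilterMapper are not part of the original example.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tokenize each tweet, drop unwanted words, and emit (token, 1) pairs.
public class TweetTokenFilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Hypothetical stop-word list used for the Filter step.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "to", "of"));
    private static final IntWritable ONE = new IntWritable(1);
    private final Text token = new Text();

    @Override
    protected void map(LongWritable key, Text tweet, Context context)
            throws IOException, InterruptedException {
        for (String raw : tweet.toString().toLowerCase().split("\\s+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {   // Filter step
                token.set(raw);
                context.write(token, ONE);                       // input to the Count step
            }
        }
    }
}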
5. Significance
In MapReduce, the input is taken as a list and the output is again produced as a
list. Because of MapReduce, Hadoop is more powerful and efficient.

This is a short overview of MapReduce and of how work is divided into sub-tasks.
In this process, the total work is divided into small divisions. Each division is
processed in parallel on a cluster of servers, and each produces an individual
output. Finally, these individual outputs are combined to give the final output.
MapReduce is scalable and can be used across many computers.
6. MapReduce Algorithm
The MapReduce algorithm contains two important tasks, namely Map and
Reduce.
The map task is done by means of the Mapper class, and the reduce task is done
by means of the Reducer class.
The Mapper class takes the input, tokenizes it, maps it, and sorts it. The output of
the Mapper class is used as input by the Reducer class, which in turn searches for
matching pairs and reduces them. A minimal driver sketch showing how these
classes are wired into a job is given below.
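The following driver sketch assumes Mapper and Reducer classes such as the TokenizerMapper and SumReducer shown earlier; the class name WordCountDriver and the path arguments are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);   // map task
        job.setCombinerClass(SumReducer.class);      // optional local reducer
        job.setReducerClass(SumReducer.class);       // reduce task

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The call to job.setCombinerClass() is optional; it simply reuses the reducer as the local combiner described in the phase list of Section 3.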
MapReduce implements various mathematical algorithms to divide a task into
small parts and assign them to multiple systems. These algorithms include the
following:
Sorting
Searching
Indexing
TF-IDF
7. Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data.
MapReduce implements a sorting algorithm to automatically sort the output key-
value pairs from the mapper by their keys.
Sorting methods are implemented in the mapper class itself.
In the Shuffle and Sort phase, after tokenizing the values in the mapper class,
the Context class collects the matching-valued keys as a collection.
To collect similar key-value pairs (intermediate keys), the Mapper class takes
the help of the RawComparator class to sort the key-value pairs. A sketch of a
custom sort comparator is given below.
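As a sketch of how this customization looks in practice, a job can register its own comparator; here a WritableComparator subclass reverses the natural ordering of Text keys. The class name DescendingTextComparator is an illustrative assumption.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sorts the mapper's Text keys in descending order during Shuffle and Sort.
public class DescendingTextComparator extends WritableComparator {
    public DescendingTextComparator() {
        super(Text.class, true);   // true = create key instances for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);   // negate the natural (ascending) order
    }
}

// In the driver:
// job.setSortComparatorClass(DescendingTextComparator.class);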
8. Searching
Searching plays an important role in the MapReduce algorithm. It helps in the
combiner phase (optional) and in the Reducer phase. Let us try to understand
how searching works with the help of an example.
Example
The Map phase processes each input file and provides the employee data in
key-value pairs (<k, v> : <emp name, salary>). See the following illustration.
The combiner phase (searching technique) will accept the input from the
Map phase as key-value pairs with employee name and salary. Using the searching
technique, the combiner will check all the employee salaries to find the highest-
salaried employee in each file. A sketch of such a combiner is given below.
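The original snippet is not reproduced in this report, so the following is a minimal sketch of such a combiner. It departs from the <emp name, salary> keying described above by grouping all records under a single constant key (for example "max") with "name,salary" values; this restructuring, and the class name MaxSalaryReducer, are assumptions made for the sketch.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Used as both combiner and reducer: scans "name,salary" values under a single
// constant key and keeps the highest-salaried record it has seen.
public class MaxSalaryReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String bestRecord = null;
        long bestSalary = Long.MIN_VALUE;
        for (Text value : values) {
            String[] parts = value.toString().split(",");   // assumed "name,salary" layout
            long salary = Long.parseLong(parts[1].trim());
            if (salary > bestSalary) {                      // searching for the maximum
                bestSalary = salary;
                bestRecord = value.toString();
            }
        }
        if (bestRecord != null) {
            context.write(key, new Text(bestRecord));
        }
    }
}

Registered with both job.setCombinerClass(MaxSalaryReducer.class) and job.setReducerClass(MaxSalaryReducer.class), each mapper's local maximum is found first and the reducer then picks the overall maximum.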
9. Indexing
Normally, indexing is used to point to a particular piece of data and its address.
For example, MapReduce performs batch indexing on the input files for a
particular Mapper.
The indexing technique that is normally used in MapReduce is known as
inverted index. Search engines like Google and Bing use inverted indexing
technique. Let us try to understand how Indexing works with the help of a
simple example.
Example
The following text is the input for inverted indexing. Here T[0], T[1], and T[2]
are the file names, and their contents are given in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
After applying the Indexing algorithm, we get the following output −
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1,
2} implies the term "is" appears in the files T[0], T[1], and T[2].
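A minimal sketch of an inverted-index job under the following assumptions: the mapper recovers the file name from its input split and emits (term, file name) pairs, and the reducer collects the distinct file names per term. The class names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map: emit (term, documentName) for every term in the line.
public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String doc = ((FileSplit) context.getInputSplit()).getPath().getName();
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            context.write(new Text(itr.nextToken().toLowerCase()), new Text(doc));
        }
    }
}

// Reduce: collect the distinct document names that contain each term.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> docs, Context context)
            throws IOException, InterruptedException {
        TreeSet<String> postings = new TreeSet<>();
        for (Text doc : docs) {
            postings.add(doc.toString());
        }
        context.write(term, new Text(postings.toString()));  // e.g. is -> [T0, T1, T2]
    }
}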
TF-IDF
TF-IDF is a text processing algorithm which is short for Term Frequency −
Inverse Document Frequency. It is one of the common web analysis algorithms.
Here, the term frequency (TF) is the number of times a term appears in a
document divided by the total number of terms in that document, and the inverse
document frequency (IDF) is the logarithm of the total number of documents
divided by the number of documents containing the term.
Example
Consider a document containing 1000 words, wherein the word hive appears 50
times. The TF for hive is then (50 / 1000) = 0.05.
Now, assume we have 10 million documents and the word hive appears in 1,000
of these. Then, the IDF is calculated as log10(10,000,000 / 1,000) = 4.
The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
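The arithmetic above can be reproduced in a few lines of Java; note that the logarithm is taken to base 10, which is what makes the IDF in the example equal to 4.

public class TfIdfExample {
    public static void main(String[] args) {
        double tf = 50.0 / 1000.0;                        // term frequency of "hive"
        double idf = Math.log10(10_000_000.0 / 1_000.0);  // inverse document frequency
        double weight = tf * idf;                         // TF-IDF weight
        System.out.printf("TF = %.2f, IDF = %.0f, TF-IDF = %.2f%n", tf, idf, weight);
        // Prints: TF = 0.05, IDF = 4, TF-IDF = 0.20
    }
}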
Consider that you have the following input data for your MapReduce program:

Welcome to Hadoop Class
Hadoop is good
Hadoop is bad

The final output of the MapReduce task is:

bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1

The data goes through the following phases:
Input Splits:
An input to a MapReduce job is divided into fixed-size pieces called input splits.
An input split is a chunk of the input that is consumed by a single map task.
Mapping
This is the very first phase in the execution of a map-reduce program. In this
phase, the data in each split is passed to a mapping function to produce output
values. In our example, the job of the mapping phase is to count the number of
occurrences of each word from the input splits (more details about input splits are
given below) and prepare a list in the form of <word, frequency>.
Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate
the relevant records from the Mapping phase output. In our example, the same
words are clubbed together along with their respective frequencies.
Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase
combines values from the Shuffling phase and returns a single output value. In
short, this phase summarizes the complete dataset.
In our example, this phase aggregates the values from the Shuffling phase, i.e.,
calculates the total occurrences of each word.
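Assuming, for illustration, that each line of the sample input above forms its own split, the data flowing through these phases looks like this:

Mapping output (one list per split):
(Welcome, 1), (to, 1), (Hadoop, 1), (Class, 1)
(Hadoop, 1), (is, 1), (good, 1)
(Hadoop, 1), (is, 1), (bad, 1)

Shuffling (values grouped by key):
(bad, [1]), (Class, [1]), (good, [1]), (Hadoop, [1, 1, 1]), (is, [1, 1]), (to, [1]), (Welcome, [1])

Reducing (values summed per key):
(bad, 1), (Class, 1), (good, 1), (Hadoop, 3), (is, 2), (to, 1), (Welcome, 1)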
MapReduce Architecture explained in detail
One map task is created for each split, which then executes the map function
for each record in the split.
It is always beneficial to have multiple splits, because the time taken to
process a split is small compared to the time taken to process the
whole input. When the splits are smaller, the processing is better
load-balanced, since we are processing the splits in parallel.
However, it is also not desirable to have splits too small in size. When
splits are too small, the overhead of managing the splits and of map task
creation begins to dominate the total job execution time.
For most jobs, it is better to make the split size equal to the size of an
HDFS block (which is 64 MB by default). A configuration sketch for tuning
the split size is given after this list.
Execution of map tasks results in writing output to a local disk on the
respective node and not to HDFS.
The reason for choosing the local disk over HDFS is to avoid the replication
which takes place in the case of an HDFS store operation.
Map output is intermediate output which is processed by reduce tasks to
produce the final output.
Once the job is complete, the map output can be thrown away. So, storing
it in HDFS with replication becomes overkill.
In the event of node failure, before the map output is consumed by the
reduce task, Hadoop reruns the map task on another node and re-creates
the map output.
The reduce task doesn't work on the concept of data locality. The output of
every map task is fed to the reduce task, and the map output is transferred to
the machine where the reduce task is running.
On this machine, the output is merged and then passed to the user-defined
reduce function.
Unlike the map output, the reduce output is stored in HDFS (the first replica
is stored on the local node and other replicas are stored on off-rack
nodes). So, writing the reduce output does consume network bandwidth, but
only as much as a normal HDFS write pipeline consumes.
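As a sketch of how the split size can be tuned relative to the HDFS block size, the standard configuration properties and FileInputFormat helpers can be used; the byte values below are only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Lower bound on the split size in bytes.
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 64L * 1024 * 1024);

        Job job = Job.getInstance(conf, "split size demo");
        // Upper bound on the split size; FileInputFormat caps splits at this value.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        // ... set mapper, reducer, and input/output paths as usual, then submit the job.
    }
}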
The complete execution process (execution of both Map and Reduce tasks) is
controlled by two types of entities: a Jobtracker, which acts like a master and is
responsible for the complete execution of the submitted job, and multiple Task
Trackers, which act like slaves, each of them performing part of the job.
A job is divided into multiple tasks, which are then run on multiple data
nodes in a cluster.
It is the responsibility of the job tracker to coordinate the activity by
scheduling tasks to run on different data nodes.
Execution of an individual task is then looked after by the task tracker, which
resides on every data node executing part of the job.
The task tracker's responsibility is to send the progress report to the job
tracker.
In addition, the task tracker periodically sends a 'heartbeat' signal to the
Jobtracker so as to notify it of the current state of the system.
Thus the job tracker keeps track of the overall progress of each job. In the
event of task failure, the job tracker can reschedule it on a different task
tracker.
Joining two datasets begins by comparing the size of each dataset. If one
dataset is smaller than the other, the smaller dataset is distributed to every
data node in the cluster. Once it is distributed, either the Mapper or the
Reducer uses the smaller dataset to perform a lookup for matching records from
the large dataset and then combines those records to form output records. A
sketch of this map-side approach is given below.
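A minimal sketch of this replicated (map-side) join, under several assumptions: the smaller file DeptName.txt holds "deptId,deptName" lines, the larger input holds "deptId,strength" lines, and the driver has distributed the small file with job.addCacheFile() so that it is available in each task's working directory under its original name. The class name MapSideJoinMapper and the field layouts are assumptions for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: the small dataset is loaded into memory in setup(),
// then each record of the large dataset is joined against it in map().
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> deptNames = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // The driver is assumed to have called job.addCacheFile(new URI("/input/DeptName.txt")),
        // which makes the file available in the task's working directory under its name.
        try (BufferedReader in = new BufferedReader(new FileReader("DeptName.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",");   // assumed "deptId,deptName" layout
                deptNames.put(parts[0].trim(), parts[1].trim());
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");   // assumed "deptId,strength" layout
        String name = deptNames.getOrDefault(parts[0].trim(), "UNKNOWN");
        context.write(new Text(name), new Text(parts[1].trim()));
    }
}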
Types of Join
Depending upon the place where the actual join is performed, joins in Hadoop are
classified into:
Map-side join − the join is performed by the mapper, before the data is
consumed by the map function.
Reduce-side join − the join is performed by the reducer.
Input: The input data set consists of two text files, DeptName.txt and DeptStrength.txt.
Ensure you have Hadoop installed. Before you start with the actual process,
change the user to 'hduser' (the user id used during your Hadoop configuration).
su - hduser_
Step 1) Copy the zip file to the location of your choice
cd MapReduceJoin/
Step 4) Start Hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Step 5) DeptStrength.txt and DeptName.txt are the input files used for this
program.
In addition to built-in counters, a user can define their own counters using
facilities provided by the programming language. For example,
in Java, an 'enum' is used to define user-defined counters.
Counters Example

The snippet below is completed into a minimal mapper sketch using the older
org.apache.hadoop.mapred API; the column positions assumed for the 'country' and
'sales' fields are for illustration only.

// Inside the mapper class (old org.apache.hadoop.mapred API):
static enum SalesCounters { MISSING, INVALID }

public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException {
    String[] fields = value.toString().split(",");
    String country = fields[7];   // assumed column index of the country field
    String sales = fields[8];     // assumed column index of the sales field
    if (country.length() == 0) {
        reporter.incrCounter(SalesCounters.MISSING, 1);
    } else if (sales.startsWith("\"")) {
        reporter.incrCounter(SalesCounters.INVALID, 1);
    } else {
        output.collect(new Text(country), new Text(sales + ",1"));
    }
}
In the code snippet, if the 'country' field has zero length, then its value is missing
and hence the corresponding counter SalesCounters.MISSING is incremented.
Next, if the 'sales' field starts with a ", the record is considered INVALID.
This is indicated by incrementing the counter SalesCounters.INVALID.
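After the job finishes, the counter values can be read back in the driver. The sketch below uses the newer org.apache.hadoop.mapreduce Job API; with the older mapred API used in the snippet above, the equivalent call is RunningJob.getCounters(). The class name CounterReport is illustrative.

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

// Assumes 'job' is the completed Job; SalesCounters refers to the enum declared
// inside the mapper class above (import or qualify it accordingly).
public class CounterReport {
    public static void print(Job job) throws Exception {
        Counters counters = job.getCounters();
        long missing = counters.findCounter(SalesCounters.MISSING).getValue();
        long invalid = counters.findCounter(SalesCounters.INVALID).getValue();
        System.out.println("Records with missing country: " + missing);
        System.out.println("Records with invalid sales:   " + invalid);
    }
}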
LIST OF REFERENCES
1. https://onlineitguru.com/blog/importance-of-map-reduce-in-hadoop
2. https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
3. https://www.dummies.com/programming/big-data/hadoop/the-importance-of-mapreduce-in-hadoop/
4. https://data-flair.training/blogs/hadoop-mapreduce-tutorial/