Anatomy of a MapReduce Job


Hadoop MapReduce jobs are divided into a set of map tasks and reduce tasks that run
in a distributed fashion on a cluster of computers. Each task works on the small subset of
the data assigned to it, so that the load is spread across the cluster.
The input to a MapReduce job is a set of files stored in HDFS, where they are spread
across the cluster. In Hadoop, these files are split by an input format, which defines how to
separate a file into input splits. You can think of an input split as a byte-oriented view of
a chunk of a file, to be loaded by one map task.
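To make the idea of a byte-oriented split concrete, here is a minimal Python sketch that cuts a file of a given length into (offset, length) ranges. The function name compute_splits and the sizes used are illustrative assumptions, not Hadoop's API; a real input format also takes block boundaries and record boundaries into account.

```python
def compute_splits(file_length, split_size):
    """Return (start_offset, length) byte ranges that cover a file."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_length:
        # The last split may be shorter than split_size.
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits


# A hypothetical 300-byte file cut into 128-byte splits.
print(compute_splits(file_length=300, split_size=128))
# [(0, 128), (128, 128), (256, 44)]
```

Each range would be handed to one map task, so this example file would be processed by three map tasks.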
A map task generally performs loading, parsing, transformation and filtering
operations, whereas a reduce task is responsible for grouping and aggregating the data
produced by the map tasks to generate the final output. A wide range of
problems can be solved with this straightforward paradigm, from simple numerical
aggregation to complex join operations and Cartesian products.
Each map task in Hadoop is broken into the following phases: record reader, mapper,
combiner and partitioner. The output of the map phase, called the intermediate keys and
values, is sent to the reducers.
The reduce tasks are broken into the following phases: shuffle, sort, reducer and output
format.

The Hadoop framework tries to assign each map task to a DataNode where the
data to be processed resides. This means the data typically doesn't have to
move over the network, which saves network bandwidth. The data is processed on the
local machine itself, and such a map task is said to be data local.
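The preference for data-local placement can be sketched in a few lines of Python. This is a simplified illustration, not Hadoop's scheduler: the function assign_map_task, the node names and the slot counts are all hypothetical, and the real framework considers more than this (for example, nodes in the same rack).

```python
def assign_map_task(split_hosts, free_slots):
    """Pick a node for a map task, preferring nodes that hold the split's data.

    split_hosts: nodes storing a copy of the split's data (hypothetical names).
    free_slots:  dict mapping node name -> number of free task slots.
    Returns (node, is_data_local), or (None, False) if no slot is free.
    """
    # First choice: a node that already stores the data.
    for node in split_hosts:
        if free_slots.get(node, 0) > 0:
            free_slots[node] -= 1
            return node, True
    # Fallback: any node with a free slot; the data must then cross the network.
    for node in sorted(free_slots):
        if free_slots[node] > 0:
            free_slots[node] -= 1
            return node, False
    return None, False


slots = {"node-a": 1, "node-b": 0, "node-c": 2}
print(assign_map_task(["node-b", "node-a"], slots))  # ('node-a', True)
print(assign_map_task(["node-b", "node-a"], slots))  # ('node-c', False)
```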
At Mapper Side:
1. Record Reader:
The record reader translates an input split generated by the input format into records. The
purpose of the record reader is to parse the data into records, but not to parse the records
themselves.
It passes the data to the mapper in the form of key/value pairs. Usually the key in this
context is positional information (for example, the byte offset of a line in the file) and the
value is the chunk of data that composes a record (for example, the text of that line).
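As an illustration of what a line-oriented record reader produces, the Python sketch below turns a chunk of bytes into (byte offset, line) pairs. The function read_records is a hypothetical stand-in, and an in-memory buffer plays the role of the input split.

```python
import io


def read_records(stream):
    """Yield (byte_offset, line) pairs from a binary stream, one per line."""
    offset = 0
    for raw_line in stream:
        # Key: where the line starts. Value: the line without its newline.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


split = io.BytesIO(b"hadoop map\nreduce hadoop\n")
for key, value in read_records(split):
    print(key, value)
# 0 hadoop map
# 11 reduce hadoop
```

Note that the reader only finds record boundaries; it makes no attempt to interpret the words inside each line. That is left to the map function.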

2. Map:
The map function is the heart of the map task. It is executed on each key/value pair from
the record reader and produces zero or more key/value pairs, called intermediate pairs. The
choice of key and value depends on what the MapReduce job is
accomplishing. The data is later grouped on the key, and the value is the information
pertinent to the analysis in the reducer.
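A word count is the usual example of a map function. The Python sketch below is a plain function written for illustration (map_word_count is an assumed name, not part of Hadoop): it receives one record and emits one (word, 1) pair per word, or nothing at all for an empty line.

```python
def map_word_count(key, value):
    """Emit (word, 1) for every word in the line; the offset key is ignored."""
    for word in value.lower().split():
        yield word, 1


print(list(map_word_count(0, "Hadoop map reduce hadoop")))
# [('hadoop', 1), ('map', 1), ('reduce', 1), ('hadoop', 1)]
print(list(map_word_count(25, "")))
# []
```

Here the word is chosen as the key because the job wants totals per word, and the value 1 is the piece of information the reducer will add up.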
3. Combiner:
Its an optional component but highly useful and provides extreme performance gain of
MapReduce job without any downside. Combiner is not applicable to all the MapReduce
algorithms but where ever it can be applied (for large datasets) it is always
recommended to use. It takes the intermediate keys from the mapper and applies a
user-provided method to aggregate values in a small scope of that one mapper. e.g
sending (hadoop, 3) requires fewer bytes than sending (hadoop, 1) three times over the
network.
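The (hadoop, 3) example can be reproduced with a small Python sketch of a summing combiner. The function combine is an illustrative name; it simply adds up the counts for each key within one mapper's output.

```python
from collections import defaultdict


def combine(pairs):
    """Sum the counts for each key within a single mapper's output."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())


mapper_output = [("hadoop", 1), ("map", 1), ("hadoop", 1), ("hadoop", 1)]
print(combine(mapper_output))
# [('hadoop', 3), ('map', 1)]
```

Four pairs shrink to two before anything leaves the machine. Summing works as a combiner because partial sums can themselves be summed later; an operation such as a plain average would not give the right answer if applied this way.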
4. Partitioner:
The partitioner takes the intermediate key/value pairs from the mapper and splits them into
shards, one shard per reducer. By default the shard is chosen from a hash of the key, which
spreads the keyspace roughly evenly over the reducers while still ensuring that identical
keys emitted by different mappers end up at the same reducer. The partitioned data is
written to the local filesystem of each map task and waits to be pulled by its respective
reducer.
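Hash partitioning can be sketched in Python as follows. This is an illustration under stated assumptions: zlib.crc32 stands in for the key's hash function so that the result is stable between runs, and the names partition and partition_pairs are hypothetical.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to a reducer index using a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_pairs(pairs, num_reducers):
    """Split intermediate pairs into one shard (list) per reducer."""
    shards = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        shards[partition(key, num_reducers)].append((key, value))
    return shards


shards = partition_pairs([("hadoop", 3), ("map", 1), ("hadoop", 2)], num_reducers=2)
for index, shard in enumerate(shards):
    print(index, shard)
```

Because the shard depends only on the key and the number of reducers, every mapper sends a given key to the same reducer without any coordination between mappers.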

At Reducer Side:
1. Shuffle and Sort:
The reduce task starts with the shuffle and sort step. This step takes the shards
written for this reducer by all of the map tasks and downloads them to the local machine on
which the reducer is running. These individual pieces of data are then sorted by key into
one larger data list. The purpose of this sort is to group equivalent keys together so that
their values can be iterated over easily in the reduce task.
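The merge-and-group behavior of this step can be sketched with the Python standard library. The function shuffle_and_sort is an illustrative name, and in-memory lists stand in for the files fetched from each map task.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(mapper_outputs):
    """Merge the pieces fetched from each mapper and group values by key."""
    # Sort each fetched piece by key, then merge the sorted runs into one stream.
    runs = [sorted(run, key=itemgetter(0)) for run in mapper_outputs]
    merged = heapq.merge(*runs, key=itemgetter(0))
    # Equal keys are now adjacent, so they can be grouped in a single pass.
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, [value for _, value in group]


fetched = [
    [("hadoop", 3), ("map", 1)],     # pulled from mapper 1
    [("reduce", 1), ("hadoop", 2)],  # pulled from mapper 2
]
print(list(shuffle_and_sort(fetched)))
# [('hadoop', [3, 2]), ('map', [1]), ('reduce', [1])]
```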

2. Reduce:
The reducer takes the grouped data as input and runs a reduce function once per key
grouping. The function is passed the key and an iterator over all the values associated
with that key. A wide range of processing can happen in this function: the data can be
aggregated, filtered, and combined in a number of ways. Once it is done, it sends zero
or more key/value pairs to the final step, the output format.
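Continuing the word-count illustration, the Python sketch below calls a reduce function once per key grouping, passing it the key and an iterator over the values. Both reduce_word_count and run_reducer are hypothetical names chosen for this example.

```python
def reduce_word_count(key, values):
    """Emit one (word, total) pair for a key grouping."""
    yield key, sum(values)


def run_reducer(grouped, reduce_fn):
    """Call reduce_fn once per key grouping and collect everything it emits."""
    output = []
    for key, values in grouped:
        output.extend(reduce_fn(key, iter(values)))
    return output


grouped = [("hadoop", [3, 2]), ("map", [1]), ("reduce", [1])]
print(run_reducer(grouped, reduce_word_count))
# [('hadoop', 5), ('map', 1), ('reduce', 1)]
```

A reduce function that filters would simply yield nothing for the keys it wants to drop.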

3. Output Format:
The output format takes the final key/value pairs from the reduce function and writes them
out to a file using a record writer. By default, it separates the key and value with a tab
and separates records with a newline character. Future articles will discuss
how to write your own customized output format.
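The default tab-and-newline layout is easy to picture with a short Python sketch. The function write_records is an illustrative stand-in for a record writer, and an in-memory text buffer takes the place of the output file.

```python
import io


def write_records(pairs, stream, separator="\t"):
    """Write each key/value pair as one line: key, separator, value."""
    for key, value in pairs:
        stream.write(f"{key}{separator}{value}\n")


out = io.StringIO()
write_records([("hadoop", 5), ("map", 1)], out)
print(out.getvalue(), end="")
# hadoop	5
# map	1
```

Passing a different separator, such as a comma, gives a rough idea of what a customized output format might change.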

Note:
Anyone interested in using Hadoop can read detailed instructions on how to collect and
process data and build applications; how to explore, query and deliver insights out of big
data; as well as how to provision, manage and monitor Hadoop.
Practical:
http://hortonworks.com/get-started/develop/
http://hortonworks.com/hadoop-tutorial/how-to-visualize-website-clickstream-data/
http://hortonworks.com/get-started/analyze/
http://hortonworks.com/labs/security/
http://hortonworks.com/get-started/