ECS765P_W2_The MapReduce Programming Model
● Introduction to MapReduce
● MapReduce Implementation
● MapReduce Programming patterns
● Aggregate computations
Our first parallel program (Reminder from last week)
● Task: count the number of occurrences of each word in one document
● Input: text document
● Output: sequence of: word, count
For example:
The 56
School 23
Queen 10
If we have several nodes, a simple way to parallelise the solution would be to:
● Assign a partial fragment of the input text to each node
● Each node solves the Word-Count problem for its fragment of text
● The results from each node are merged together (a minimal single-machine sketch of this strategy follows)
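Before introducing MapReduce, here is a minimal single-machine sketch of this divide-and-merge strategy; the use of Python's multiprocessing and the fragment size are illustrative assumptions, not part of any framework:

from collections import Counter
from multiprocessing import Pool

def count_words(fragment):
    # Solve Word Count locally on one fragment of the text
    return Counter(fragment.split())

def parallel_word_count(text, n_workers=4):
    words = text.split()
    # Assign a roughly equal fragment of the input to each worker
    chunk = max(1, len(words) // n_workers)
    fragments = [" ".join(words[i:i + chunk]) for i in range(0, len(words), chunk)]
    with Pool(n_workers) as pool:
        partial_counts = pool.map(count_words, fragments)
    # Merge the partial results from each worker into one word -> count mapping
    return sum(partial_counts, Counter())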
Sounds easy, but how can we instruct a cluster of nodes to execute this simple job?
MapReduce
“A simple and powerful interface that enables automatic parallelization and distribution of large-scale
computations, combined with an implementation of this interface that achieves high performance on
large clusters of commodity PCs.”
(Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Google Inc, 2004.)
Map Reduce
Example wordcount (pythonish pseudocode)
def mapper(_, text):
    # Emit a (word, 1) pair for every word occurrence in the input text
    words = text.split()
    for word in words:
        emit(word, 1)
...
[Diagram: map tasks running in parallel, each reading its fragment from data store 1 … data store n]
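The reduce side of word count is elided above; a minimal sketch in the same pythonish pseudocode, assuming the framework groups all values for a word before calling the reducer and provides emit:

def reducer(word, counts):
    # counts is the list of 1s emitted by the mappers for this word
    emit(word, sum(counts))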
● Introduction to MapReduce
● MapReduce Implementation
● MapReduce programming patterns
● Aggregate computations
MapReduce: Implementation
The MapReduce programming model defines two main functions, namely Map and Reduce. Both
functions encapsulate the logic of the solution and can be parallelised by allocating them to different
nodes.
However, for this strategy to work, we need not only to express the logic of the solution as map and reduce functions, but also to partition the input data, and to partition and move the key/value pairs from the mappers to the reducers.
MapReduce specifies how this is implemented. When creating a MapReduce job, we do not need to
worry about implementing this: the MapReduce framework will do it for us.
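Although the framework takes care of this for us, the partitioning step is conceptually just a function that assigns every key to one reducer; a minimal sketch (hash-modulo is a common default, but the exact scheme is an implementation detail):

def partition(key, num_reducers):
    # All key/value pairs with the same key are sent to the same reducer
    return hash(key) % num_reducers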
Word Count Example
Partition input data
Partition and move key/value pairs
Reducers then
● Obtain their partitions from each mapper (shuffling)
● Partitions from different mappers are grouped by key (sort). A group of key/value pairs with the same
key is expressed as a single key/value pair, where the new value is a list of all the input values.
● The reduce function is then called on each of these key/value pairs (see the sketch below)
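A rough sketch of this sort-and-group step, assuming the shuffled (key, value) pairs have already arrived at one reducer as a plain list:

from itertools import groupby
from operator import itemgetter

def sort_and_group(pairs):
    # Sort by key, then collapse all values for the same key into one list
    pairs.sort(key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [value for _, value in group]

# e.g. [("the", 1), ("queen", 1), ("the", 1)] -> ("queen", [1]), ("the", [1, 1])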
Shuffle and Sort
Shuffle and Sort – on each Mapper
The combiner is an optional additional step that can be executed on each mapper before the reduce phase in order to reduce the communication volume.
The Combiner
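For word count, a combiner can pre-aggregate the counts produced on one mapper before they are shuffled; a minimal sketch in the same pseudocode conventions as the mapper above:

def combiner(word, counts):
    # Runs on the mapper node: collapse the local 1s into a single partial count,
    # so far fewer key/value pairs have to travel across the network
    emit(word, sum(counts))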
● Introduction to MapReduce
● MapReduce Implementation
● MapReduce programming patterns
● Aggregate computations
MapReduce is not a tool but a framework: any solution has to be expressed using map and reduce functions. It turns out that some problems are amenable to the MapReduce model, while others are not.
A family of problems whose solutions follow the same implementation strategy in the MapReduce framework is called a MapReduce programming pattern.
Pattern 1: Inverted Index
Goal: Generate an index from a dataset to allow faster searches for specific features
Examples:
● Building an index from a textbook
● Finding all websites that match a search term
Inverted Index Structure
Inverted Index Pattern
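A minimal sketch of the inverted-index pattern, assuming each mapper call receives a document identifier together with its text (this record layout is an assumption, not part of the original slides):

def mapper(doc_id, text):
    # Emit one (word, document) pair per distinct word in the document
    for word in set(text.split()):
        emit(word, doc_id)

def reducer(word, doc_ids):
    # The index entry: every document in which this word appears
    emit(word, sorted(set(doc_ids)))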
Pattern 2: Filtering
Examples:
● Distributed grep (text pattern match; see the sketch after this list)
● Tracking a thread of events (logs from the same user)
● Data cleansing
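A minimal sketch of this filtering style, using distributed grep as the example; the pattern being searched for is an illustrative assumption, and the reducer is a pure pass-through because filtering needs no aggregation:

import re

PATTERN = re.compile(r"ERROR")  # illustrative pattern, not from the original slides

def mapper(line_no, line):
    # Keep only the records that match; everything else is dropped here
    if PATTERN.search(line):
        emit(line_no, line)

def reducer(line_no, lines):
    # Identity reduce: simply forward the surviving records
    for line in lines:
        emit(line_no, line)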
Pattern 3: Top Ten
Examples:
● Build top sellers view
● Find outliers in data
In this pattern only ONE reducer is used. Hence, there is no need to perform partitioning and shuffling; sorting becomes part of the reduce function.
Top-Ten structure
Top-Ten Pattern
Minimum requirement: the ranking data for a whole input split must fit into the memory of a single reducer
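A minimal top-ten sketch under these assumptions: each mapper keeps only its local top ten (so the single reducer receives at most ten candidates per mapper), and the records are (item, sales) pairs, which is an illustrative layout:

import heapq

def mapper(_, records):
    # records: the (item, sales) pairs in this mapper's input split
    for item, sales in heapq.nlargest(10, records, key=lambda r: r[1]):
        emit(None, (item, sales))  # single key, so everything reaches one reducer

def reducer(_, candidates):
    # At most ten candidates arrive per mapper; rank them once and keep the top ten
    for item, sales in heapq.nlargest(10, candidates, key=lambda c: c[1]):
        emit(item, sales)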
Performance Considerations
● How many Mapper and Reducer instances?
● What are the performance issues of using too many?
● What happens if we don’t use Combiners?
Performance depends greatly on the number of elements and, to a lesser extent, on the size of the data → for instance, the data could already have been filtered by the ingestion stage
Contents
● Introduction to MapReduce
● MapReduce Implementation
● MapReduce Programming patterns
● Aggregate computations
Numerical Summarisation
Goal: Calculate aggregate statistical values over a dataset
Extract features from the dataset elements and compute the same statistical function for each feature
Examples:
● Count occurrences
● Maximum / minimum values
● Average / median / standard deviation
Sample numerical summarisation questions
● Compute the maximum PM2.5 registered for each location provided in the dataset
● Return the average Air Quality Index (AQI) registered each week
● Compute for each day of the week the number of locations where the PM2.5 index exceeded 150
Numerical Summarisation Structure
Numerical Summarisation Map and Reduce functions
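As an illustration of the first sample question above (maximum PM2.5 per location), a minimal map/reduce sketch; the (location, pm25) record layout is an assumption about the dataset:

def mapper(_, record):
    # Each record is assumed to carry a location and a PM2.5 reading
    location, pm25 = record
    emit(location, pm25)

def reducer(location, readings):
    # One statistic per feature: the maximum reading seen at this location
    emit(location, max(readings))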
Numerical Summarisation Combiner?
A combiner can only be applied when the aggregation function is associative: it works for SUM, MUL or COUNT (associative), but not for SUB or AVERAGE (not associative)
https://en.wikipedia.org/wiki/Associative_property
Computing Averages
Combining Average
Average is NOT an associative operation
● It cannot be partially computed by the Combiners
Solution: Change Mapper results to be associative (e.g., sum and count)
Emit the aggregated quantities and the number of elements
● Mapper: for mark values (100, 100, 20),
  emits (100,1), (100,1), (20,1)
● Combiner: adds the values and the element counts,
  emits (220, 3)
● Reducer: adds the (sum, count) pairs from different mappers and computes the average (sketched below)
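Putting this together, a minimal averaging sketch in the same pseudocode style; the mapper emits (value, count) pairs so that the combiner's partial aggregation stays correct:

def mapper(_, mark):
    # Emit the value together with an element count of 1
    emit("marks", (mark, 1))

def combiner(key, pairs):
    # Partial, associative aggregation on the mapper node:
    # (100,1), (100,1), (20,1) -> (220, 3)
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    emit(key, (total, count))

def reducer(key, pairs):
    # Merge the partial (sum, count) pairs from all mappers, then divide once
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    emit(key, total / count)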
Contents
● Introduction to MapReduce
● MapReduce Implementation
● MapReduce Programming patterns
● Aggregate computations