Hadoop MapReduce Join & Counter With Example
In a distributed MapReduce join, either the Mapper or the Reducer uses the smaller dataset to
perform a lookup for matching records from the larger dataset and then combines those records
to form the output records.
Types of Join
Depending upon the place where the actual join is performed, joins in Hadoop are classified
into:
1. Map-side join - When the join is performed by the mapper, it is called a map-side join. In this
type, the join is performed before the data is actually consumed by the map function. It is
mandatory that the input to each map is in the form of a partition and is in sorted order. There
must also be an equal number of partitions, and each must be sorted by the join key.
2. Reduce-side join - When the join is performed by the reducer, it is called a reduce-side join.
This join does not require the datasets to be in a structured (or partitioned) form.
Here, map-side processing emits the join key and the corresponding tuples of both tables. As a
result of this processing, all tuples with the same join key fall into the same reducer, which then
joins the records sharing that key.
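The reduce-side join described above can be sketched in plain Java without a Hadoop dependency. This is a minimal simulation, assuming each input file holds comma-separated dept_id,value lines (the class name and record layout are illustrative, not taken from the tutorial): the "map phase" tags each record with its source file, the shuffle groups tagged records by join key, and the "reduce phase" combines the tuples that share a key.

```java
import java.util.*;

public class ReduceSideJoinSketch {
    // Map phase: emit (joinKey, taggedRecord) for every line of both files.
    // "N:" marks a record from the names file, "S:" one from the strengths file.
    // The TreeMap stands in for the shuffle, which groups values by key.
    static Map<String, List<String>> mapPhase(List<String> deptNames,
                                              List<String> deptStrengths) {
        Map<String, List<String>> shuffled = new TreeMap<>();
        for (String line : deptNames) {
            String[] parts = line.split(",");   // e.g. "001,HR"
            shuffled.computeIfAbsent(parts[0], k -> new ArrayList<>()).add("N:" + parts[1]);
        }
        for (String line : deptStrengths) {
            String[] parts = line.split(",");   // e.g. "001,25"
            shuffled.computeIfAbsent(parts[0], k -> new ArrayList<>()).add("S:" + parts[1]);
        }
        return shuffled;
    }

    // Reduce phase: for each join key, combine the name and strength tuples
    // into one joined output record.
    static List<String> reducePhase(Map<String, List<String>> shuffled) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffled.entrySet()) {
            String name = null, strength = null;
            for (String tagged : e.getValue()) {
                if (tagged.startsWith("N:")) name = tagged.substring(2);
                else strength = tagged.substring(2);
            }
            if (name != null && strength != null)
                joined.add(e.getKey() + "," + name + "," + strength);
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("001,HR", "002,Sales");
        List<String> strengths = Arrays.asList("001,25", "002,40");
        System.out.println(reducePhase(mapPhase(names, strengths)));
        // prints [001,HR,25, 002,Sales,40]
    }
}
```

Note that no sorting or partitioning of the inputs is required here; grouping by key happens in the shuffle, which is exactly why the reduce-side join tolerates unstructured input.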
Ensure you have Hadoop installed. Before you start with the actual MapReduce Join example
process, switch to the 'hduser_' account (the user id used during your Hadoop
configuration):
su - hduser_
cd MapReduceJoin/
Step 4) Start Hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
Step 5) DeptStrength.txt and DeptName.txt are the input files used for this MapReduce Join
example program.
Step 7) After execution, the output file (named 'part-r-00000') will be stored in the directory
/output_mapreducejoin on HDFS.
Open part-r-00000
NOTE: Before running this program again, you will need to delete the output directory
/output_mapreducejoin (for example, with hdfs dfs -rm -r /output_mapreducejoin).
Hadoop Counters are similar to putting a log message in the code for a map or reduce task. This
information can be useful for diagnosing problems in MapReduce job processing.
Typically, these counters are defined in a program (map or reduce) and incremented during
execution when a particular event or condition (specific to that counter) occurs. A very good
application of Hadoop counters is to track valid and invalid records in an input dataset.
1. Hadoop Built-In counters: There are some built-in Hadoop counters which exist per job.
Below are the built-in counter groups-
MapReduce Task Counters - Collect task-specific information (e.g., number of input
records) during execution.
FileSystem Counters - Collect information such as the number of bytes read or written by a
task.
FileInputFormat Counters - Collect information on the number of bytes read through
FileInputFormat.
FileOutputFormat Counters - Collect information on the number of bytes written
through FileOutputFormat.
Job Counters - These counters are used by the JobTracker. Statistics collected by them
include, e.g., the number of tasks launched for a job.
2. User Defined Counters
In addition to the built-in counters, a user can define their own counters using similar facilities
provided by programming languages (/best-programming-language.html). For example, in
Java (/java-tutorial.html), 'enum's are used to define user-defined counters.
Counters Example
An example MapClass with Counters to count the number of missing and invalid values. The
input data set used in this tutorial is a CSV file, SalesJan2009.csv
(https://drive.google.com/uc?export=download&id=0B_vqvT0ovzHccGJ1VjVic1AwbGc)
public static class MapClass
        extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    static enum SalesCounters { MISSING, INVALID };

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        // Split the CSV record and pull out the fields to validate
        // (column positions assumed from the SalesJan2009.csv layout)
        String[] fields = value.toString().split(",", -1);
        String country = fields[7];
        String sales = fields[2];

        if (country.length() == 0) {
            // Country value is missing from this record
            reporter.incrCounter(SalesCounters.MISSING, 1);
        } else if (sales.startsWith("\"")) {
            // A quoted sales value marks the record as invalid
            reporter.incrCounter(SalesCounters.INVALID, 1);
        } else {
            output.collect(new Text(country), new Text(sales + ",1"));
        }
    }
}
The above code snippet shows an example implementation of counters in Hadoop MapReduce.
Here, if the 'country' field has zero length, its value is missing, and the
corresponding counter SalesCounters.MISSING is incremented.
Next, if the 'sales' field starts with a ", the record is considered INVALID. This is indicated by
incrementing the counter SalesCounters.INVALID.
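The same MISSING/INVALID classification can be simulated outside Hadoop with a plain Java enum and an EnumMap. This is an illustrative sketch, not Hadoop's Reporter API: in a real job each task's incrCounter calls are aggregated by the framework into job-level totals, whereas here the tallies are kept locally. The class name and the two-field record layout are assumptions.

```java
import java.util.*;

public class SalesCounterSketch {
    enum SalesCounters { MISSING, INVALID, VALID }

    // Apply the same validity rules as the MapClass above, tallying
    // counters locally in an EnumMap instead of via a Reporter.
    static EnumMap<SalesCounters, Long> classify(List<String[]> records) {
        EnumMap<SalesCounters, Long> counters = new EnumMap<>(SalesCounters.class);
        for (SalesCounters c : SalesCounters.values()) counters.put(c, 0L);
        for (String[] record : records) {
            String country = record[0], sales = record[1];
            if (country.length() == 0) {
                counters.merge(SalesCounters.MISSING, 1L, Long::sum);   // no country
            } else if (sales.startsWith("\"")) {
                counters.merge(SalesCounters.INVALID, 1L, Long::sum);   // quoted sales
            } else {
                counters.merge(SalesCounters.VALID, 1L, Long::sum);
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
            new String[]{"", "1200"},        // missing country
            new String[]{"UK", "\"1200\""},  // quoted (invalid) sales
            new String[]{"US", "1200"});     // valid record
        System.out.println(classify(records));
        // prints {MISSING=1, INVALID=1, VALID=1}
    }
}
```

After a real job finishes, these totals would appear in the job's counter report alongside the built-in counter groups, which is what makes counters useful for spotting bad input at a glance.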