
Programming Hadoop Map-Reduce
Programming, Tuning & Debugging

Arun C Murthy
Yahoo! CCDI
acm@yahoo-inc.com
ApacheCon US 2008
Existential angst: Who am I?

• Yahoo!
– Grid Team (CCDI)

• Apache Hadoop
– Developer since April 2006
– Core Committer (Map-Reduce)
– Member of the Hadoop PMC
Hadoop - Overview

• Hadoop includes:
– Distributed File System - distributes data
– Map/Reduce - distributes the application
• Open source from Apache
• Written in Java
• Runs on
– Linux, Mac OS/X, Windows, and Solaris
– Commodity hardware
Distributed File System

• Designed to store large files


• Stores files as large blocks (64 to 128 MB)
• Each block stored on multiple servers
• Data is automatically re-replicated as needed
• Accessed from command line, Java API, or C API
– bin/hadoop fs -put my-file hdfs://node1:50070/foo/bar
– Path p = new Path("hdfs://node1:50070/foo/bar");
  FileSystem fs = p.getFileSystem(conf);
  DataOutputStream file = fs.create(p);
  file.writeUTF("hello\n");
  file.close();
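– For symmetry, a minimal sketch (not from the original deck) of reading the same
  file back through the Java API; conf and the node1:50070 URI are simply carried
  over from the write snippet above:
  Path p = new Path("hdfs://node1:50070/foo/bar");
  FileSystem fs = p.getFileSystem(conf);
  DataInputStream in = fs.open(p);   // FSDataInputStream extends DataInputStream
  String greeting = in.readUTF();    // matches the writeUTF() call above
  in.close();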
Map-Reduce

• Map-Reduce is a programming model for efficient distributed computing
• It works like a Unix pipeline:
  – cat input | grep | sort | uniq -c | cat > output
  – Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
– Streaming through data, reducing seeks
– Pipelining
• A good fit for a lot of applications
– Log processing
– Web index building
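• As an illustration (not from this excerpt of the deck), the pipeline stages map
  onto a job driver in the 0.18-era org.apache.hadoop.mapred API; the WordCount
  classes used here are sketched under "Mappers and Reducers" below, and the
  "in"/"out" paths are placeholders:
  JobConf job = new JobConf(WordCount.class);
  job.setJobName("wordcount");
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(WordCount.Map.class);                // Map
  job.setReducerClass(WordCount.Reduce.class);            // Reduce
  FileInputFormat.setInputPaths(job, new Path("in"));     // Input
  FileOutputFormat.setOutputPath(job, new Path("out"));   // Output
  JobClient.runJob(job);  // the framework handles Shuffle & Sort between Map and Reduce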
Map/Reduce features

• Fine grained Map and Reduce tasks


– Improved load balancing
– Faster recovery from failed tasks

• Automatic re-execution on failure


– In a large cluster, some nodes are always slow or flaky
– These nodes introduce long tails or failures in the computation
– Framework re-executes failed tasks
• Locality optimizations
– With big data, bandwidth to data is a problem
– Map-Reduce + HDFS is a very effective solution
– Map-Reduce queries HDFS for locations of input data
– Map tasks are scheduled local to the inputs when possible
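• Illustrative sketch (not from the deck): re-execution and speculative execution
  are framework defaults, tunable per job through 0.18-era configuration
  properties; MyJob is a placeholder class name:
  JobConf job = new JobConf(MyJob.class);
  job.setBoolean("mapred.map.tasks.speculative.execution", true);    // re-run slow map tasks
  job.setBoolean("mapred.reduce.tasks.speculative.execution", true); // re-run slow reduce tasks
  job.setInt("mapred.map.max.attempts", 4);  // retries before a failed map task fails the job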
Mappers and Reducers

• Every Map/Reduce program must specify a Mapper and typically a Reducer
• The Mapper has a map method that transforms input
(key, value) pairs into any number of intermediate
(key’, value’) pairs
• The Reducer has a reduce method that transforms
intermediate (key’, value’*) aggregates into any number
of output (key’’, value’’) pairs
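
• A minimal word-count sketch of these interfaces (not part of this excerpt of the
  deck), written against the 0.18-era org.apache.hadoop.mapred API:
  import java.io.IOException;
  import java.util.Iterator;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.*;
  import org.apache.hadoop.mapred.*;

  public class WordCount {
    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private final Text word = new Text();
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        // emit an intermediate (word, 1) pair for every token in the input line
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }
    }
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        // sum the counts for each word and emit the (word, total) output pair
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }
  }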
