Friday, 12 April 2013

Programming model | Hadoop Tutorial pdf


The computation takes a set of input key/value pairs and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges these values together to form a possibly smaller set of values; typically just zero or one output value is produced per Reduce invocation. The intermediate values are supplied to the user's Reduce function via an iterator, which allows the library to handle lists of values that are too large to fit in memory.
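The flow described above can be sketched in a few lines of plain Python. This is a minimal in-memory simulation of the model (not the actual MapReduce library): a map phase, a group-by-intermediate-key step, and a reduce phase that receives its values through an iterator.

```python
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    """Simulate MapReduce: map each input pair, group intermediate
    values by key, then reduce each group to an output value.

    inputs: iterable of (key, value) pairs.
    map_fn(key, value): returns an iterable of (out_key, out_value) pairs.
    reduce_fn(key, values_iterator): returns one output value.
    """
    # Map phase: produce intermediate key/value pairs.
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)

    # Reduce phase: values are handed over via an iterator,
    # mirroring how the library supplies them to the user's Reduce.
    return {key: reduce_fn(key, iter(values))
            for key, values in groups.items()}
```

For instance, an identity mapper combined with a summing reducer turns `[("a", 1), ("a", 2), ("b", 3)]` into `{"a": 3, "b": 3}`.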

MAP

map (in_key, in_value) -> (out_key, intermediate_value) list

Example: Upper-case Mapper
let map(k, v) = emit(k.toUpper(), v.toUpper())
("foo", "bar") --> ("FOO", "BAR")
("Foo", "other") --> ("FOO", "OTHER")
("key2", "data") --> ("KEY2", "DATA")
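The upper-case mapper above is trivial to express as real code. A Python sketch (using a generator to stand in for `emit`):

```python
def uppercase_map(key, value):
    # Emit one intermediate pair with both key and value upper-cased.
    yield (key.upper(), value.upper())
```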

REDUCE

reduce (out_key, intermediate_value list) -> out_value list

Example: Sum Reducer
let reduce(k, vals):
    sum = 0
    foreach int v in vals:
        sum += v
    emit(k, sum)

("A", [42, 100, 312]) --> ("A", 454)
("B", [12, 6, -2]) --> ("B", 16)
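The sum reducer translates directly into Python. Note that the values arrive as an iterator, not a list, matching the programming model's contract:

```python
def sum_reduce(key, values):
    # Sum all intermediate values for this key and
    # emit a single (key, total) output pair.
    total = 0
    for v in values:
        total += v
    return (key, total)
```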
Example 2: Word Count
Counting the number of occurrences of each word in a large collection of documents. The user would write code similar to the following pseudo-code:
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
The map function emits each word plus an associated count of occurrences (just '1' in this simple example). The reduce function sums together all counts emitted for a particular word.
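The word-count example can be run end to end with the same in-memory shuffle idea. This Python sketch assumes simple whitespace tokenization, which is a simplification of real document parsing:

```python
from collections import defaultdict

def word_count(documents):
    """documents: dict mapping document name -> contents.
    Returns a dict mapping word -> total occurrence count."""
    # Map phase: emit (word, 1) for every word occurrence.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for word in contents.split():
            intermediate[word].append(1)

    # Reduce phase: sum all the counts emitted for each word.
    return {word: sum(counts) for word, counts in intermediate.items()}
```

For example, `word_count({"doc1": "the cat sat", "doc2": "the dog"})` yields a count of 2 for "the" and 1 for each of the other words.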
In addition, the user writes code to fill in a mapreduce specification object with the names of the input and output files, and optional tuning parameters. The user then invokes the MapReduce function, passing it the specification object. The user's code is linked together with the MapReduce library (implemented in C++).
Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.
This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code. As a reaction to this complexity, Google designed a new abstraction that expresses the simple computations being performed while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a library.
