Friday, July 18, 2014

Sizing of Name Node RAM and Physical Memory for Data Nodes

Recently, while working with a client, I was asked to advise on the RAM requirement for the Name Node and the physical storage capacity for the Data Nodes. This is a question I am asked repeatedly. To settle it once and for all, I would like to formalize the answer as a mathematical formula, so that the ambiguity is taken out of the answer.

The accompanying Excel file is available on Scribd.

Hadoop NameNode RAM and Physical Memory for DataNodes Sizing

Thursday, July 17, 2014

MapReduce – The Model

In the map-reduce programming model, work is divided into two phases: a map phase and a reduce phase. Both of these phases work on key-value pairs. What these pairs contain is completely up to you: they could be URLs paired with counts of how many pages link to them, or movie IDs paired with ratings. It all depends on how you write and set up your map-reduce job.
A map-reduce program typically acts something like this:
  1. Input data, such as a long text file, is split into key-value pairs. These key-value pairs are then fed to your mapper. (This is the job of the map-reduce framework.)
  2. Your mapper processes each key-value pair individually and outputs one or more intermediate key-value pairs.
  3. All intermediate key-value pairs are collected, sorted, and grouped by key (again, the responsibility of the framework).
  4. For each unique key, your reducer receives the key with a list of all the values associated with it. The reducer aggregates these values in some way (adding them up, taking averages, finding the maximum, etc.) and outputs one or more output key-value pairs.
  5. Output pairs are collected and stored in an output file (by the framework).
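The five steps can be sketched in a few lines of plain Python, using word counting as the example problem. This is only a single-process illustration of the model, not how Hadoop or any other framework implements it; the function names (split_input, mapper, shuffle, reducer, run_job) and the sample text are made up for this sketch.

```python
from collections import defaultdict


def split_input(text):
    """Step 1 (framework): turn raw text into (line_number, line) pairs."""
    return list(enumerate(text.splitlines()))


def mapper(key, value):
    """Step 2 (your code): emit an intermediate (word, 1) pair per word."""
    for word in value.lower().split():
        yield word, 1


def shuffle(pairs):
    """Step 3 (framework): group all values by key and sort by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Step 4 (your code): aggregate the list of values for one key."""
    yield key, sum(values)


def run_job(text):
    """Step 5 (framework): collect the reducer output into one list."""
    intermediate = [
        pair
        for key, value in split_input(text)
        for pair in mapper(key, value)
    ]
    return [
        out
        for key, values in shuffle(intermediate)
        for out in reducer(key, values)
    ]


# Hypothetical sample input, three lines of text.
result = run_job("the quick brown fox\nthe lazy dog\nthe end")
for word, count in result:
    print(word, count)
```

Only mapper and reducer are specific to word counting. Swapping in a different pair of functions (for example, emitting movie IDs with ratings and averaging them) would reuse the rest unchanged, which is the division of labour between your code and the framework described above.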

What makes this model so good for parallel programming should be apparent from the figure above: each key-value pair can be mapped or reduced independently. This means that many different processors, or even machines, can each take a section of the data and process it separately—a classic example of data parallelism. The only real step where synchronization is needed is during the collecting and sorting phase, which can be handled by the framework (and, when done carefully, even this can be parallelized).
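A small sketch can make the independence claim concrete: if the input is cut into chunks and each chunk is mapped by a separate worker, the merged result is the same as when a single worker handles everything. The code below assumes word counting as the task and uses Python threads purely as a stand-in for separate processors or machines (threads in CPython will not actually speed up this workload). The names map_chunk, merge and count_words, and the sample lines, are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Count the words in one chunk; needs no data from any other chunk."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge(partials):
    """The one synchronization point: combine the per-chunk counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_words(lines, workers=1, chunk_size=2):
    """Split the input into chunks, map each chunk on a worker, then merge."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return merge(partials)


# Hypothetical sample input.
lines = [
    "the quick brown fox",
    "jumps over the lazy dog",
    "the dog barks",
    "the fox runs away",
    "the end",
]

sequential = count_words(lines, workers=1)
parallel = count_words(lines, workers=4)
print(sequential == parallel)
```

Because no chunk depends on another, the number of workers and the chunk size affect only how the work is scheduled, not the answer.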
So, when you can fit a problem into this model, it can make parallelization very easy. What may seem less obvious is how a problem can be solved with this model in the first place.

Real life MapReduce

Tree of Maps: