MapReduce : It is a framework for processing data residing on HDFS. It distributes tasks (map/reduce) across several machines. A job typically consists of 5 phases:
• Map
• Partitioning
• Sorting
• Shuffling
• Reduce
MapReduce Terminology :
• What is a job? The complete execution of mappers and reducers over the entire data set.
• What is a task? A single unit of execution (a map or reduce task): the execution of a mapper or reducer over a portion of data, typically one block.
• What is a task attempt? An instance of an attempt to execute a task (map or reduce). If a task fails while working on a particular portion of data, another attempt is run on that portion of data; the framework makes sure that at least one attempt of the task is run on a different machine. If a task fails 4 times, the task is marked as failed and the entire job fails.
• How many attempts can run on a portion of data? A maximum of 4 by default. If "speculative execution" is ON, additional attempts may run.
• What is a "failed task"? A task can fail due to an exception, machine failure, etc. A failed task is re-attempted (up to 4 times).
• What is a "killed task"? If a task fails 4 times, the task is killed and the entire job fails. A task that runs as part of speculative execution and is no longer needed is also marked as killed.
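As a small sketch of how these limits could be tuned from the driver (assuming the newer mapreduce.* property names; older releases use the mapred.* equivalents, and none of this code is from the original post):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AttemptConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Allow each failed map/reduce task to be re-attempted up to 4 times (the default).
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        // Turn speculative execution off so no extra "backup" attempts are launched.
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "attempt-config-demo");
        // ... set mapper, reducer, input/output paths as usual ...
    }
}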
Input Split :
A portion or chunk of data on which a mapper operates; typically it is equal to one block of data (dfs.block.size).
Each mapper works on exactly one input split.
Consider a block size of 64 MB and 1 GB of data: the number of splits is 16. If the block size stays the same but the split size is 128 MB, the number of splits is 8, i.e. 2 HDFS blocks form one split and one mapper runs on 128 MB of data.
How to control the input split size? Generally the input split is equal to the block size (64 MB), but you may want a mapper to work on only 32 MB, or on 128 MB, of data.
The split size can be controlled with 3 properties:
• mapred.min.split.size (default 1)
• mapred.max.split.size (default Long.MAX_VALUE)
• dfs.block.size (default 64 MB)
Simple formula for the split size: max(minSplitSize, min(maxSplitSize, blockSize)).
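As a sketch, assuming the new-API FileInputFormat helpers, the min/max split sizes can be set from the driver; the framework then picks the split size using the formula above:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();

        // splitSize = max(minSplitSize, min(maxSplitSize, blockSize))
        // With a 64 MB block, lowering maxSplitSize to 32 MB gives 32 MB splits;
        // raising minSplitSize to 128 MB would give 128 MB splits (2 blocks per split).
        FileInputFormat.setMinInputSplitSize(job, 1L);                  // min split size
        FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);   // max split size = 32 MB
    }
}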
We know a task can be a map task or a reduce task.
A map task executes on an input split, and
a reduce task executes on the intermediate output generated by the map tasks, as shown in the figure below.
Mapper :
1. The mapper is the first phase of a MapReduce job.
2. A mapper works on 1 split of data, typically 1 block. (By default 1 split is 1 block; if your split size is not the same as your block size, extra overhead is added to your MapReduce processing to divide the data stored in HDFS (internally stored as blocks) into splits. This overhead disappears when the split equals the block. You might then ask why we have the choice of adjusting the split size and how the split size affects M/R performance; I will cover this topic in a separate future blog.)
3. The MapReduce framework runs each map task as close to its data as possible, implementing data localization.
4. Several map tasks run in parallel on different machines, each working on a different portion of the data.
5. A mapper reads data from one split in the form of (in_key, in_value) pairs and emits output in the form of (out_key, out_value) pairs.
6. A mapper can emit any number of output (key, value) pairs, depending on your mapper logic: 0, 1, or n.
7. Mapper output is written to the local file system of the data node where the mapper runs, not to HDFS, because mapper output is temporary in nature and replication is not required.
How does a mapper read data from a split?
When writing a mapper function you do not have to deal with how data is read from the split. This is taken care of by a MapReduce framework class called RecordReader. You just tell the framework what your data looks like, whether it is TextInputFormat, SequenceFile input format, NLine input format, etc., in your driver code. The RecordReader is responsible for providing (in_key, in_value) pairs to your mapper.
8. The RecordReader calls the map function in your mapper for each record in your split.
a. An input split consists of records; for each record in the input split, the map function is called.
b. Each record is sent as a key-value pair to the map function.
c. When writing a map function, keep only one record in mind.
d. The mapper does not keep state about how many records it has processed or how many records will appear.
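To make this concrete, here is a minimal word-count-style mapper sketch (the class and variable names are illustrative, not taken from this post). The RecordReader hands map() one record at a time as a (byte offset, line of text) pair, and the mapper may emit 0, 1, or n (word, 1) pairs per record:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input:  (in_key = byte offset, in_value = one line of text) supplied by the RecordReader.
// Output: (out_key = word, out_value = 1) pairs; zero or more per input record.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Process only the single record handed to us; no state is kept across records.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}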
Reducer :
1. The reducer runs only after all map tasks have completed.
2. After the map phase, all the intermediate values for a given intermediate key are grouped together into a list.
3. The reducer operates on a key and its list of values.
4. When writing a reducer, keep ONLY one key and its list of values in mind.
5. Reduce operates on ONLY one key and its list of values at a time.
6. For better load balancing you should have more than one reducer; the user has control over the number of reducers.
7. The reducer does not work based on data localization: the intermediate data generated by your mappers is sent to wherever the reducer logic is running, i.e. the data is moved to where your logic is present.
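A matching reducer sketch (again with illustrative names, assuming the word-count mapper above): reduce() is called with one key and the grouped list (Iterable) of all values for that key, and emits the total:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input:  (word, [1, 1, 1, ...]) -- one key and the grouped list of its values.
// Output: (word, total count).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Only this one key and its value list are visible here.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        total.set(sum);
        context.write(key, total);
    }
}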
Now that we understand some basics of MapReduce, see my blog post MapReduce - Basic Programming for how to write a mapper, reducer, and driver class:
http://hadoopmapred.blogspot.in/2014/01/mapreduce-basic-programming.html
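For a rough idea of where everything is wired together, here is a minimal driver sketch (a typical new-API layout, not the exact code from the linked post): it declares the input format, the mapper and reducer classes, and the number of reducers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");
        job.setJarByClass(WordCountDriver.class);

        // Tell the framework how to read the input: here, plain lines of text.
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // More than one reducer for better load balancing.
        job.setNumReduceTasks(2);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}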