Thursday, September 24, 2020

Performance tuning through Hadoop MapReduce optimization

Performance tuning can help maximize the efficiency of a Hadoop cluster. This article on performance tuning with Hadoop MapReduce describes ways to boost the efficiency of your cluster and get the best results from your Hadoop programming. It covers essential concepts such as memory tuning, map disk spill, tuning mapper tasks, speculative execution, and other related techniques for tuning Hadoop MapReduce performance.



Hadoop MapReduce Performance Tuning 

Hadoop performance tuning helps you improve the efficiency of your cluster and get the best results when programming Hadoop in big data enterprises. To achieve this, repeat the process below until the desired output is attained in an optimal way.

Run job -> Identify bottleneck -> Address bottleneck.

The first step in Hadoop performance tuning is to run a Hadoop job, identify the bottlenecks, and address them using the methods below. Repeat this cycle until you reach the desired level of performance.

 Hadoop MapReduce Performance Tuning tools

Let us discuss ways to improve Hadoop MapReduce performance. They fall into two categories:

Performance tuning based on Hadoop run-time parameters.

Application-specific performance tuning for Hadoop.

On the basis of these two categories, let's discuss how to improve the performance of a Hadoop cluster.

Tuning Hadoop Run-time Parameters 

Hadoop offers several options for performance tuning across CPU, memory, disk, and network. Most Hadoop tasks are not CPU-bound, so tuning usually concentrates on memory usage and disk spills. Let's get into the specifics of tuning the Hadoop run-time parameters.

  1. Memory Tuning 

In Hadoop MapReduce performance tuning, the most basic and common rule for memory tuning is to use as much memory as possible without triggering swapping. The relevant parameter for task memory is mapred.child.java.opts, which can be set in your configuration file (newer releases split it into mapreduce.map.java.opts and mapreduce.reduce.java.opts).

You can also use Ganglia, Cloudera Manager, or Nagios to monitor memory usage on the servers and tune memory more effectively.
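As a rough illustration of the "as much memory as possible without swapping" rule, the sketch below divides a node's physical memory among its task slots and derives an -Xmx value that could go into mapred.child.java.opts. The node size, the memory reserved for the operating system and Hadoop daemons, and the slot count are all hypothetical numbers, not recommendations, and a JVM uses some memory beyond its heap, so a real setting would leave extra headroom.

```java
public class ChildHeapSizer {

    /**
     * Returns a per-task heap size in MB such that all task JVMs together
     * fit into the memory left after the reserved share is set aside.
     * All inputs are assumptions chosen by the operator.
     */
    static long heapPerTaskMb(long nodeMemoryMb, long reservedMb, int taskSlots) {
        if (taskSlots <= 0 || reservedMb >= nodeMemoryMb) {
            throw new IllegalArgumentException("no memory left for tasks");
        }
        return (nodeMemoryMb - reservedMb) / taskSlots;
    }

    /** Formats the heap size as a JVM option string. */
    static String childJavaOpts(long heapMb) {
        return "-Xmx" + heapMb + "m";
    }

    public static void main(String[] args) {
        // Hypothetical node: 32 GB RAM, 8 GB kept for OS and daemons, 12 task slots.
        long heapMb = heapPerTaskMb(32 * 1024, 8 * 1024, 12);
        System.out.println("mapred.child.java.opts = " + childJavaOpts(heapMb));
    }
}
```

If monitoring still shows swapping with the computed value, the reserved share is probably too small for that node.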

  2. Minimize the Map Disk Spill I/O

Disk I/O is typically the performance bottleneck in Hadoop. Several settings can be changed to minimize spilling, for example compressing the mapper output and using about 70 percent of the mapper's heap memory for the spill buffer.
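To see why a larger spill buffer helps, here is a deliberately simplified model of map-side spilling: the buffer is flushed to disk each time it fills to a threshold percentage, so the spill count is roughly the map output divided by the usable buffer. The output size, buffer sizes, and the 80 percent threshold are hypothetical inputs, and the model ignores the per-record metadata that a real sort buffer also stores.

```java
public class SpillEstimator {

    /**
     * Estimates how many times a map task spills to disk.
     * Simplified model: one spill per (buffer size * threshold) of output.
     * Integer arithmetic is used to avoid rounding surprises.
     */
    static long estimateSpills(long mapOutputMb, long sortBufferMb, int spillPercent) {
        long usableTimes100 = sortBufferMb * spillPercent;   // usable MB, scaled by 100
        long outputTimes100 = mapOutputMb * 100;
        return (outputTimes100 + usableTimes100 - 1) / usableTimes100; // ceiling division
    }

    public static void main(String[] args) {
        long mapOutputMb = 400; // hypothetical output of one map task
        // Hypothetical small buffer versus a buffer of about 70% of a 1024 MB heap.
        System.out.println("100 MB buffer: " + estimateSpills(mapOutputMb, 100, 80) + " spills");
        System.out.println("716 MB buffer: " + estimateSpills(mapOutputMb, 716, 80) + " spills");
    }
}
```

Each extra spill means another round of disk writes plus a later merge, which is why fewer, larger spills are preferable.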

  3. Tuning Mapper Tasks

Unlike reducer tasks, the number of mapper tasks is set implicitly. The most common way to tune mappers is to control the number of mappers and the size of each task. When dealing with large files, Hadoop splits the file into smaller chunks so that mappers can process them in parallel. However, initializing a new mapper task usually takes a few seconds, and this overhead should be kept small. Suggestions for doing so:

  • Reuse JVMs across tasks.

  • Aim for map tasks that run 1-3 minutes each. If the average mapper running time is less than one minute, increase mapred.min.split.size so that fewer mappers are allocated per slot, which reduces the mapper initialization overhead.

  • Use CombineFileInputFormat when the input is a bunch of small files.
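The effect of raising the minimum split size can be sketched with a little arithmetic. The helper below follows the clamp rule that, to my understanding, the newer FileInputFormat uses to pick a split size (the block size bounded by the minimum and maximum split sizes). The file size and block size are hypothetical, and the split count ignores the small slack Hadoop allows on the last split.

```java
public class SplitSizeDemo {

    /** Clamps the block size between the minimum and maximum split size. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Approximate number of map tasks for one file (ceiling division). */
    static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 10L * 1024L * mb;   // hypothetical 10 GB input file
        long blockSize = 64L * mb;          // hypothetical HDFS block size

        long defaultSplit = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        long largerSplit = computeSplitSize(blockSize, 256L * mb, Long.MAX_VALUE);

        System.out.println("default min split size: " + numSplits(fileSize, defaultSplit) + " mappers");
        System.out.println("min split size 256 MB:  " + numSplits(fileSize, largerSplit) + " mappers");
    }
}
```

Fewer, longer-running mappers pay the few-second startup cost less often, which is the point of the 1-3 minute target.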

 Tuning Application Specific Performance 

Let's explore tips for improving Hadoop's application-specific performance.

  1. Minimize your Mapper Output 

Minimizing the mapper output can greatly improve overall performance, because the shuffle step is sensitive to disk I/O, network I/O, and memory.

Below are suggestions for achieving this.

  • Filter the records on the mapper side instead of the reducer side.

  • Use minimal data to form your map output key and map output value.


  2. Balancing the Reducer Load

Unbalanced reducer tasks cause another performance problem. Some reducers receive most of the mapper output and run exceptionally long compared to the other reducers.

Below are methods for addressing this.

  • Implement a better hash function in the Partitioner class.

  • Write a preprocessing job that uses MultipleOutputs to separate the keys. Then use another MapReduce job to process the special keys that cause the problem.

  • Reduce intermediate data with a combiner: implement a combiner that shrinks the data, which allows faster data transfer.
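The first suggestion can be illustrated without a cluster. The sketch below applies the usual hash partitioning rule (non-negative hash modulo the number of reducers) to a hypothetical set of integer keys that are all multiples of the reducer count, then repeats the exercise after mixing the hash bits. The key set, the reducer count, and the choice of a murmur3-style bit mixer are assumptions made for the example.

```java
import java.util.Arrays;

public class PartitionSkewDemo {

    /** Plain hash partitioning: non-negative hash modulo the reducer count. */
    static int plainPartition(int key, int numReducers) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numReducers;
    }

    /** Same scheme, but the hash bits are mixed first (murmur3-style finalizer). */
    static int mixedPartition(int key, int numReducers) {
        int h = key;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return (h & Integer.MAX_VALUE) % numReducers;
    }

    /** Hypothetical skew-prone key set: 0, stride, 2*stride, ... */
    static int[] strideKeys(int count, int stride) {
        int[] keys = new int[count];
        for (int i = 0; i < count; i++) {
            keys[i] = i * stride;
        }
        return keys;
    }

    /** Counts how many of the keys each reducer would receive. */
    static int[] loads(int[] keys, int numReducers, boolean mixed) {
        int[] counts = new int[numReducers];
        for (int key : keys) {
            int p = mixed ? mixedPartition(key, numReducers) : plainPartition(key, numReducers);
            counts[p]++;
        }
        return counts;
    }

    /** Largest per-reducer load. */
    static int max(int[] values) {
        int best = values[0];
        for (int v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    public static void main(String[] args) {
        int[] keys = strideKeys(1000, 4);
        System.out.println("plain hash: " + Arrays.toString(loads(keys, 4, false)));
        System.out.println("mixed hash: " + Arrays.toString(loads(keys, 4, true)));
    }
}
```

In a real job the same arithmetic would live in the getPartition method of a Partitioner subclass. Note that a better hash only helps when many distinct keys collide on one reducer; a single very hot key still needs the preprocessing approach from the second suggestion.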

  3. Speculative Execution

Tasks that take a long time to finish hold up the whole MapReduce job. Speculative execution addresses this by launching backup copies of slow tasks on other machines.

To enable speculative execution, set the configuration parameters 'mapreduce.map.speculative' and 'mapreduce.reduce.speculative' to true (older releases name them 'mapred.map.tasks.speculative.execution' and 'mapred.reduce.tasks.speculative.execution'). This can reduce job execution time when a task progresses slowly, for example because memory is unavailable on its node.


LZO compression utilization

Compression is often a good idea for intermediate results. Any Hadoop job that generates a non-negligible amount of map output will benefit from compressing its intermediate data with LZO.

While LZO adds a little bit of overhead to the CPU, it saves time during shuffle by reducing the amount of disk IO.

To enable compression of the map output, set mapred.compress.map.output to true; to use LZO specifically, the codec must also be selected, typically through mapred.map.output.compression.codec.

Proper tuning of the number of MapReduce tasks

In a Hadoop MapReduce job, if each task takes less than 30-40 seconds, reduce the number of tasks. Every task needs a JVM to be started and initialized, and torn down again once the mapper or reducer has finished, and these JVM operations are costly. Suppose a mapper only runs for 20-30 seconds: a JVM still has to start, initialize, and stop for it, which can take a considerable share of the total time. It is therefore advisable to have each task run for at least 1 minute.

When a job has more than 1 TB of input, consider increasing the block size of the input data set to 256 MB or even 512 MB, so that the number of tasks is smaller. The block size of existing files can be changed with a command like the following (the two paths are placeholders):

hadoop distcp -Ddfs.block.size=$[256*1024*1024] <source path> <destination path>
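A quick calculation shows what the larger block size buys. The sketch below counts the map tasks for 1 TB of input at several block sizes, assuming one map task per block, and multiplies that by a per-task startup cost. The 64 MB baseline and the 3-second startup cost are hypothetical values picked for the example.

```java
public class BlockSizeTaskCount {

    /** Approximate number of map tasks when each block becomes one split. */
    static long mapTasks(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes;
    }

    /** Total time spent only on starting task JVMs, summed over all tasks. */
    static long startupOverheadSeconds(long tasks, long startupSecondsPerTask) {
        return tasks * startupSecondsPerTask;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long oneTb = 1024L * 1024L * mb;
        long startupSeconds = 3; // hypothetical JVM startup cost per task

        for (long blockMb : new long[] {64, 256, 512}) {
            long tasks = mapTasks(oneTb, blockMb * mb);
            System.out.println(blockMb + " MB blocks -> " + tasks + " map tasks, "
                    + startupOverheadSeconds(tasks, startupSeconds) + " s of JVM startup in total");
        }
    }
}
```

The startup total is spread across all slots in the cluster, so it is a measure of wasted slot time rather than of wall-clock delay.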

As long as each task runs for at least 30-40 seconds, increase the number of mapper tasks to some multiple of the number of mapper slots in the cluster.

Do not schedule more reduce tasks than needed. For most jobs, the number of reduce tasks should be equal to or slightly less than the number of reduce slots in the cluster.

Combiner between Mapper and Reducer

If your algorithm computes some kind of aggregate, you can use a Combiner. The Combiner performs part of the aggregation before the data reaches the reducer. The Hadoop MapReduce framework runs the combiner intelligently to reduce the amount of data that must be written to disk and transferred between the Map and Reduce stages of the computation.
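The idea can be shown with plain Java collections standing in for the framework. In this sketch the map step emits one (word, 1) pair per word, and the combine step sums the pairs per word on the mapper side, so fewer records would have to be written and shuffled. The input line and the class and method names are hypothetical; this is not Hadoop's Combiner API, only the aggregation it performs.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {

    /** Map phase of word count: emits one (word, 1) pair per word. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return out;
    }

    /** Combine phase: sums the counts per word locally, before any shuffle. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        String line = "to be or not to be"; // hypothetical input of one mapper
        List<Map.Entry<String, Integer>> mapOutput = map(line);
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println("records without combiner: " + mapOutput.size());
        System.out.println("records with combiner:    " + combined.size());
        System.out.println(combined);
    }
}
```

This only works because summing is associative and commutative; an aggregate without those properties cannot simply reuse the reducer logic as a combiner.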

Use the most appropriate and compact Writable type for your data

Users migrating from Hadoop Streaming to Java MapReduce often use the Text writable type where it is not needed. Text can be convenient, but converting numeric data to and from UTF-8 strings is inefficient and can actually account for a substantial portion of CPU time. For non-textual data, consider binary Writables such as IntWritable.
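The difference between the two representations can be seen with the standard library alone. The sketch below encodes the same number once as UTF-8 text, the way it would travel as the payload of a Text value, and once as four big-endian bytes, which is comparable to what a binary int Writable serializes. The sample value and the helper names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class NumericEncodingDemo {

    /** Number as UTF-8 text: one byte per decimal digit (plus a sign if negative). */
    static byte[] asUtf8Text(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.UTF_8);
    }

    /** Number as four big-endian bytes. */
    static byte[] asBinary(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    /** Reading the text form back requires a parse for every record. */
    static int parseText(byte[] utf8) {
        return Integer.parseInt(new String(utf8, StandardCharsets.UTF_8));
    }

    /** Reading the binary form back is a fixed-width read. */
    static int readBinary(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    public static void main(String[] args) {
        int value = 1234567890; // hypothetical numeric field
        byte[] text = asUtf8Text(value);
        byte[] binary = asBinary(value);
        System.out.println("as text:   " + text.length + " bytes");
        System.out.println("as binary: " + binary.length + " bytes");
        System.out.println("round trips match: "
                + (parseText(text) == value && readBinary(binary) == value));
    }
}
```

Beyond the size difference, the text form costs a string conversion on every write and a parse on every read, which is where the CPU time goes.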

Reusing Writables 

Many Hadoop MapReduce users make one very common mistake: allocating a new Writable object for every output from a mapper or reducer. Consider, for example, a word-count mapper implemented as follows:

    public void map(...) {
        ...
        for (String word : words) {
            // A new Text and a new IntWritable are allocated for every word.
            output.collect(new Text(word), new IntWritable(1));
        }
    }

This implementation causes thousands of short-lived objects to be allocated. While the Java garbage collector does a reasonable job of handling this, it is more efficient to write:

    class MyMapper ... {
        // Allocated once and reused for every output record.
        Text wordText = new Text();
        IntWritable one = new IntWritable(1);

        public void map(...) {
            ...
            for (String word : words) {
                wordText.set(word);
                output.collect(wordText, one);
            }
        }
    }

Conclusion

In conclusion, there are many Hadoop MapReduce optimization techniques that help you get the most out of your MapReduce jobs, including using a combiner between mapper and reducer, LZO compression, careful tuning of the number of MapReduce tasks, and reusing Writables. When you are looking for a strategy to optimize a MapReduce job, you can employ the techniques described above.

