Thursday, October 1, 2020

What is Commissioning and Decommissioning of Hadoop DataNode?

Hadoop is an open-source framework that enables distributed processing of huge data sets across clusters of machines. A Hadoop DataNode stores the actual data blocks of the Hadoop file system (HDFS) and reports back to the NameNode, which keeps the file system metadata.

Commissioning and decommissioning help keep the cluster healthy: they let the administrator confirm that the DataNodes are working at their best and control which nodes are allowed to hold data as requirements change.

Commissioning of a Hadoop DataNode refers to adding a new DataNode to the cluster, while decommissioning refers to removing a node from it. A user can add or remove a DataNode in a large cluster directly, but doing so may cause some disturbance. So, if the user wants to scale the cluster, he needs to follow the commissioning and decommissioning steps given below. For more information, go through a big data online course.

Commissioning of Hadoop DataNode

To start the commissioning of a Hadoop DataNode, we need to update the includes file on both hosts, i.e. the ResourceManager and the NameNode. If that file is not present, we need to create an include file on both nodes.

Adding a new DataNode to the cluster requires the following stages.

  • First, we add the IP address of the new DataNode to the includes file referenced by the corresponding configuration parameter. Every entry in the file should be separated by a newline.

  • Next, we execute the refresh command as the hdfs user, or as a user with similar privileges.

  • Then, we start the DataNode process on the new host and keep it running.

  • Finally, we check the Hadoop dfsadmin report to confirm that the new host has connected properly; a sketch of these commands follows this list.
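As a rough sketch, the refresh and report commands from the last two bullets usually look like the following; the exact commands can vary by Hadoop version and distribution, so treat this as an assumption rather than the only form.

# run as the hdfs user (or a user with similar privileges)
sudo -u hdfs hadoop dfsadmin -refreshNodes

# confirm the new host appears in the cluster report
sudo -u hdfs hadoop dfsadmin -report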

  1. On the NameNode, add the include-file property to the hdfs-site.xml file.


[root@NN conf]# cd /etc/hadoop/conf

[root@NN conf]# vi hdfs-site.xml

# Add this property

<property>

<name>dfs.hosts</name>

<value>/etc/hadoop/conf/includes</value>

</property>

  2. Next, we add the new DataNode IP addresses by updating the includes file on the NameNode.

[root@NN conf]# vi /etc/hadoop/conf/includes

192.168.1.152

192.168.1.153

192.168.1.155


  3. Later, we need to edit the yarn-site.xml file on the node where the ResourceManager is running.

[root@RM conf]# vi yarn-site.xml

# Add the path of the includes file on this node.

 <property>

    <name>yarn.resourcemanager.nodes.include-path</name>

    <value>/etc/hadoop/conf/includes</value>

  </property>


  4. Finally, we update the includes file on the ResourceManager.

[root@RM conf]# cat /etc/hadoop/conf/includes

192.168.1.152

192.168.1.153

192.168.1.155





Hadoop DataNode configuration

Here, we set up the new Hadoop DataNode itself. In this stage, we copy all the configuration files that exist on the NameNode over to the new DataNode and then refresh the nodes so that the changes take effect.
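A minimal sketch of this step is shown below; the hostname DN3 and the scp-based copy are illustrative assumptions, and any other way of distributing the configuration files works just as well.

# copy the Hadoop configuration from the NameNode to the new DataNode
[root@NN conf]# scp /etc/hadoop/conf/* root@DN3:/etc/hadoop/conf/

# refresh the NameNode and the ResourceManager so they re-read the includes files
[root@NN conf]# sudo -u hdfs hadoop dfsadmin -refreshNodes
[root@RM conf]# yarn rmadmin -refreshNodes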

From now on, we can start the services on the new DataNode.

[root@DN3 ~]# service hadoop-hdfs-datanode start

[root@DN3 ~]# service hadoop-yarn-nodemanager start

This concludes the commissioning of the new Hadoop DataNode. We can now check the Hadoop dfsadmin report.

 sudo -u hdfs hadoop dfsadmin -report
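To quickly confirm that a particular new node has joined, the report output can be filtered for its address; 192.168.1.155 is one of the sample addresses used above.

sudo -u hdfs hadoop dfsadmin -report | grep 192.168.1.155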


Need for commissioning and decommissioning of a Hadoop DataNode

These processes cover the addition and removal of DataNodes within a Hadoop cluster. We cannot directly remove or add a DataNode in a large or real-time cluster, because doing so may cause a lot of disturbance within the cluster and may even make it fail.

Moreover, if the user wants to take a machine away for a hardware upgrade, or simply wants to retire a node, decommissioning is required, because Hadoop DataNodes (slave nodes) cannot be shut down immediately. Similarly, to scale the existing cluster by adding new DataNodes without taking the cluster offline, we need to perform commissioning. Both processes look similar, with only small differences.

Decommissioning of Hadoop DataNode

In this context, we will discuss the decommissioning process. To decommission a DataNode, we need to configure the exclude-file property on the NameNode side. Moreover, we need to perform the decommissioning activity in non-peak hours only, because any process still running on the node being decommissioned may fail.

The process follows steps similar to those discussed earlier for commissioning; the small difference is that it removes a node instead of adding one.

Here, it is to be noted that a host entry does not live in both the include and exclude files at the same time; the two lists are mutually exclusive.

[root@NN conf]# vi hdfs-site.xml

<property>

<name>dfs.hosts.exclude</name>

<value>/etc/hadoop/conf/excludes</value>

</property>

 

# create exclude file

[root@NN conf]# cat /etc/hadoop/conf/excludes

192.168.1.155


After this, we update the yarn-site.xml file on the ResourceManager and check the excludes file there.

[root@RM conf]# vi yarn-site.xml

 <property>

    <name>yarn.resourcemanager.nodes.exclude-path</name>

    <value>/etc/hadoop/conf/excludes</value>

  </property>

# excludes file on the ResourceManager listing the node to remove

[root@RM conf]# cat /etc/hadoop/conf/excludes

192.168.1.155

Later, we refresh these nodes to trigger the decommissioning and let it run to completion.
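As with commissioning, the refresh is typically triggered with the dfsadmin and rmadmin refresh commands, after which the report shows the node's decommission status; a rough sketch, assuming the same hosts as above.

[root@NN conf]# sudo -u hdfs hadoop dfsadmin -refreshNodes
[root@RM conf]# yarn rmadmin -refreshNodes

# the report shows the node as "Decommission in progress" and, once done, "Decommissioned"
[root@NN conf]# sudo -u hdfs hadoop dfsadmin -report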

After the decommissioning process finishes, the decommissioned hardware can be shut down safely for maintenance or for any further use.
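Once the node reports as decommissioned, the services on it can be stopped before the machine is powered off; a sketch using the same init-script style as the start commands above, assuming the decommissioned host is reachable as DN3.

[root@DN3 ~]# service hadoop-yarn-nodemanager stop
[root@DN3 ~]# service hadoop-hdfs-datanode stop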

Moreover, the Hadoop Balancer plays an important role here. It is a kind of load balancer, a built-in utility that makes sure no DataNode is over-utilized. When the balancer utility runs, it checks whether any Hadoop DataNode is under-utilized or over-utilized and, if needed, redistributes blocks across the DataNodes. However, we should run the Hadoop Balancer only in off-peak hours on a real cluster, because running it during peak hours puts a heavy load on the network: it may transfer a large amount of data that the cluster is unable to handle.
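A hedged example of invoking the balancer in off-peak hours is shown below; the 10 percent threshold is only an illustrative value, not something the article prescribes.

# run as the hdfs user; -threshold is the allowed deviation in disk usage, in percent
sudo -u hdfs hdfs balancer -threshold 10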

Important points in the process

There are some important points to be kept in mind while conducting the commissioning process.

  • The most important thing while performing commissioning is to make sure that the DataNode we are going to add has everything it needs and is configured to act as a Hadoop DataNode only.

  • The second point is that all the necessary DataNode addresses should be mentioned in the include file only, and not in the exclude file.

  • The third and last point is that we should run the Hadoop Balancer, since the Balancer evens out the data distribution among the different DataNodes.




Thus, it is clear from the above details that commissioning and decommissioning of Hadoop DataNodes are the way to add or remove nodes. Both are mostly performed in non-peak hours to keep the extra burden off a cluster holding huge amounts of data. The steps mentioned above in the article will help to start and finish either process successfully, and with the commands and syntax shown above, the task can be completed easily. The DataNodes themselves store the data securely under the Hadoop file system and are used for the purposes stated above.

To get more useful insights into Hadoop and its DataNodes, one can opt for big data online training from industry experts like IT Guru. Learning from such a platform may help to enhance skills and also to plan for a better career.

