In a Hadoop cluster, very large amounts of data are written to and read from disk over the network. To handle these data operations efficiently, you need to tell the NameNode where and how the slave machines (DataNodes) are distributed. To achieve this, you configure a rack awareness script so that the NameNode stores data on the nearest available/reachable DataNode. Using the rack topology script, and assuming Hadoop's default replication factor of 3, the NameNode stores two replicas of a data block in one rack and the third replica in a different rack. This is very useful when a single machine, or even an entire rack, goes down: in case of a rack failure, the third replica is still available from a different rack. High availability can be easily achieved by configuring the rack topology script. It is always advisable to have the same number of slave/DataNodes in every rack so that data is distributed evenly across all nodes. Below is an image explaining block storage in a Hadoop cluster using rack awareness.
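The replica placement described above can be sketched in a few lines of Python. This is a simplified simulation, not HDFS's actual `BlockPlacementPolicyDefault` code; the rack and node names are hypothetical:

```python
import random

def place_replicas(racks, writer_rack):
    """Sketch of rack-aware placement for replication factor 3:
    first replica on the writer's rack, second on a different rack,
    third on the same rack as the second (but a different node)."""
    first = (writer_rack, random.choice(racks[writer_rack]))
    other_racks = [r for r in racks if r != writer_rack]
    second_rack = random.choice(other_racks)
    second = (second_rack, random.choice(racks[second_rack]))
    # Third replica shares the second replica's rack but not its node.
    remaining = [n for n in racks[second_rack] if n != second[1]]
    third = (second_rack, random.choice(remaining))
    return [first, second, third]

# Hypothetical 2-rack cluster with 2 DataNodes per rack.
racks = {"/rack1": ["node-a", "node-b"], "/rack2": ["node-c", "node-d"]}
print(place_replicas(racks, "/rack1"))
```

With this policy, losing one whole rack still leaves at least one replica alive, which is exactly the failure mode rack awareness protects against.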
Configure Rack Topology using Cloudera manager
Configuring rack topology using Cloudera Manager is very simple: Cloudera provides the topology script (topology.py) and the topology map file. The topology map file (topology.map) is simply an XML file listing all the nodes in the cluster along with their rack IDs. In a Cloudera-managed cluster, you will find both files in the /etc/hadoop/conf.cloudera.hdfs/ configuration directory.
To see the current active configuration in a Cloudera Hadoop cluster, log in to the NameNode and list the latest NameNode process directory under “/var/run/cloudera-scm-agent/process/“. Using the command below, you can identify the latest process directory.
sudo ls /var/run/cloudera-scm-agent/process/ | grep -i namenode | sort -n | tail -n1
You will see the topology.map and topology.py files in the NameNode process directory. You can execute topology.py by passing an IP or host name as an argument, and it prints the rack ID of that DataNode. By default, all nodes are given the /default rack ID. Below are the steps to configure rack IDs for the DataNodes.
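To make the script's behavior concrete, here is a minimal sketch of what such a topology script does: look each host/IP argument up in a host-to-rack mapping and print its rack, falling back to /default. This is a simplified stand-in, not Cloudera's shipped topology.py (which builds its mapping by parsing topology.map); the dictionary entries here are taken from the example map later in this post:

```python
#!/usr/bin/env python
import sys

# Simplified in-memory mapping; the real script reads topology.map instead.
RACK_MAP = {
    "10.204.47.16": "/rack1",
    "10.204.47.12": "/rack2",
}

def resolve(hosts):
    """Return one rack ID per host; unknown hosts map to /default."""
    return [RACK_MAP.get(h, "/default") for h in hosts]

if __name__ == "__main__":
    # Hadoop invokes the script with one or more hosts as arguments
    # and reads the space-separated rack IDs from stdout.
    print(" ".join(resolve(sys.argv[1:])))
```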
- Log in to Cloudera Manager and navigate to the Hosts page; you will see all the hosts in the cluster assigned the /default rack. [screenshot below]
- Select the nodes you want to assign to a rack, click “Assign Rack” from “Actions for Selected”, and name the rack. Rack IDs are named like Linux directory paths, e.g. /rack1, /rack2, /rack/rack1, /rack/rack_2. In this post I am configuring 2 racks with 2 nodes each. Find the screenshots below.
- Provide the rack name you want (/rack1, /rack2); it should start with “/”.
- Once you have assigned the rack ID to your slave nodes, the screen looks like the one below.
After assigning the racks you will have a stale configuration in Cloudera Manager, which requires configuration deployment and a cluster restart. Go ahead and restart your cluster after deploying the client configuration from the Cluster page. It is always good practice to deploy the client configuration before restarting after making changes.
Check the topology.map file in the configuration directory; it will have content similar to the one below.
<?xml version="1.0" encoding="UTF-8"?>
<!--Autogenerated by Cloudera Manager-->
<topology>
  <node name="cld-papp-drd01" rack="/rack1"/>
  <node name="10.204.47.16" rack="/rack1"/>
  <node name="cld-papp-drd02" rack="/rack1"/>
  <node name="10.204.47.80" rack="/rack1"/>
  <node name="cld-papp-drd03" rack="/rack2"/>
  <node name="10.204.47.17" rack="/rack2"/>
  <node name="cld-papp-dre01" rack="/default"/>
  <node name="10.204.47.11" rack="/default"/>
  <node name="cld-papp-drn01" rack="/default"/>
  <node name="10.204.47.10" rack="/default"/>
  <node name="cld-papp-drn02" rack="/default"/>
  <node name="10.204.47.74" rack="/default"/>
  <node name="cld-papp-drn03" rack="/rack2"/>
  <node name="10.204.47.12" rack="/rack2"/>
</topology>
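If you want to inspect this mapping programmatically, the file is plain XML and parses with Python's standard library. A small sketch (the inline XML here is a shortened version of the map above):

```python
import xml.etree.ElementTree as ET

TOPOLOGY_MAP = """<?xml version="1.0" encoding="UTF-8"?>
<topology>
  <node name="cld-papp-drd01" rack="/rack1"/>
  <node name="10.204.47.12" rack="/rack2"/>
</topology>"""

def load_rack_map(xml_text):
    # Each <node> element carries a host name or IP and its assigned rack.
    root = ET.fromstring(xml_text)
    return {n.get("name"): n.get("rack") for n in root.findall("node")}

rack_map = load_rack_map(TOPOLOGY_MAP)
print(rack_map["10.204.47.12"])  # → /rack2
```

In a real cluster you would read the file from the process directory (e.g. with `ET.parse(path)`) instead of an inline string.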
Now you can execute the script as shown below, passing an IP or host name as an argument, as discussed earlier in the post.
python /etc/hadoop/conf/topology.py 10.204.47.12
Find the execution and output below.
Hadoop also allows you to configure your own topology script. From Cloudera Manager, navigate to the HDFS service configuration and search for “topology”; you will find the property net.topology.script.file.name. Provide the full path to your script and save it. Find the screenshot below.
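A custom topology script only has to honor Hadoop's contract: it is invoked with one or more host names or IPs as arguments and must print one rack path per argument to stdout. Below is a minimal sketch that reads a flat "host rack" mapping file; the file path and format are assumptions for illustration, not anything Cloudera ships:

```python
#!/usr/bin/env python
import sys

# Hypothetical mapping file with one "host rack" pair per line, e.g.:
#   10.204.47.16 /rack1
#   10.204.47.12 /rack2
MAPPING_FILE = "/etc/hadoop/conf/rack-mapping.txt"

def load_mapping(path):
    """Build a host -> rack dict from the flat mapping file."""
    mapping = {}
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) == 2:
                mapping[parts[0]] = parts[1]
    return mapping

if __name__ == "__main__" and len(sys.argv) > 1:
    mapping = load_mapping(MAPPING_FILE)
    # Unknown hosts fall back to /default, matching Hadoop's convention.
    print(" ".join(mapping.get(h, "/default") for h in sys.argv[1:]))
```

Make the script executable (`chmod +x`) and deploy it to the same path on every node that runs the NameNode, then point net.topology.script.file.name at it.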