Who Is the One (Master or Client) Uploading the File in a Hadoop Cluster?
→ Hadoop cluster configuration done using AWS Cloud ←
Let's discuss the problem statement:-
~ Whenever a client uploads a file (for example, f.txt) of size 32MB and the replication factor is 3:
✴️ Does the client take the entire data to the master, or does the master provide the IP addresses of the DataNodes so that the client can upload the file to the DataNodes directly?
✴️ Question: Who is the one uploading the file?
✴️ Answer: The client gets the IPs from the master and uploads the file to the DataNodes.
✴️ Let's prove this.
→ Pre-requisite: Create an AWS account, launch instances (OS) using the EC2 service, and install Hadoop and the JDK on all the instances.
In my case, I have created 4 instances: a NameNode (or master), a client (or Hadoop client), and 2 DataNodes (or slaves).
>> Firstly, install the JDK using the command:-
#rpm -ivh jdk-8u171-linux-x64.rpm
>> Then, install the Hadoop package using the command:-
#rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force
>> Check the versions using the commands:- #java -version and #hadoop version
Step-1: Configure the NameNode (or Master)
Similarly, configure DataNode-1 (or Slave-1) and DataNode-2 (or Slave-2). The only difference between the configurations is in "core-site.xml": on the master we give the IP 0.0.0.0 so that anyone can connect, whereas to connect the DataNodes to this NameNode, we give the master's public IP in the DataNodes' "core-site.xml" file.
>>core-site.xml: This tells the Hadoop daemons where the NameNode runs in the cluster. It contains configuration settings of the Hadoop core, such as I/O settings common to HDFS and MapReduce.
>>hdfs-site.xml: This contains configuration settings of the HDFS daemons (i.e. NameNode, DataNode, Secondary NameNode). It also includes the replication factor and block size of HDFS (mainly on the client).
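As a rough sketch of what this looks like (the port 9001 and the placeholder MASTER_PUBLIC_IP below are assumptions, not values from my cluster), a DataNode's "core-site.xml" would be along these lines:

```xml
<?xml version="1.0"?>
<!-- core-site.xml on a DataNode: tells it where the NameNode runs.
     MASTER_PUBLIC_IP and port 9001 are placeholders; on the master
     itself, the host part is 0.0.0.0 so that anyone can connect. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://MASTER_PUBLIC_IP:9001</value>
  </property>
</configuration>
```

(`fs.default.name` is the Hadoop 1.x property naming the default filesystem; substitute your own master IP and port.)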
Step-2: Configure the Client (or Hadoop Client)
Here, I have set replication = 3 and replica (block) size = 32 MB = 33554432 bytes in the "hdfs-site.xml" file. (NOTE: If you don't specify a size, it defaults to 64 MB, and the replication actually achieved depends on the DataNodes available.)
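For reference, a minimal sketch of the client's "hdfs-site.xml" with these values (the property names are the standard Hadoop 1.x ones; only the two values are taken from my setup):

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml on the client: replication factor and block size. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>33554432</value> <!-- 32 * 1024 * 1024 bytes = 32 MB -->
  </property>
</configuration>
```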
Step-3: Clear caches in NameNode
Command- #echo 3 > /proc/sys/vm/drop_caches
Step-4: Format the NameNode
Command- #hadoop namenode -format
Step-5: Start Hadoop daemon in NameNode
Command to start master- #hadoop-daemon.sh start namenode
Command to stop master- #hadoop-daemon.sh stop namenode
>> The "jps" command is used to check all the Hadoop daemons running on the machine, such as NameNode, DataNode, ResourceManager, NodeManager, etc.
Similarly, start the Hadoop daemon in both DataNodes (Command- #hadoop-daemon.sh start datanode) and check whether both are connected by using the 'jps' command.
NOTE: The client has no daemon to start/stop, so don't use the 'jps' command on the client; it shows nothing useful there. To check whether the client is connected, simply run the command- #hadoop dfsadmin -report | less , and it will show you the DataNodes available.
Step-6: Check the DataNodes connected to the NameNode
Run Command: #hadoop dfsadmin -report
NOTE: You can run the above command on any machine: any DataNode, the master, or the client. Now that the HDFS (Hadoop) cluster is set up, all of them will show the same output, i.e. how many DataNodes are available in this cluster.
Step-7: Upload a file in the Hadoop Cluster
A file is always uploaded by the Hadoop client into the HDFS cluster, here under the root directory → '/'. So, before uploading the file, run this command on the client:-
# hadoop fs -ls /
Now, I am going to upload a file 'we2.txt' of size 328 MB.
Run this command on the client instance to upload the file:- #hadoop fs -put we2.txt /
→ While the file is uploading, run the commands below on both DataNodes and on the NameNode to check how the replicas are created and where the file actually lands: on the DataNodes or on the NameNode?
Run Command :- #tcpdump -i eth0 tcp port 22 -n (SSH works on port no. 22, though it can be configured)
#tcpdump -i eth0 tcp port 50010 -n -x (Trace data packets at port no. 50010)
>> In the above picture of the NameNode, some packets are sent and received as the client uploads the file. That is because the CLIENT is the one who sends a request to the master asking for the IPs of the DataNodes, and the master then sends the DataNode IPs back to the client; it is this internal exchange that shows up as the traffic on the screen above.
>> Here, in the above picture of DataNode-1, two IPs are shown
(DN1-IP: 172.31.42.116 and CLIENT-IP: 188.8.131.52),
which indicates that part of the file (being uploaded by the client) is transferred directly to DataNode-1, while the remaining part of the file is transferred (uploaded) to DataNode-2, as shown below:-
Thus, the CLIENT uploads the file directly to the DataNodes, not to the master.
>>Now, If we talk about the REPLICATION:
As we know, the number of replicas to be created and their size are decided by the client. Say the uploaded file "we2.txt" is striped by the master into 2 parts, A and B. Then 3 replicas of part A must be created, and the same for part B. To create these replicas (copies of A and B), data has to be transferred from one DataNode to another, wherever the corresponding replica has to be stored. For example, if A is uploaded to DataNode-1, its replicas are created on DataNodes 2 and 3. That is exactly what the above picture shows: two IPs appear on the screen, one of DataNode-1 and one of DataNode-2, clearly showing that data is being exchanged (transferred) between the two DataNodes.
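As a quick sanity check of the numbers involved (a hypothetical sketch, not output captured from the cluster), simple shell arithmetic shows how many 32 MB blocks the 328 MB file splits into and how many block replicas the cluster has to store in total:

```shell
#!/bin/sh
# Hypothetical sketch: block-count arithmetic for the upload above.
FILE_SIZE=$((328 * 1024 * 1024))   # we2.txt, 328 MB
BLOCK_SIZE=$((32 * 1024 * 1024))   # dfs.block.size chosen by the client
REPLICATION=3                      # dfs.replication chosen by the client

# Ceiling division: a partial last block still occupies one block slot
BLOCKS=$(( (FILE_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE ))
echo "Blocks: $BLOCKS"
echo "Block replicas: $((BLOCKS * REPLICATION))"
```

So 328 MB splits into 11 blocks (the last one partial), and with replication 3 the DataNodes between them store 33 block replicas.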
You can also check it through the WEBUI at —
"<Master Public IP>:50070"
Thus, it shows replication = 3 and block size = 32 MB, which is fixed (or decided) by the Hadoop client.
CONCLUSION:- Whenever the CLIENT uploads a file, the entire data (the file itself) goes directly to the DATANODES, not to the MASTER. The master only provides the IP addresses of the DataNodes to the client, and the client is always the one who decides the replication factor and the size of the replicas to be created in the Hadoop cluster.
Hope you found this article informative and beneficial. For more such valuable content, don't forget to press the clap icon below!