Who Is the One (Master/Client) Uploading the File in a Hadoop Cluster?

Lalita Sharma
Oct 18, 2020

→ Hadoop Cluster Configuration Done Using AWS Cloud ←

Let’s discuss the problem statement:-

~ Suppose the client uploads a file (for example, f.txt) of size 32 MB and the replication factor is 3.

✴️ Does the client carry the entire data to the master, or does the master provide the IP addresses of the DataNodes so that the client can upload the file to the DataNodes directly?

✴️ Question: Who is the one uploading the file?

✴️ Answer: The client gets the DataNode IPs from the master and uploads the file directly to the DataNodes.

✴️ Prove this.

→ Prerequisite: Create an AWS account, launch instances (OS) using the EC2 service, and install Hadoop and the JDK on all the instances.

In my case, I have created 4 instances: a NameNode (or master), a client (or Hadoop client), and 2 DataNodes (or slaves).

>>First, install the JDK using this command:-

#rpm -i -v -h jdk-8u171-linux-x64.rpm

>>Then, install the Hadoop package using this command:-

#rpm -i -v -h hadoop-1.2.1-1.x86_64.rpm --force

>>Check the versions using these commands:-

#java -version

#hadoop version
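If both installs succeeded, java -version should report 1.8.0_171 and hadoop version should report Hadoop 1.2.1, matching the RPM packages above.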

Step-1: Configure the NameNode (or Master)

Both configuration files live in /etc/hadoop:

/etc/hadoop → # ls
core-site.xml
hdfs-site.xml

Similarly, I configured DataNode-1 (or Slave-1) and DataNode-2 (or Slave-2). The only difference between the configurations is in core-site.xml: on the master we give the IP 0.0.0.0 so that anyone can connect, while on the DataNodes we give the master's public IP (13.235.243.15) so that they connect to this NameNode.

>>core-site.xml: This tells the Hadoop daemons where the NameNode runs in the cluster. It contains core Hadoop configuration settings, such as the I/O settings common to HDFS and MapReduce.

>>hdfs-site.xml: This contains the configuration settings of the HDFS daemons (NameNode, DataNode, Secondary NameNode). It also holds the replication factor and the HDFS block size (set mainly on the client).
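>>To make this concrete, here is a minimal sketch of the two core-site.xml variants. fs.default.name is the standard Hadoop 1.x property; the port 9001 is an illustrative choice here (use whatever port you configured), and only the master's public IP comes from this setup.

On the NameNode (bind to all interfaces so anyone can connect):

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://0.0.0.0:9001</value>
    </property>
</configuration>

On each DataNode (and on the client), point at the master instead:

<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://13.235.243.15:9001</value>
    </property>
</configuration>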

Step-2: Configure the Client (or Hadoop Client)

(Screenshots: the client's core-site.xml and hdfs-site.xml)

Here, I have set replication = 3 and block size = 32 MB = 33554432 bytes in the client's hdfs-site.xml file. (NOTE: If you don't set a block size, Hadoop 1.x defaults to 64 MB, and the replication factor defaults to 3; the replicas actually placed are limited by the number of DataNodes available.)
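>>For reference, the client's hdfs-site.xml carrying these settings would look roughly like this (dfs.replication and dfs.block.size are the standard Hadoop 1.x property names; 33554432 = 32 × 1024 × 1024 bytes):

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.block.size</name>
        <value>33554432</value>
    </property>
</configuration>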

Step-3: Clear the caches on the NameNode

(Screenshot: clearing the caches)

Command- #echo 3 > /proc/sys/vm/drop_caches

(Writing 3 to drop_caches frees the Linux page cache along with cached dentries and inodes, so the next steps start from a clean state.)

Step-4: Format the NameNode

(Screenshot: formatting the master)

Command- #hadoop namenode -format
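(Formatting initializes a fresh, empty HDFS namespace, i.e. the fsimage metadata, in the NameNode's storage directory; it does not touch the DataNodes' disks.)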

Step-5: Start the Hadoop daemon on the NameNode

Start the master and check whether it is running by using the ‘jps’ command.

Command to start master- #hadoop-daemon.sh start namenode

Command to stop master- #hadoop-daemon.sh stop namenode

Command- #jps

>>The ”jps” command lists the Java processes on the machine, which is how you check which Hadoop daemons (NameNode, DataNode, etc.) are running there.
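For example, on the NameNode the output looks something like this (the PIDs below are made up and will differ on your machine):

#jps
1234 NameNode
5678 Jps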

Similarly, start the Hadoop daemon on both DataNodes and check that both are running with the ‘jps’ command.
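The command mirrors the one used on the master, just with the datanode service:

Command to start a DataNode- #hadoop-daemon.sh start datanode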

(Screenshots: DataNode-1 and DataNode-2)

NOTE: On the client there is no daemon to start or stop, and ‘jps’ is of no use there since no Hadoop daemon runs on it. To check whether the client is connected, simply run command- #hadoop dfsadmin -report | less ; it will show you the DataNodes available.

Step-6: Check the DataNodes connected to the NameNode

(Screenshot: the NameNode (or Master))

Run Command: #hadoop dfsadmin -report

NOTE: You can run the above command on any machine: a DataNode, the master, or the client. Now that the HDFS (Hadoop) cluster is set up, all of them will show the same output, i.e. how many DataNodes are available in the cluster.

Step-7: Upload a file into the Hadoop Cluster

(Screenshot: the Hadoop client)

The file is always uploaded by the Hadoop client into the HDFS cluster, here into the root directory → ‘/’. So, before uploading the file, run this command on the client:-

# hadoop fs -ls /

Now, I am going to upload a file ‘we2.txt’ of size 328 MB.

Run this command on the Client instance to upload the file:- #hadoop fs -put we2.txt /

→ While the file uploads, run the commands below on both DataNodes and on the NameNode to check how the replicas are created and where the file actually lands: on the DataNodes or on the NameNode?

Run Command :- #tcpdump -i eth0 tcp port 22 -n (SSH runs on port 22 by default, though it can be configured)

#tcpdump -i eth0 tcp port 50010 -n -x (traces the data packets on port 50010, the default DataNode data-transfer port)
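You can also watch the client-to-NameNode metadata traffic itself by tracing the NameNode's RPC port, i.e. whatever port you set in fs.default.name (9001 in the sketch above; yours may differ):

#tcpdump -i eth0 tcp port 9001 -n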

(Screenshot: the NameNode (or Master) transferring the DataNode IPs to the client)

>>In the above picture of the NameNode, some packets are received and sent on the master as well while the client uploads the file. This is because the CLIENT sends a request to the master asking for the IPs of the DataNodes, and the master sends the DataNode IPs back to the client; that internal exchange is the traffic visible on this screen.

(Screenshot: DataNode-1)

>>Here, in the above picture of DataNode-1, two IPs are shown

(DN1-IP : 172.31.42.116 and CLIENT-IP : 13.232.147.52)

which indicates that part of the file being uploaded by the client is transferred directly to this DataNode-1, while the remaining part of the file is uploaded to DataNode-2, as shown below:-

(Screenshot: DataNode-2)

Thus, the CLIENT uploads the file directly to the DataNodes, not to the Master.

>>Now, let's talk about REPLICATION:

(Screenshot: DataNode-1)

As we know, the number of replicas to be created and the block size are decided by the client. Take the uploaded file “we2.txt”: suppose it is striped into two parts, A and B. Then 3 replicas of A must be created, and likewise for B. To create those replicas (copies of A and B), data has to be transferred from one DataNode to the DataNode where the corresponding replica is to be placed. In other words, if A is uploaded to DataNode-1, its replicas are created on DataNodes 2 and 3. That is exactly what the above picture shows: two IPs appear on the screen, one of DataNode-1 and one of DataNode-2, clearly showing that data is being exchanged (transferred) between the two DataNodes.
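A quick back-of-the-envelope check with the numbers used in this article: a 328 MB file with a 32 MB block size splits into ceil(328/32) = 11 blocks, and with replication = 3 the cluster targets 11 × 3 = 33 block replicas, i.e. roughly 984 MB of raw storage (assuming enough DataNodes are available to place every replica).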

You can also check it through the WebUI at:

“<Master Public IP>:50070”

(Screenshot: the WebUI)

Thus, it shows replication = 3 and block size = 32 MB, which were fixed (decided) by the Hadoop client.

CONCLUSION:- Whenever the CLIENT uploads a file, the entire file is uploaded directly to the DATANODES, not to the MASTER. The master only provides the IP addresses of the DataNodes to the client, and it is the client who decides the replication factor and the block size to be used in the Hadoop cluster.

HENCE, Proved.

Hope you find this article informative and beneficial. For more such valuable content, don't forget to press the clap icon below!

THANKYOU(*-*)
