Configuring Hadoop and Starting Cluster Services Using an Ansible Playbook over AWS

Lalita Sharma
6 min read · Dec 19, 2020

→ Hadoop + Ansible + AWS = Configuration via AUTOMATION 🌠←

“Let us realize that: the privilege to work is a gift, the power to work is a blessing, the love of work is success”.✨

Hola Connections!😃🙌

👉In this article, we will discuss some very interesting concepts: Ansible, Hadoop clusters, and AWS. We will also automate the Hadoop cluster configuration on both the master node (NameNode) and the data node through Ansible.

👉Here is a brief task description📝:-

🔅Configure Hadoop and start cluster services using an Ansible playbook

✳️ What is ANSIBLE?

Ansible is an open-source automation tool, or platform, used for IT tasks such as configuration management, application deployment, intraservice orchestration, and provisioning. Automation simplifies complex tasks, not just making developers’ jobs more manageable but allowing them to focus attention on other tasks that add value to an organization. In other words, it frees up time and increases efficiency.

Some basic terms related to Ansible are:-

1. Control Node❇️

Any machine with Ansible installed. You can run Ansible commands and playbooks by invoking the ansible or ansible-playbook command from any control node. You can use any computer that has a Python installation as a control node — laptops, shared desktops, and servers can all run Ansible. However, you cannot use a Windows machine as a control node. You can have multiple control nodes.

2. Target Node (or Managed Node)❇️

The network devices (and/or servers) you manage with Ansible. Managed nodes are also sometimes called hosts. Ansible is not installed on managed nodes.

3. Modules❇️

The units of code Ansible executes. Each module has a particular use, from administering users on a specific type of database to managing VLAN interfaces on a specific type of network device. You can invoke a single module with a task, or invoke several different modules in a playbook. Starting in Ansible 2.10, modules are grouped in collections.

4. Inventory❇️

A list of managed nodes. An inventory file is also sometimes called a ‘hostfile’. Your inventory can specify information like the IP address for each managed node. An inventory can also organize managed nodes, creating and nesting groups for easier scaling. To learn more about inventory, see the Working with Inventory section.

5. Ansible Playbooks❇️

Ordered lists of tasks, saved so you can run those tasks in that order repeatedly. Playbooks can include variables as well as tasks. Playbooks are written in YAML and are easy to read, write, share and understand. To learn more about playbooks, see Intro to playbooks. (A minimal sketch follows right after this list.)
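A playbook is plain YAML: a list of plays, each targeting an inventory group and carrying its own ordered task list. Here is a minimal sketch; the group name webservers and the httpd package are placeholders chosen purely for illustration:

- hosts: webservers
  tasks:
    - name: install the web server package
      package:
        name: httpd
        state: present

    - name: make sure the service is running
      service:
        name: httpd
        state: started

Running 🔸# ansible-playbook play.yml executes every task, in order, on every host in the webservers group.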

Ansible architecture

✳️ What is HDFS?

HDFS stands for Hadoop Distributed File System, which stores large datasets in Hadoop. It runs on commodity hardware and is highly fault tolerant. HDFS follows a master/slave architecture in which a number of machines form a cluster. The cluster comprises a NameNode and multiple slave nodes known as DataNodes. The NameNode stores the metadata, i.e., the number of data blocks, their replicas, their locations, and other details. The DataNodes, on the other hand, store the actual data and serve read/write requests from clients.
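For a feel of how clients interact with such a cluster: once it is up, files are written and read with standard Hadoop shell commands; the NameNode answers the metadata lookups while the blocks themselves flow to and from the DataNodes. The path and file name below are just illustrative:

🔸# hadoop fs -mkdir /demo
🔸# hadoop fs -put report.csv /demo/
🔸# hadoop fs -cat /demo/report.csv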

✳️ What is AWS?

Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform, offering over 175 fully featured services from data centers globally. Millions of customers — including the fastest-growing startups, largest enterprises, and leading government agencies — are using AWS to lower costs, become more agile, and innovate faster.

Let’s perform the task 👩‍💻—

👉 Firstly, I launched 3 EC2 instances on AWS:-

3 instances

Here, one node is the controller node and the other two are target nodes: one is the NameNode and the other is the DataNode.

👉 In the controller node, I created two directories named “dn_files” and “nn_files”, and inside each of them I created two files, ‘core-site.xml’ and ‘hdfs-site.xml’, for the DataNode and NameNode configuration respectively, so that I can directly copy these files to the respective target nodes, as you can see below —

CN — /root/ls
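Since the screenshots can be hard to read, here is a sketch of what these four files typically contain for a Hadoop 1.x cluster. The port 9001 and the directories /nn and /dn are illustrative choices of mine, not necessarily the exact values in my setup:

nn_files/hdfs-site.xml —

<configuration>
  <property>
    <!-- directory where the NameNode keeps its metadata -->
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>

nn_files/core-site.xml —

<configuration>
  <property>
    <!-- address the NameNode listens on; 0.0.0.0 accepts DataNodes on any interface -->
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>

The dn_files pair has the same shape: its hdfs-site.xml uses dfs.data.dir with a storage directory such as /dn, and its core-site.xml points fs.default.name at the NameNode’s IP instead of 0.0.0.0.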

👉 Now, to configure the controller node, first install the Ansible software, then create an Ansible configuration file and an inventory file.
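📍On a Red Hat-style instance, installing Ansible is typically a one-liner (assuming Python 3 and pip are already available on the box):

🔸# pip3 install ansible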

📍Here, in the folder ‘/etc/ansible’, I created one file for the Ansible configuration named “ansible.cfg” and another for the inventory named “hosts”. I also created an Ansible playbook named “hadoop.yml” to configure the Hadoop cluster, along with two more files for variable declarations, as you can see below —

cd /etc/ansible

👉 In the inventory file, we have to write the IP addresses of the target nodes along with their username and connection type. In AWS we don’t have a password, as we use a private SSH key to connect and log in, so here we have to write the key name with its proper path.

hosts file (inventory)
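A sketch of what such an inventory can look like; the group names, IPs, and key path below are placeholders, not my exact values:

[namenode]
13.233.xx.xx ansible_user=ec2-user ansible_ssh_private_key_file=/root/hadoopkey.pem ansible_connection=ssh

[datanode]
65.0.xx.xx ansible_user=ec2-user ansible_ssh_private_key_file=/root/hadoopkey.pem ansible_connection=ssh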

👉 After the inventory, go to the Ansible configuration file and update the inventory path.

ansible.cfg — ansible configuration file
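For reference, a minimal ansible.cfg for this setup can look like the following. Disabling host_key_checking and enabling privilege escalation are common additions for fresh EC2 instances (ec2-user needs sudo to install packages), so treat this as a sketch rather than an exact copy of my file:

[defaults]
inventory = /etc/ansible/hosts
host_key_checking = false

[privilege_escalation]
become = true
become_method = sudo
become_user = root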

👉 Now it’s time to run the playbook, but before running the main playbook, always check the connectivity to the target nodes with Ansible’s ping module: —

🔸# ansible all -m ping

Then run the Ansible playbook, and you will see the result in the pictures below: —

🔸# ansible-playbook hadoop.yml

hadoop.yml playbook running….

✅ Firstly, the NameNode will be configured, as I have written the play for the NameNode first.

hadoop.yml
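Since the screenshot may be hard to read, here is a minimal sketch of how such a NameNode play can be structured. The rpm file names, the /etc/hadoop destination, and the /nn directory are assumptions based on a typical Hadoop 1.x setup, not a verbatim copy of my hadoop.yml:

- hosts: namenode
  tasks:
    - name: copy the JDK and Hadoop installers to the node
      copy:
        src: "/root/{{ item }}"
        dest: "/root/{{ item }}"
      loop:
        - jdk-8u171-linux-x64.rpm
        - hadoop-1.2.1-1.x86_64.rpm

    - name: install the JDK
      command: rpm -ivh /root/jdk-8u171-linux-x64.rpm

    - name: install Hadoop (this rpm pairing needs --force)
      command: rpm -ivh /root/hadoop-1.2.1-1.x86_64.rpm --force

    - name: create the NameNode metadata directory
      file:
        path: /nn
        state: directory

    - name: copy the NameNode configuration files
      copy:
        src: "nn_files/{{ item }}"
        dest: "/etc/hadoop/{{ item }}"
      loop:
        - core-site.xml
        - hdfs-site.xml

    - name: format the NameNode directory (pipes Y into the confirmation prompt)
      shell: echo Y | hadoop namenode -format

    - name: start the NameNode daemon
      command: hadoop-daemon.sh start namenode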

✅ After the NameNode, the DataNode will be configured. As you can see in the pictures, the playbook first installed both the Java JDK and the Hadoop software, then copied the .xml files, then formatted the NameNode directory and started the Hadoop services on both the NameNode and the DataNode.

hadoop.yml
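For completeness, the DataNode play mirrors the NameNode play (same installer-copy and rpm-install tasks), with the dn_files configs, a storage directory instead of a metadata directory, and no format step; roughly:

- hosts: datanode
  tasks:
    # ...same JDK and Hadoop install tasks as in the NameNode play, then:

    - name: create the DataNode storage directory
      file:
        path: /dn
        state: directory

    - name: copy the DataNode configuration files
      copy:
        src: "dn_files/{{ item }}"
        dest: "/etc/hadoop/{{ item }}"
      loop:
        - core-site.xml
        - hdfs-site.xml

    - name: start the DataNode daemon (DataNodes are never formatted)
      command: hadoop-daemon.sh start datanode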

✔️ So it processed everything without giving any error and accomplished all the tasks written in my Ansible playbook‼️

playbook ran successfully!!

💫Here, my Ansible playbook “hadoop.yml” ran successfully💫.

👉 Now, if you go and check the NameNode and DataNode, you will find that both nodes have been configured and your Hadoop cluster is working fine with proper connectivity. If you want to check how many DataNodes are available in your cluster, run the command —

# hadoop dfsadmin -report
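(On Hadoop 2.x and later, the same report is produced by 🔸# hdfs dfsadmin -report. You can also run the JDK’s jps tool on each node to confirm that the NameNode and DataNode daemons are up.)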

It will show the result as one DataNode available and connected, since I have configured only one here. Thus, I have configured the Hadoop cluster through Ansible automation on AWS🎉.

Hence, Task completed successfully💥!!

📌 For a better understanding of the code, visit my GitHub repo: — https://github.com/akshrasharma2666/Configure-Hadoop-Cluster-Using-Ansible-over-AWS-Task-12.1

I hope you liked my article😊. If you did, don’t forget to leave your response📝 below and make sure to press the clap👏👏 icon👇.

THANK YOU(*-*)🌻
