Part-2: Add new data node to existing Hadoop cluster

October 9, 2020

In this part we will discuss how to add a new data node to an existing, running Hadoop cluster. Follow the step-by-step guide in the video tutorial.

Create a new data node

Ensure that the Hadoop master node is up and running. Create a new virtual machine with Ubuntu as the base image. We will use this machine as the new data node.

Open an SSH terminal to the new data node and install Java using the commands below.

sudo apt-get update
sudo apt-get install default-jdk
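
To confirm the installation, check the Java version:

java -version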

Create a new user to run Hadoop on the data node and give it sudo access. Use the same username as on the master node.

sudo adduser hadoop_user
#Set and confirm the new user's password at the prompt.

#Add the user to the sudo group
sudo usermod -aG sudo hadoop_user

#Switch to hadoop_user, or log out and log back in as hadoop_user
sudo su - hadoop_user
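
To verify the sudo group membership took effect, list the user's groups:

groups hadoop_user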

Edit the hosts file on both the master node and the data node as shown below.

sudo vi /etc/hosts

Add/update the lines below in the hosts file with the Hadoop master node IP and the data node IP. Change the IP addresses to match your virtual machines.

10.128.0.8 hadoop-master
10.128.0.16 hadoop-datanode-1
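
To verify that the hostnames resolve, ping each node (assuming ICMP traffic is allowed between the machines):

ping -c 2 hadoop-master
ping -c 2 hadoop-datanode-1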

Set up authorization between nodes

From the master node, copy the public key to the data node; this allows the master node to SSH to the data node without a password. If a private/public key pair does not already exist on the master node, generate one with the "ssh-keygen" command as shown below.
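
#Generate a private/public key pair on the master node if one does not exist (accept the defaults at the prompts)
ssh-keygen -t rsa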

#Copy public key ~/.ssh/id_rsa.pub from master node to data node.
ssh-copy-id hadoop_user@hadoop-datanode-1

The above command prompts for the data node password once, then copies the public key from the master node to the data node.

If you face errors with the "ssh-copy-id" command, copy the public key manually: open the ~/.ssh/id_rsa.pub file on the master node, copy its content, and paste it into the ~/.ssh/authorized_keys file on the data node.
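
As a manual alternative, this one-liner run from the master node appends the key over SSH (assuming the key is at the default path and password login is still enabled):

cat ~/.ssh/id_rsa.pub | ssh hadoop_user@hadoop-datanode-1 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'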

Change the permissions of the ~/.ssh/authorized_keys file on the data node to 0600; sshd rejects the file if it is writable by group or others.

chmod 0600 ~/.ssh/authorized_keys
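
The ~/.ssh directory itself should also be restricted to its owner:

chmod 700 ~/.ssh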

Now try SSH from the master node to the data node. "hadoop_user" should be authenticated automatically using the private key. Accept the host-key fingerprint if prompted.

ssh hadoop_user@hadoop-datanode-1

Set up environment variables

Set the environment variables below in the ~/.bashrc file on the data node. Note that these are the same environment variables as on the master node. Change the JAVA_HOME path to match your Java version.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_CLASSPATH=/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar

Reload the ~/.bashrc file to refresh the above environment variables, or log out and log back in to the data node.

source ~/.bashrc
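
To confirm the variables are set in the current shell, print them:

echo $JAVA_HOME
echo $HADOOP_HOME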

Set up Hadoop

On the data node, create a directory for Hadoop and change its owner and permissions as shown below.

sudo mkdir /usr/local/hadoop
sudo chown -R hadoop_user: /usr/local/hadoop
chmod -R 755 /usr/local/hadoop

From the master node, copy (or sync) the Hadoop home directory to the data node. This avoids having to download and configure Hadoop separately on the data node.

rsync -auvx /usr/local/hadoop hadoop-datanode-1:/usr/local/
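
Since the source path has no trailing slash, rsync creates /usr/local/hadoop on the data node rather than copying only its contents. To confirm, list the directory on the data node:

ls /usr/local/hadoop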

On the data node, remove and recreate the namenode and datanode directories. This ensures the data directories are empty on the new data node, since the block data and storage IDs copied over from the master must not be reused.

rm -rf $HADOOP_HOME/hadoop_data/hdfs/namenode
rm -rf $HADOOP_HOME/hadoop_data/hdfs/datanode
mkdir -p $HADOOP_HOME/hadoop_data/hdfs/namenode
mkdir -p $HADOOP_HOME/hadoop_data/hdfs/datanode

Add the new data node to the slaves file on both the master and data nodes. (On Hadoop 3.x this file is named workers instead of slaves.)

vi $HADOOP_CONF_DIR/slaves

Add the lines below to the slaves file.

hadoop-master
hadoop-datanode-1

On the new data node, edit the masters file and ensure it contains the "hadoop-master" hostname.

vi $HADOOP_CONF_DIR/masters

Add/update the line below.

hadoop-master

All configuration is now complete. On the new data node, use the command below to start the DataNode daemon.

hdfs --daemon start datanode

Check whether the DataNode has started by issuing the jps command.

jps

Output:
3377 DataNode
3437 Jps

Use the command below on the master node to confirm the new data node has been added.

hdfs dfsadmin -report

Output:
Live datanodes (2):

Name: 10.128.0.16:9866 (hadoop-datanode1)
Hostname: hadoop-datanode1
Decommission Status : Normal
Configured Capacity: 51848519680 (48.29 GB)
DFS Used: 42663936 (40.69 MB)
Non DFS Used: 2937319424 (2.74 GB)
DFS Remaining: 48851759104 (45.50 GB)
DFS Used%: 0.08%
DFS Remaining%: 94.22%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Oct 10 04:24:30 UTC 2020
Last Block Report: Sat Oct 10 04:22:27 UTC 2020
Num of Blocks: 11


Name: 127.0.0.1:9866 (localhost)
Hostname: ip6-loopback
Decommission Status : Normal
Configured Capacity: 51848519680 (48.29 GB)
DFS Used: 41852928 (39.91 MB)
Non DFS Used: 4550021120 (4.24 GB)
DFS Remaining: 47239868416 (44.00 GB)
DFS Used%: 0.08%
DFS Remaining%: 91.11%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Oct 10 04:24:29 UTC 2020
Last Block Report: Sat Oct 10 04:12:45 UTC 2020
Num of Blocks: 13

This concludes adding a new data node to an existing Hadoop setup.
