In this part, we will discuss how to add a new data node to an existing, running Hadoop cluster. Follow the step-by-step guide in the video tutorial.
Create a new datanode
Ensure that the Hadoop master node is up and running. Create a new virtual machine with Ubuntu as the base image. We will use this machine as the new data node.
Open an SSH terminal on the new data node and install Java using the commands below.
sudo apt-get update
sudo apt-get install default-jdk
Create a new user to run Hadoop on the datanode and give it sudo access. Use the same username as on the master node.
sudo adduser hadoop_user
#Set and confirm the new user's password at the prompt.
#Add the user to the sudo group
sudo usermod -aG sudo hadoop_user
#Switch the current user to hadoop_user, or log out and log back in as hadoop_user
sudo su - hadoop_user
Edit the hosts file on both the master node and the datanode as shown below.
sudo vi /etc/hosts
Add/update the lines below in the hosts file with the Hadoop master node IP and the datanode IP. Change the IP addresses to match your virtual machines.
10.128.0.8 hadoop-master
10.128.0.16 hadoop-datanode-1
Set up authorization between nodes
From the master node, copy the public key to the datanode. This allows the master node to SSH to the datanode without a password. If the master node does not yet have a private/public key pair, generate one with the "ssh-keygen" command.
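A minimal sketch of key generation, assuming an RSA key at the default location and an empty passphrase (required for passwordless SSH):

```shell
# Generate an RSA key pair on the master node; -N "" sets an empty passphrase
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
```

Skip this step if ~/.ssh/id_rsa already exists on the master node.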
#Copy public key ~/.ssh/id_rsa.pub from the master node to the data node.
ssh-copy-id hadoop_user@hadoop-datanode-1
The above command prompts once for the datanode password, then copies the public key from the master node to the data node.
If you face errors with the "ssh-copy-id" command, copy the public key manually: open the ~/.ssh/id_rsa.pub file on the master node, copy its contents, and paste them into the ~/.ssh/authorized_keys file on the data node.
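The manual copy can be sketched as follows; the key string below is a placeholder for the actual contents of id_rsa.pub:

```shell
# On the master node: print the public key so it can be copied
cat ~/.ssh/id_rsa.pub

# On the datanode: ensure ~/.ssh exists with safe permissions,
# then append the copied key (placeholder shown) to authorized_keys
mkdir -p ~/.ssh && chmod 700 ~/.ssh
echo "ssh-rsa AAAA...copied-key... hadoop_user@hadoop-master" >> ~/.ssh/authorized_keys
```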
Change the permissions of the ~/.ssh/authorized_keys file on the datanode to 0600 (sshd rejects keys in a group- or world-writable file under its default StrictModes setting).
chmod 0600 ~/.ssh/authorized_keys
Now try SSH from the master node to the data node. "hadoop_user" should be authenticated automatically using the private key. Accept the host-key fingerprint if prompted.
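For example, using the hostname from the hosts file entries above:

```shell
# Should log in without a password prompt and print the datanode's hostname
ssh hadoop_user@hadoop-datanode-1 hostname
```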
Set Up Environment Variables
Set the environment variables below in the ~/.bashrc file on the data node. Note that these are the same as on the master node. Change the JAVA_HOME path to match your Java version.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_CLASSPATH=/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar
Reload the ~/.bashrc file to refresh the environment variables above, or log out and log back in to the datanode.
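A quick way to reload and sanity-check the variables (the JAVA_HOME value echoed should match the path set above):

```shell
# Reload the shell configuration in the current session
source ~/.bashrc
# Verify the variables took effect
echo $JAVA_HOME
java -version
```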
On the datanode, create a directory for Hadoop and change its owner/permissions as below.
sudo mkdir /usr/local/hadoop
sudo chown -R hadoop_user: /usr/local/hadoop
sudo chmod -R 755 /usr/local/hadoop
From the master node, copy (or sync) the Hadoop home directory to the data node. This avoids having to download and configure Hadoop separately on the datanode.
rsync -auvx /usr/local/hadoop hadoop-datanode-1:/usr/local/
On the datanode, remove and recreate the namenode and datanode directories. This ensures the data directories are empty on the datanode.
rm -rf $HADOOP_HOME/hadoop_data/hdfs/namenode
rm -rf $HADOOP_HOME/hadoop_data/hdfs/datanode
mkdir -p $HADOOP_HOME/hadoop_data/hdfs/namenode
mkdir -p $HADOOP_HOME/hadoop_data/hdfs/datanode
Add the new data node to the slaves file on both the master and data nodes. Add/update the line below in the slaves file, keeping any existing entries.
hadoop-datanode-1
On the new data node, edit the masters file and ensure it contains the "hadoop-master" hostname. Add/update the line below.
hadoop-master
All configurations are now complete. On the new data node, use the command below to start the datanode daemon.
hdfs --daemon start datanode
Check that the datanode has started by issuing the jps command.
jps
Output:
3377 DataNode
3437 Jps
Use the command below on the master node to confirm that the new datanode has been added.
hdfs dfsadmin -report
Output:
Live datanodes (2):

Name: 10.128.0.16:9866 (hadoop-datanode1)
Hostname: hadoop-datanode1
Decommission Status : Normal
Configured Capacity: 51848519680 (48.29 GB)
DFS Used: 42663936 (40.69 MB)
Non DFS Used: 2937319424 (2.74 GB)
DFS Remaining: 48851759104 (45.50 GB)
DFS Used%: 0.08%
DFS Remaining%: 94.22%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Oct 10 04:24:30 UTC 2020
Last Block Report: Sat Oct 10 04:22:27 UTC 2020
Num of Blocks: 11

Name: 127.0.0.1:9866 (localhost)
Hostname: ip6-loopback
Decommission Status : Normal
Configured Capacity: 51848519680 (48.29 GB)
DFS Used: 41852928 (39.91 MB)
Non DFS Used: 4550021120 (4.24 GB)
DFS Remaining: 47239868416 (44.00 GB)
DFS Used%: 0.08%
DFS Remaining%: 91.11%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Oct 10 04:24:29 UTC 2020
Last Block Report: Sat Oct 10 04:12:45 UTC 2020
Num of Blocks: 13
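As an additional sanity check on the master node, the datanode-to-rack mapping can also be listed; both datanode addresses should appear in the output:

```shell
# List the network topology of live datanodes as seen by the namenode
hdfs dfsadmin -printTopology
```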
This concludes adding a new data node to an existing Hadoop setup.