Part-1: How to install Hadoop HDFS on a single-node cluster

October 5, 2020

In this guide we will discuss how to install Hadoop HDFS on a single-node cluster using a Google Cloud virtual machine. Follow the video tutorial below; you can come back to this page to copy the various commands.

Prepare new server

Create a new VM in Google Cloud with Ubuntu as the base image. If possible, create an instance with plenty of RAM and CPU to support the various Hadoop Java processes.

Open an SSH terminal in the browser window. Create a new user named “hadoop_user” and give sudo access to this new user. Then log in as hadoop_user.

sudo adduser hadoop_user
#Set and confirm the new user’s password at the prompt.

#Add user to sudo group
sudo usermod -aG sudo hadoop_user

#Switch the current user to hadoop_user, or log out and log back in as hadoop_user
sudo su - hadoop_user
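
Optionally, you can confirm that the new account exists and is in the sudo group with the quick check below (not part of the original steps, just a sanity check).

id hadoop_user
#The output should list the sudo group among the user's groups.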

Once logged in as hadoop_user, update the package lists using the command below.

sudo apt-get update

Install the latest version of the JDK using the command below, and check the Java version once the installation is complete.

sudo apt-get install default-jdk
java -version

Take note of the Java version and find the Java install directory.

To find the Java install directory, use the “which” and “readlink” commands as shown below.

which java
Output: /usr/bin/java

readlink /usr/bin/java
Output: /etc/alternatives/java

readlink /etc/alternatives/java
Output: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java

Take note of the Java path above. Stripping the trailing /jre/bin/java part gives the JAVA_HOME path (/usr/lib/jvm/java-8-openjdk-amd64) used in the next few steps.
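
As an optional shortcut, the one-liner below should print the JAVA_HOME path directly by resolving the real java binary and stripping the trailing /jre/bin/java (or /bin/java on newer JDK layouts). This assumes the OpenJDK directory layout shown above.

#Resolve the real java binary and strip the trailing bin path to get JAVA_HOME
readlink -f "$(which java)" | sed 's|/jre/bin/java$||; s|/bin/java$||'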

Download and copy Hadoop binary

Go to the Hadoop releases page and copy the link to the latest Hadoop .tar.gz file from any mirror site.

http://hadoop.apache.org/releases.html

Download the Hadoop .tar.gz file with the wget command into a Downloads directory.

mkdir ~/Downloads
cd ~/Downloads
wget http://apache.forsale.plus/hadoop/common/hadoop-3.1.4/hadoop-3.1.4.tar.gz
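
Optionally, verify the download before extracting it. Apache publishes a .sha512 checksum file alongside each release; the exact URL below is an assumption (older releases may only be available on archive.apache.org), so adjust it to match the mirror you used.

#Download the published checksum and compare it with the local file
wget https://downloads.apache.org/hadoop/common/hadoop-3.1.4/hadoop-3.1.4.tar.gz.sha512
cat hadoop-3.1.4.tar.gz.sha512
sha512sum hadoop-3.1.4.tar.gz
#The two SHA-512 values should match.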

Extract the .tar.gz file.

tar -zxvf hadoop-3.1.4.tar.gz

Move the extracted Hadoop directory to /usr/local/hadoop.

sudo mv hadoop-3.1.4 /usr/local/hadoop

Set environment variables

Edit the ~/.bashrc file and set the environment variables as shown; Hadoop’s scripts and commands rely on them.

vi ~/.bashrc

Add the lines below at the end of the ~/.bashrc file and save it.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_CLASSPATH=/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar

Take note of the JAVA_HOME variable: we have used the path that we found earlier using the “which” and “readlink” commands. You may need to change this path if you are using a different version of Java.

Reload the environment variables, or log out and log back in.

source ~/.bashrc
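
To confirm the variables were loaded correctly, you can run the quick checks below; hadoop version should report release 3.1.4.

echo $JAVA_HOME
echo $HADOOP_HOME
hadoop version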

Configure Hadoop

Add the JAVA_HOME path to the hadoop-env.sh file.

vi $HADOOP_CONF_DIR/hadoop-env.sh

Add the line below to this file.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Edit the Hadoop core-site.xml file.

vi $HADOOP_CONF_DIR/core-site.xml

Add the lines below to core-site.xml. Make sure the correct address is set for the NameNode: use the internal IP address or internal hostname of the Google Cloud VM here. Do not use localhost, because you will need to access the web pages from the external IP address later. Alternatively, you can use 0.0.0.0 to bind the service to all available IP addresses. 54310 is the HDFS port.

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://0.0.0.0:54310</value>
</property>
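
Optionally, you can confirm that Hadoop picks up the new setting with the hdfs getconf command; it should print the hdfs://...:54310 address configured above.

hdfs getconf -confKey fs.defaultFS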

Edit the YARN site settings and add the lines shown below.

vi $HADOOP_CONF_DIR/yarn-site.xml

Add/update the lines below in yarn-site.xml. Change the IP 10.128.0.8 to your VM’s internal IP or hostname. Take note of port 5349, which is set for accessing the Resource Manager web page.

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>10.128.0.8</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>${yarn.resourcemanager.hostname}:5349</value>
</property>
<property>
  <name>yarn.nodemanager.webapp.address</name>
  <value>${yarn.nodemanager.hostname}:5249</value>
</property>

Edit the mapred-site.xml file and add the lines below.

vi $HADOOP_CONF_DIR/mapred-site.xml

Add/update the configuration below in mapred-site.xml.

<property>
  <name>mapreduce.jobtracker.address</name>
  <value>0.0.0.0:54311</value>
</property>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

Edit the hdfs-site.xml settings and add the replication factor and the NameNode and DataNode directories.

vi $HADOOP_CONF_DIR/hdfs-site.xml

Add/update the configuration below in hdfs-site.xml. Note that a single-node cluster has only one DataNode, so you may want to lower dfs.replication to 1 to avoid under-replicated block warnings.

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>

Create data directories

Create the HDFS data directories.

mkdir -p $HADOOP_HOME/hadoop_data/hdfs/namenode
mkdir -p $HADOOP_HOME/hadoop_data/hdfs/datanode

Create the masters file and add the internal IP address/hostname as a line in this file.

vi $HADOOP_CONF_DIR/masters

Add the line below. Replace the IP address with your VM’s internal IP.

10.128.0.8

Create the slaves file and add the internal IP address/hostname in this file. (In Hadoop 3.x this file has been renamed to workers, so you may need to use $HADOOP_CONF_DIR/workers instead; Hadoop 2.x uses slaves.) For a single-node setup it is just one line. If you were doing a multi-node setup, you would add multiple IP addresses in this file.

vi $HADOOP_CONF_DIR/slaves

Add the line below. Replace the IP address with your VM’s internal IP.

10.128.0.8

All the configuration files are now in place. Next, we need to format the NameNode using the command below.

hdfs namenode -format
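
If the format succeeded, the NameNode directory configured earlier should now contain a current directory with a VERSION file and an initial fsimage. A quick way to check:

ls $HADOOP_HOME/hadoop_data/hdfs/namenode/current
#Expect files such as VERSION and fsimage_* here.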

Setup authorization

Set up SSH keys so that the current user has passwordless access to localhost. Generate a new SSH key and copy it to the authorized_keys file as shown below.

ssh-keygen

Add the generated key to the authorized_keys file.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
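
SSH is strict about file permissions. If the passwordless login in the next step still asks for a password, tightening the permissions as shown below usually fixes it.

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys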

Now try the ssh command on the internal IP address of the VM and on localhost. You should be able to log in without a password. Accept the host fingerprint (it is added to the known_hosts file) when prompted by SSH.

ssh localhost
ssh 10.128.0.8
ssh 0.0.0.0

Edit the hosts file and bind localhost to the internal IP address of the VM.

sudo vi /etc/hosts

Add/modify the line below to point localhost to the internal IP address of the VM. This way localhost resolves to the internal IP address instead of 127.0.0.1, which is needed in case some Hadoop services start on localhost instead of the internal IP address and we need to access them from outside the VM.

10.128.0.8 localhost
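
You can verify the change with getent; assuming you modified the original 127.0.0.1 entry rather than keeping it alongside, localhost should now resolve to the internal IP.

getent hosts localhost
#Should print 10.128.0.8 (your internal IP) for localhost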

Start Hadoop Services

All configuration is now complete. Start HDFS, YARN, and the Job History Server using the commands below.

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
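
Note that the mr-jobhistory-daemon.sh script is deprecated in Hadoop 3.x; the equivalent command below starts the same Job History Server and can be used instead.

mapred --daemon start historyserver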

Run the command below to check the status of the Hadoop services.

jps

Output:
4000 JobHistoryServer
3650 ResourceManager
3010 NameNode
4243 Jps
3403 SecondaryNameNode
3789 NodeManager
3150 DataNode

You can also use the command below to see the HDFS status.

hdfs dfsadmin -report

Output...
Live datanodes (1):

Name: 127.0.0.1:9866 (ip6-loopback)
Hostname: ip6-loopback
Decommission Status : Normal
Configured Capacity: 51848519680 (48.29 GB)
DFS Used: 41852928 (39.91 MB)
Non DFS Used: 4550049792 (4.24 GB)
DFS Remaining: 47239839744 (44.00 GB)
DFS Used%: 0.08%
DFS Remaining%: 91.11%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Oct 10 04:16:53 UTC 2020
Last Block Report: Sat Oct 10 04:12:45 UTC 2020
Num of Blocks: 13

Test Hadoop

Let’s run a few Hadoop commands to test the installation.

Browse the Hadoop root directory.

hadoop fs -ls /

Create a test Hadoop directory.

hadoop fs -mkdir /test01
#Check if test01 exists
hadoop fs -ls /

Create a test file and upload it to Hadoop.

vi test.txt

Add some sample text to the file above and upload it to the Hadoop root directory.

hadoop fs -put test.txt /
#List the content of the root directory to ensure test.txt exists
hadoop fs -ls /
#View content from HDFS
hadoop fs -cat /test.txt
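
As an optional extra check, you can ask HDFS how it stored the uploaded file; fsck reports the file’s blocks and replication (with a single DataNode and dfs.replication left at 3 you will see under-replicated blocks, which is expected on a single-node cluster).

hdfs fsck /test.txt -files -blocks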

Hadoop installation and testing on the single-node cluster are now complete. You can access the Resource Manager web page at http://<VM_Public_IP>:5349 and the Node Manager web page at http://<VM_Public_IP>:5249. Ensure that ports 5349 and 5249 are open in the Google VPC network -> Firewall settings.
