Part-1: Hadoop HDFS Installation on a Single Node Cluster with a Google Cloud VM

In this guide we will install Hadoop HDFS as a single-node cluster on a Google Cloud VM. Follow the video tutorial, and come back to this page to copy the various commands.

Create a new VM in Google Cloud as shown in the video. Choose an E2 Standard machine type, Ubuntu as the base image, and a 10 GB Standard Persistent Disk. You can mark the instance as preemptible to save on VM cost.
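
If you prefer the command line, a roughly equivalent VM can be created with the gcloud CLI. This is only a sketch: the instance name, zone, image family and machine size below are assumptions, so adjust them to match your project.

gcloud compute instances create hadoop-single-node \
    --zone=us-central1-a \
    --machine-type=e2-standard-2 \
    --image-family=ubuntu-2004-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=10GB \
    --boot-disk-type=pd-standard \
    --preemptible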

Open an SSH terminal in the browser window.
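
If you would rather connect from your own terminal, the gcloud CLI can open the same SSH session (the instance name and zone here are the assumed values from the sketch above):

gcloud compute ssh hadoop-single-node --zone=us-central1-a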

Update the package repositories using this command:

sudo apt-get update

Install the JDK using this command (default-jdk installs the distribution's default OpenJDK version):

sudo apt-get install default-jdk

Once the JDK is installed, check the Java version using this command:

java -version

Take note of the Java version and find the Java install directory. This will be needed later to set the JAVA_HOME path.

To find the Java install directory, use the “which” and “readlink” commands as shown in the video.

which java

Output: /usr/bin/java

readlink /usr/bin/java

Output: /etc/alternatives/java

readlink /etc/alternatives/java

Output: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java
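
As a shortcut, the two readlink steps can be combined into one command; readlink -f resolves the whole symlink chain in a single go and should print the same path as above:

readlink -f $(which java)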

Take note of the path above. JAVA_HOME will be this path without the trailing /jre/bin/java part, i.e. /usr/lib/jvm/java-8-openjdk-amd64; it will be used in the next few steps.

Go to the Hadoop releases page and copy the link to the latest Hadoop tar file.

http://hadoop.apache.org/releases.html

Download the Hadoop .tar.gz file using the wget command (the mirror URL below is an example and may change over time, so use the link you copied from the releases page):
wget http://apache.forsale.plus/hadoop/common/hadoop-3.1.4/hadoop-3.1.4.tar.gz

Extract the .tar.gz file:

tar -zxvf hadoop-3.1.4.tar.gz

Move the extracted Hadoop directory to /usr/local (sudo is needed because /usr/local is owned by root):

sudo mv hadoop-3.1.4 /usr/local/hadoop
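
Since /usr/local is owned by root, you may also want to hand ownership of the Hadoop directory to your login user so the later steps (creating the data directories and formatting the NameNode) work without sudo. A minimal sketch, assuming you are running as the user that will operate Hadoop:

sudo chown -R $USER:$USER /usr/local/hadoop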

Edit the .bashrc file and set the environment variables as shown. These variables are used by the Hadoop scripts and by the remaining steps in this guide.

vi ~/.bashrc

Add the lines below at the end of the ~/.bashrc file and save it.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_CLASSPATH=/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar

Note the JAVA_HOME variable: we have used the path found earlier with the “which” and “readlink” commands. You may need to change this path if you are using a different version of Java.

Reload the environment variables:

source ~/.bashrc
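
To confirm the new variables are picked up, you can run a quick sanity check; hadoop version should print the Hadoop 3.1.4 release information if HADOOP_HOME and PATH are set correctly:

echo $JAVA_HOME
hadoop version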

Add the JAVA_HOME path in the hadoop-env.sh file:

vi $HADOOP_CONF_DIR/hadoop-env.sh

Add the line below to this file.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Edit the Hadoop core site settings and add the lines shown below. This sets the default HDFS URI and port.

vi $HADOOP_CONF_DIR/core-site.xml

Add the lines below inside the <configuration> tags in core-site.xml. The value should point to the NameNode's host name or IP address; since this is a single-node setup, you can just use localhost.

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:54310</value>
</property>

Edit the YARN site settings and add the lines shown below, again inside the <configuration> tags.

vi $HADOOP_CONF_DIR/yarn-site.xml

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>

Edit the mapred-site.xml file and add the lines below.

vi $HADOOP_CONF_DIR/mapred-site.xml

<property>
<name>mapreduce.jobtracker.address</name>
<value>localhost:54311</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

Edit the HDFS site settings and add the replication factor and the NameNode and DataNode directories.

vi $HADOOP_CONF_DIR/hdfs-site.xml

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>

Create the HDFS data directories:
mkdir -p $HADOOP_HOME/hadoop_data/hdfs/namenode
mkdir -p $HADOOP_HOME/hadoop_data/hdfs/datanode  

Create the masters file and add a localhost line to it; this is the IP address or host name of the master node. For a single-node setup it is just localhost.
touch $HADOOP_CONF_DIR/masters
vi $HADOOP_CONF_DIR/masters

Add the line below (or the master's IP address):
localhost

Create the workers file, which lists the IP addresses or host names of the data nodes (in Hadoop 2.x this file was named slaves; Hadoop 3.x uses workers).
vi $HADOOP_CONF_DIR/workers

Add the line below (or the DataNode IP addresses):
localhost


All configuration is now complete. Next, format the NameNode using the command below.
hdfs namenode -format

Set up SSH keys so the current user has passwordless access to localhost.
Generate a new SSH key (press Enter to accept the defaults and leave the passphrase empty) and copy it to the authorized_keys file as shown.
ssh-keygen
Add the generated key to the authorized_keys file.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Try the ssh command now. It should log in without asking for a password.
ssh localhost

Start HDFS, YARN and the MapReduce history server:
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
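
On Hadoop 3.x the mr-jobhistory-daemon.sh script may print a deprecation warning; the newer equivalent command, if you prefer it, is:

mapred --daemon start historyserver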

Run the command below to check the status of the Hadoop daemons:
jps
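
If everything started correctly, jps should list roughly the following daemons (the process IDs will differ on your machine):

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
JobHistoryServer
Jps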

Let's run a few Hadoop commands to test the installation.
List the HDFS root directory:
hadoop fs -ls /

Create a test HDFS directory and list the root directory again:
hadoop fs -mkdir /test01
hadoop fs -ls /
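
To test a little further, you can copy a local file into the new directory and read it back; the file name here is just an example:

echo "hello hdfs" > sample.txt
hadoop fs -put sample.txt /test01/
hadoop fs -cat /test01/sample.txt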

Hadoop installation is now complete.