Part 1: Hadoop HDFS installation on a single-node cluster with an Ubuntu Docker image

In this guide we will install Hadoop HDFS on a single-node Ubuntu machine running in Docker. You can follow this guide along with the video tutorial.

This guide assumes you already have Docker installed on your machine. We will first pull the Ubuntu Docker image and run it.

docker pull ubuntu



Run Ubuntu with the configuration below. Note the port mappings; we will need them later to access the various Hadoop web applications.

docker run --name ubuntu_hadoop --hostname=quickstart.hadoop --privileged=true -it -p 8888:8888 -p 7180:7180 -p 30022:22 -p 50070:50070 -p 50090:50090 ubuntu:latest

If you are connecting over SSH to the server where Docker is installed, make sure the lines below exist in the hosts file of the computer you are connecting from. This lets you open the Hadoop web pages from that computer. If you are browsing directly on the machine where Docker is installed, you can skip this step.

For Windows, open C:\Windows\System32\drivers\etc\hosts and add these lines:

127.0.0.1 localhost
127.0.0.1 quickstart.hadoop
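
On Linux or macOS, the equivalent entry goes into /etc/hosts instead of the Windows path above; a minimal way to append it (requires admin rights) is:

echo "127.0.0.1 quickstart.hadoop" | sudo tee -a /etc/hosts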

If you use a proxy for browsing, turn the proxy settings off in the browser (or add an exception for quickstart.hadoop) when opening the Hadoop web pages such as http://quickstart.hadoop:50070; otherwise the hosts file entry above will not take effect. Firefox is convenient here because its proxy settings can be changed independently of the system settings.

If you are behind a proxy, run the commands below to enable the proxy for apt-get; otherwise skip them.

echo "Acquire::http::proxy \"http://10.42.0.1:3333\";" >> /etc/apt/apt.conf
apt-get update

Install the various packages that we will need.

Install the vim editor, since it is not included in the Ubuntu Docker image.

apt-get install vim

Install Java Development Kit,

apt-get install default-jdk

Install ping (inetutils-ping) and the OpenSSH server

apt-get install inetutils-ping
apt-get install openssh-server

Configure SSH server

vi /etc/ssh/sshd_config

Add or update the line below, which allows “root” login. We will simply use the “root” account for this single-node installation. You could also create a dedicated Hadoop user with sudo access, but that is optional.

PermitRootLogin yes
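
If you prefer a non-interactive edit, a sed one-liner along these lines should achieve the same thing (assuming the stock sshd_config shipped by the openssh-server package, where the directive is present but commented out):

sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
grep PermitRootLogin /etc/ssh/sshd_config     # verify the change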

Then start the SSH server with the command below

/etc/init.d/ssh start

Or

service ssh restart

Set root’s password with the “passwd” command

passwd

It is a good idea to set up passwordless SSH; this avoids password prompts when starting the Hadoop services. You can follow the tutorial below on setting up passwordless SSH.
http://hadooptutorials.info/2017/09/30/set-passwordless-ssh-for-hadoop/
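
In short, the steps that tutorial covers look roughly like the sketch below (a minimal outline for the root account on this single-node container; ssh-keygen and the ssh client are pulled in with openssh-server):

mkdir -p ~/.ssh && chmod 700 ~/.ssh
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa        # key pair with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost exit                              # should log in without a password prompt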

Install Hadoop

Install wget in the Docker image

apt-get install wget

If you are behind a proxy, you also need to set the proxy settings for wget:

vi ~/.wgetrc

Add the following lines at the end of this file.

use_proxy=yes
http_proxy=http://10.42.0.1:3333

Get the latest Hadoop .tar.gz link from the site below: click the binary link and copy the .tar.gz URL.
http://hadoop.apache.org/releases.html

Download hadoop tar.gz file from hadoop site,

mkdir -p ~/Downloads
cd ~/Downloads
wget http://apache.claz.org/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz

Extract the archive,

tar zxvf ~/Downloads/hadoop-2.8.1.tar.gz

Move the extracted Hadoop folder to /usr/local

mv ~/Downloads/hadoop-2.8.1 /usr/local/hadoop

Edit the .bashrc file and set the environment variables below; the Hadoop scripts and commands rely on them.

vi ~/.bashrc

Add the following lines at the end of ~/.bashrc and save the file

export JAVA_HOME=/usr
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_CLASSPATH=/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar

Reload the environment variables

source ~/.bashrc
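
To confirm the variables took effect, check that the hadoop command resolves from the new PATH:

echo $HADOOP_HOME     # should print /usr/local/hadoop
hadoop version        # should report Hadoop 2.8.1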

Setup Hadoop environment

vi $HADOOP_CONF_DIR/hadoop-env.sh

Add the following line to hadoop-env.sh

export JAVA_HOME=/usr
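
If you are unsure where default-jdk actually placed the JDK (for example, to double-check the HADOOP_CLASSPATH path set in ~/.bashrc), you can resolve it from the javac binary:

readlink -f "$(command -v javac)"    # the parent of the bin directory in this path is the JDK home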

Setup hadoop core site settings,

vi $HADOOP_CONF_DIR/core-site.xml

Add the following property inside the <configuration> element of core-site.xml. Make sure the NameNode address is correct; since this is a single-node setup you can simply use localhost (or a local IP such as 10.42.0.1).

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:54310</value>
  </property>

Setup hadoop yarn site settings

vi $HADOOP_CONF_DIR/yarn-site.xml

Add or update the following properties inside the <configuration> element,

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>



Set up the MapReduce settings, starting from the template provided with the installation.

mv $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml

Edit mapred-site.xml

vi $HADOOP_CONF_DIR/mapred-site.xml

Add or update the following properties inside the <configuration> element,

  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>localhost:54311</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>

Set the replication factor; it should be chosen according to the number of data nodes in the cluster.

vi $HADOOP_CONF_DIR/hdfs-site.xml

Add or update the following properties inside the <configuration> element to set the NameNode and DataNode directories,

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>

Create data directories,

mkdir -p $HADOOP_HOME/hadoop_data/hdfs/namenode
mkdir -p $HADOOP_HOME/hadoop_data/hdfs/datanode  

Create masters file,

touch $HADOOP_CONF_DIR/masters

vi $HADOOP_CONF_DIR/masters

Add the line below to this file; it is the IP or hostname of the master node. For a single-node setup it is just localhost.

localhost

Create Slaves file

vi $HADOOP_CONF_DIR/slaves

Add the line below to this file; it is the IP or hostname of the data nodes. For a single-node setup it is again just localhost.

localhost

All configuration is now complete. We need to format the NameNode.

hdfs namenode -format

Start HDFS, YARN and History server

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

Run the command below to check the status of the running daemons

jps

Output

1570 NameNode
2290 JobHistoryServer
2002 ResourceManager
2356 Jps
1774 SecondaryNameNode

Check the NameNode web UI below for HDFS status (the SecondaryNameNode UI runs on port 50090)
http://localhost:50070
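
Since dfs.webhdfs.enabled was set to true, you can also query the NameNode over the WebHDFS REST API as a quick sanity check (install curl with apt-get if it is not already present); this is a LISTSTATUS call on the root directory:

curl -i "http://localhost:50070/webhdfs/v1/?op=LISTSTATUS"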

Finally, run a few HDFS commands to make sure everything is working,

hadoop fs -ls /
hadoop fs -mkdir /test1
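
As an optional extra check, a quick write/read round trip confirms that the DataNode is actually storing blocks; the file and directory names here are just examples:

hadoop fs -put /etc/hosts /test1/hosts    # upload a small local file
hadoop fs -cat /test1/hosts               # read it back from HDFS
hadoop fs -ls /test1                      # confirm it is listed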

Back up the Docker container as an image and push it to a Docker repository

docker commit ubuntu_hadoop hadooptutorials/hadoop:hdp_single_node
docker login
docker push hadooptutorials/hadoop:hdp_single_node

Run container from new image

docker run --name ubuntu_hadoop --hostname=quickstart.hadoop --privileged=true -it -p 8888:8888 -p 7180:7180 -p 30022:22 -p 50070:50070 -p 50090:50090 hadooptutorials/hadoop:hdp_single_node

Start the Hadoop services from this container

source ~/.bashrc
service ssh start
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
jps

Going forward, we can create a start script that runs the commands above whenever we start the Docker container. This keeps the startup commands easy to maintain.

vi ~/start.sh

Copy the lines below into this file

#!/bin/bash
source ~/.bashrc
service ssh start
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
jps

Make start.sh executable

chmod +x ~/start.sh

The next time you run the container, just run start.sh to start Hadoop, as shown below.
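
For example, restarting the stopped container created earlier and bringing Hadoop up might look like this:

docker start -ai ubuntu_hadoop    # start and attach to the existing container
~/start.sh                        # then, inside the container shell, start Hadoop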

Similarly, create a stop.sh file to stop the services before exiting or stopping the container; it is safer to stop the services cleanly first.

vi ~/stop.sh

Add the lines below to this file

#!/bin/bash
source ~/.bashrc
$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver

Make it executable,

chmod +x ~/stop.sh
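
A typical shutdown then looks like this: run stop.sh inside the container, leave the shell, and stop the container from the host if it is still running:

~/stop.sh                    # inside the container: stop HDFS, YARN and the history server
exit                         # leave the container shell
docker stop ubuntu_hadoop    # on the host, if the container is still running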
