Part-3: Install Apache HIVE on Hadoop Cluster

October 11, 2020

In this part, we will discuss how to install Apache Hive on the Hadoop HDFS file system.

Hive is a SQL client layer that resides on top of the Hadoop file system. It lets you read and write data in HDFS using familiar SQL commands.

Download and copy Hive

I assume that you have followed instructions from Part-1 on how to install Hadoop on single node cluster. Make sure Hadoop is running.

Get the latest "hive-*-bin.tar.gz" download link from the Apache Hive site.

https://hive.apache.org/downloads.html

Make sure you download a Hive version that matches the major version of the Hadoop installation. For example, if you have installed Hadoop version 3.x.x, download Hive version 3.x.x.
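
If you are not sure which Hadoop version is installed, you can check it quickly before choosing a Hive release:

#The first line of the output shows the installed Hadoop version
hadoop version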

Log in to the Hadoop master node with SSH, go to the Downloads directory, download the latest Hive tar.gz file, and extract it.

cd ~/Downloads
wget http://apache.forsale.plus/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
tar -zxvf apache-hive-3.1.2-bin.tar.gz
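
Optionally, verify the integrity of the downloaded archive by comparing its hash with the SHA-256 checksum published on the Apache Hive download page for your version:

#Print the SHA-256 hash of the downloaded archive
sha256sum apache-hive-3.1.2-bin.tar.gz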

Create a new directory for Hive and move the extracted Hive folder into it.

sudo mkdir /usr/lib/hive
sudo mv apache-hive-3.1.2-bin /usr/lib/hive

Set up environment variables

Edit the ~/.bashrc file and set the Hive home path.

vi ~/.bashrc

Add the lines below at the end. Adjust the Hive path as per your Hive version.

#Set HIVE_HOME
export HIVE_HOME="/usr/lib/hive/apache-hive-3.1.2-bin"
PATH=$PATH:$HIVE_HOME/bin
export PATH

Reload the environment variables, or log out and log back in.

source ~/.bashrc
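
A quick check that the new variables are picked up:

#Should print the Hive installation path set above
echo $HIVE_HOME
#Should resolve to $HIVE_HOME/bin/hive if PATH is set correctly
which hive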

Configure Hive

Create the Hive warehouse directory and a temporary directory in HDFS to store data.

hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -mkdir -p /tmp

Set write permission for the group on the "warehouse" and temporary directories.

hadoop fs -chmod g+w /user/hive/warehouse
hadoop fs -chmod g+w /tmp
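
To confirm that both directories exist with the expected permissions, you can list them (a quick optional check):

#Both entries should show group write permission, e.g. drwxrwxr-x
hadoop fs -ls -d /user/hive/warehouse /tmp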

Edit Hive environment file.

cd $HIVE_HOME/conf
#Copy the environment template script
cp hive-env.sh.template hive-env.sh
#Make it executable
chmod +x hive-env.sh
#Edit environment script.
vi $HIVE_HOME/conf/hive-env.sh

Add the lines below at the end of the hive-env.sh file. Adjust the Hive, Hadoop, and Java paths as per your installation.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_HEAPSIZE=512
export HIVE_CONF_DIR=/usr/lib/hive/apache-hive-3.1.2-bin/conf
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export YARN_HOME=$HADOOP_YARN_HOME
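
If you are not sure what JAVA_HOME should be on your machine, the command below resolves the path of the active java binary; adjust the export above accordingly (strip the trailing /bin/java or /jre/bin/java part):

#Prints the full path of the java binary currently in use
readlink -f $(which java)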

Enable Logging (Optional)

To enable logging when starting Hive, rename the log4j template files as shown below. This step is optional, but it can be useful when you receive errors from Hive, as it provides additional debugging information.

cd $HIVE_HOME/conf
mv hive-log4j2.properties.template hive-log4j2.properties
mv hive-exec-log4j2.properties.template hive-exec-log4j2.properties

Change the log level to ERROR in these two files. Adjust the file names as per your version.

vi hive-log4j2.properties
vi hive-exec-log4j2.properties
#Change "status = INFO" line to "status = ERROR" in above 2 files

Add or update the classpath property in the mapred-site.xml file. This property may already be present with a correct value, but the default paths based on environment variables sometimes do not work, so also provide absolute paths that do not rely on environment variables.

vi $HADOOP_CONF_DIR/mapred-site.xml

Add or update the path values for the various jars in mapred-site.xml. Make sure the path values are separated by commas and that the value is not broken across multiple lines between the <value> and </value> tags.

<property>
  <name>mapreduce.application.classpath</name>
  <value>$HADOOP_CLIENT_CONF_DIR,$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*,$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH,/usr/local/hadoop/share/hadoop/mapreduce/*,/usr/local/hadoop/share/hadoop/mapreduce/lib/*,/usr/local/hadoop/share/hadoop/hdfs/*,/usr/local/hadoop/share/hadoop/hdfs/lib/*,/usr/local/hadoop/share/hadoop/common/lib/*,/usr/local/hadoop/share/hadoop/common/*,/usr/local/hadoop/share/hadoop/yarn/lib/*,/usr/local/hadoop/share/hadoop/yarn/*</value>
</property>
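
Before saving, it is worth confirming that the absolute /usr/local/hadoop paths in the value actually exist on your machine; adjust them if your Hadoop installation lives elsewhere. A quick check:

#Each of these directories should exist and contain the Hadoop jar files
ls -d /usr/local/hadoop/share/hadoop/{mapreduce,hdfs,common,yarn}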

Create a new file named "hive-site.xml". This file contains the Hive configuration.

vi $HIVE_HOME/conf/hive-site.xml

Add the metastore connection URL in this file. Adjust the Hive path as per your version. The default metastore for Hive is an embedded Derby database.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:/usr/lib/hive/apache-hive-3.1.2-bin/conf/metastore_db;create=true</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>
</configuration>  
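
A malformed hive-site.xml leads to confusing startup errors, so it can help to verify that the file is well-formed XML. This check assumes the xmllint tool is available on your system:

#Prints nothing if the XML is well-formed, otherwise reports the parse error
xmllint --noout $HIVE_HOME/conf/hive-site.xml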

Remove the metastore directory if it already exists; it may be left over from a previous installation.

rm -rf /usr/lib/hive/apache-hive-3.1.2-bin/conf/metastore_db

Initialize the Hive metastore schema for the Derby database using the command below.

$HIVE_HOME/bin/schematool -initSchema -dbType derby

On running the above command, you may see the error below.

Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
        at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
        at org.apache.hadoop.mapred.JobConf.setJar(JobConf.java:518)
        at org.apache.hadoop.mapred.JobConf.setJarByClass(JobConf.java:536)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:430)
        at org.apache.hadoop.hive.conf.HiveConf.initialize(HiveConf.java:5141)
        at org.apache.hadoop.hive.conf.HiveConf.<init>(HiveConf.java:5104)
        at org.apache.hive.beeline.HiveSchemaTool.<init>(HiveSchemaTool.java:96)
        at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:1473)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:232)

This error is caused by a bug tracked in the Hive issue HIVE-22718.
As mentioned in the comments there, it can be solved by replacing the guava-19.0.jar shipped in "$HIVE_HOME/lib" with Hadoop's guava-27.0-jre.jar, found under "$HADOOP_HOME/share/hadoop/hdfs/lib" (the same jar is also present under "$HADOOP_HOME/share/hadoop/common/lib", which is the path used in the copy command below).

cd $HIVE_HOME/lib
#Remove old guava-*.jar file
ls guava-*.jar
rm guava-*.jar
#Copy new jar
cp /usr/local/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar .
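
You can confirm that only the newer guava jar is now present under Hive's lib directory:

#Should list only guava-27.0-jre.jar
ls $HIVE_HOME/lib/guava-*.jar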

Try initializing the schema again. This time it should work.

$HIVE_HOME/bin/schematool -initSchema -dbType derby

Run the hive command and you should see a "hive>" prompt similar to the one below.

hadoop_user@hadoop-master:~$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hive/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = b11e9a95-5511-41a0-81ff-94cd5779fa4b

Logging initialized using configuration in file:/usr/lib/hive/apache-hive-3.1.2-bin/conf/hive-log4j2.properties Async: true
Hive Session ID = 3d5547c6-5988-4090-9d90-c730b642a84c
hive> 

Hive installation is now complete.

Test Hive

Run the command below to list the tables in Hive. It returns no tables because this is a fresh installation.

hive> show tables;

You may encounter the error below when running the above command.

FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

This error may happen if the Derby database is corrupted. If you face the above error, quit Hive and run the command below to fix it.

$HIVE_HOME/bin/schematool -upgradeSchema -dbType derby
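
If the upgrade does not clear the error, another option is to recreate the Derby metastore from scratch. Note that this wipes any existing Hive metadata, so it is only suitable on a fresh installation like this one.

#Remove the Derby metastore directory and initialize the schema again
rm -rf /usr/lib/hive/apache-hive-3.1.2-bin/conf/metastore_db
$HIVE_HOME/bin/schematool -initSchema -dbType derby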

Now let's create our first Hive table with the command below at the hive prompt.

hive> create table employee (id string, name string, dept string);
OK
Time taken: 0.617 seconds

Insert a record into this table. When inserting a new record, Hive generates a MapReduce job to perform the insert.

hive> insert into employee values("1","Allen","IT");
Query ID = hadoop_user_20201012001907_ee2aa735-b6b2-406f-9f28-e3524a8151f1
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1602303176337_0002, Tracking URL = http://ip6-loopback:5349/proxy/application_1602303176337_0002/
Kill Command = /usr/local/hadoop/bin/mapred job  -kill job_1602303176337_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-10-12 00:19:15,771 Stage-1 map = 0%,  reduce = 0%
2020-10-12 00:19:21,000 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.21 sec
2020-10-12 00:19:26,253 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.67 sec
MapReduce Total cumulative CPU time: 3 seconds 670 msec
Ended Job = job_1602303176337_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://0.0.0.0:54310/user/hive/warehouse/employee/.hive-staging_hive_2020-10-12_00-19-07_393_6656227356869056987-1/-ext-10000
Loading data to table default.employee
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.67 sec   HDFS Read: 15823 HDFS Write: 284 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 670 msec
OK
Time taken: 20.94 seconds

Let's look at the inserted record.

hive> select * from employee;
OK
1       Allen   IT
Time taken: 0.249 seconds, Fetched: 1 row(s)

The newer Hive CLI is called Beeline, and it can be started with the command below.

$HIVE_HOME/bin/beeline -u jdbc:hive2://
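
As a quick usage example, Beeline can also run a single statement non-interactively with the -e option:

#Connect to the embedded HiveServer2 and run one query
$HIVE_HOME/bin/beeline -u jdbc:hive2:// -e "show tables;"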

Running every query as a full MapReduce job makes Hive queries very slow. To run Hive queries faster, we can use local MapReduce mode, which can be enabled with the command below.

hive> SET hive.exec.mode.local.auto=true;

Note that this feature is disabled by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied (an example of setting these properties follows the list):

  • The total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default)
  • The total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default)
  • The total number of reduce tasks required is 1 or 0.
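
The thresholds themselves can be adjusted from the hive prompt in the same way. The values in the sketch below simply restate the defaults listed above; tune them as needed.

hive> SET hive.exec.mode.local.auto=true;
hive> SET hive.exec.mode.local.auto.inputbytes.max=134217728;
hive> SET hive.exec.mode.local.auto.tasks.max=4;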

Troubleshooting

When running an insert query, Hive may get stuck on the MapReduce job for a long time and never finish, as shown below.

hive> insert into employee values("2","Dana","Sales");
Query ID = hadoop_user_20200917220704_9b75fcb8-5e3f-48d0-acc7-f365ba0a6ff6
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1600379673028_0002, Tracking URL = http://localhost:8088/proxy/application_1600379673028_0002/
Kill Command = /usr/local/hadoop/bin/mapred job  -kill job_1600379673028_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-09-17 22:07:16,479 Stage-1 map = 0%,  reduce = 0%
2020-09-17 22:08:17,055 Stage-1 map = 0%,  reduce = 0%
2020-09-17 22:09:17,411 Stage-1 map = 0%,  reduce = 0%
2020-09-17 22:10:17,680 Stage-1 map = 0%,  reduce = 0%
2020-09-17 22:11:17,973 Stage-1 map = 0%,  reduce = 0%
2020-09-17 22:12:18,208 Stage-1 map = 0%,  reduce = 0%
2020-09-17 22:13:18,356 Stage-1 map = 0%,  reduce = 0%
2020-09-17 22:14:18,466 Stage-1 map = 0%,  reduce = 0%
2020-09-17 22:15:18,558 Stage-1 map = 0%,  reduce = 0%
2020-09-17 22:16:18,651 Stage-1 map = 0%,  reduce = 0%
2020-09-17 22:17:18,810 Stage-1 map = 0%,  reduce = 0%

This may happen if the job tracking URL points to an IP address that is not in service. In the above example, Hive looks for the tracking URL at http://localhost:8088, but the actual URL is http://<internal ip>:8088. To fix this, point localhost to the <internal ip address> of the VM by editing the /etc/hosts file.
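
As a sketch, you can look up the node's internal address and map the hostname used in the tracking URL to it. The 192.168.1.10 address below is only a placeholder; replace it with the address reported on your machine.

#Print the internal IP address of the master node
hostname -I
#Edit /etc/hosts (as root) and point localhost to that address, e.g.:
#   192.168.1.10   localhost
sudo vi /etc/hosts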
