Part 5: Using Spark as the execution engine for Hive

In this tutorial I will demonstrate how to use Spark as the execution engine for Hive. MapReduce is Hive's default execution engine, but it is usually quite slow. Spark is a much faster engine for running Hive queries.

I assume you already have running Hive and Spark installations. If not, follow the instructions in the articles below to install them.
http://hadooptutorials.info/2017/09/15/part-2-install-hive/
http://hadooptutorials.info/2017/09/18/part-4-install-spark/

Make sure the environment variables below exist in your ~/.bashrc file. JAVA_HOME should point to your Java installation directory.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export JRE_HOME=$JAVA_HOME/jre
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_CLASSPATH=/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME

#HIVE
export HIVE_HOME=/usr/lib/hive/apache-hive-2.3.0-bin
PATH=$PATH:$HIVE_HOME/bin
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH

#SPARK
export SPARK_HOME=/usr/lib/spark/spark-2.2.0-bin-hadoop2.7
PATH=$PATH:$SPARK_HOME/bin
export PATH

Reload the environment variables

source ~/.bashrc
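
As a quick sanity check, you can print a few of the variables to confirm they were picked up; the output should match your own installation paths:

echo $JAVA_HOME
echo $HIVE_HOME
echo $SPARK_HOME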

Link the Scala and Spark jars into the Hive lib folder

cd $HIVE_HOME/lib
ln -s $SPARK_HOME/jars/scala-library*.jar
ln -s $SPARK_HOME/jars/spark-core*.jar
ln -s $SPARK_HOME/jars/spark-network-common*.jar
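
To verify the links were created, you can list them; each should show up as a symlink pointing into $SPARK_HOME/jars:

ls -l $HIVE_HOME/lib | grep -E 'scala-library|spark-core|spark-network-common'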

Add the configurations below to hive-site.xml to use the Spark execution engine

vi $HIVE_HOME/conf/hive-site.xml
<property>
    <name>hive.execution.engine</name>
    <value>spark</value>
    <description>Use Spark as the execution engine</description>
</property>
<property>
    <name>spark.master</name>
    <value>spark://localhost:7077</value>
</property>
<property>
    <name>spark.eventLog.enabled</name>
    <value>true</value>
</property>
<property>
    <name>spark.eventLog.dir</name>
    <value>/tmp</value>
</property>
<property>
    <name>spark.serializer</name>
    <value>org.apache.spark.serializer.KryoSerializer</value>
</property>
<property>
    <name>spark.yarn.jars</name>
    <value>hdfs://localhost:54310/spark-jars/*</value>
</property>
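
You can confirm the new setting from the Hive CLI without running a job; the command below simply prints the current value and should report hive.execution.engine=spark:

hive -e "SET hive.execution.engine;"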

Make sure the properties below exist in yarn-site.xml; if not, add them. These jar paths are needed when using Spark as the execution engine for Hive. I had to use absolute paths instead of environment variables in this configuration, because for some reason the environment variables did not work. Make sure these paths point to your Hadoop installation directories.

vi $HADOOP_CONF_DIR/yarn-site.xml
<property>
    <name>yarn.application.classpath</name>
    <value>/usr/local/hadoop/share/hadoop/mapreduce/*,/usr/local/hadoop/share/hadoop/mapreduce/lib/*,/usr/local/hadoop/share/hadoop/hdfs/*,/usr/local/hadoop/share/hadoop/hdfs/lib/*,/usr/local/hadoop/share/hadoop/common/lib/*,/usr/local/hadoop/share/hadoop/common/*,/usr/local/hadoop/share/hadoop/yarn/lib/*,/usr/local/hadoop/share/hadoop/yarn/*</value>
</property>
<property>
    <name>mapreduce.application.classpath</name>
    <value>/usr/local/hadoop/share/hadoop/mapreduce/*,/usr/local/hadoop/share/hadoop/mapreduce/lib/*,/usr/local/hadoop/share/hadoop/hdfs/*,/usr/local/hadoop/share/hadoop/hdfs/lib/*,/usr/local/hadoop/share/hadoop/common/lib/*,/usr/local/hadoop/share/hadoop/common/*,/usr/local/hadoop/share/hadoop/yarn/lib/*,/usr/local/hadoop/share/hadoop/yarn/*</value>
</property>
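
Restart YARN after editing yarn-site.xml so the new classpath takes effect. If you want to inspect the classpath YARN actually resolves, the yarn classpath command prints it:

stop-yarn.sh
start-yarn.sh
yarn classpath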

Remove the old Hive jars from the Spark jars folder. Adjust this step to match the version of the Hive jars shipped in your Spark folder; you can determine the version by listing the contents of $SPARK_HOME/jars with the command below.

ls $SPARK_HOME/jars/*hive*.jar

In my case those jars had version 1.2.1, so I removed them with the command below.

rm $SPARK_HOME/jars/*hive*1.2.1*

Run the command below to copy the new Hive jars into the Spark jars folder. These jars are necessary to run Hive on the new Spark engine.

cp $HIVE_HOME/lib/*hive*.jar $SPARK_HOME/jars/
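
Listing the folder again should now show only the new jars, at the version of your Hive installation (2.3.0 in my setup), with the old 1.2.1 jars gone:

ls $SPARK_HOME/jars/*hive*.jar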

Run the commands below to copy the Spark jars to the /spark-jars folder on HDFS. This is the location referenced by spark.yarn.jars in hive-site.xml.

hadoop fs -mkdir /spark-jars
hadoop fs -put $SPARK_HOME/jars/*.jar /spark-jars/
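
You can verify the upload by listing the HDFS folder; note that the path must match the spark.yarn.jars value configured in hive-site.xml earlier:

hadoop fs -ls /spark-jars | head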

All configuration is now complete. Run Hive and try inserting a new record into a table; you should see the Spark engine being used during execution.
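
As a minimal test, a session like the one below should work (test_spark is just a throwaway table name for this check). While the INSERT statement runs, the console output should reference Spark job and stage progress rather than MapReduce jobs:

hive
CREATE TABLE IF NOT EXISTS test_spark (id INT, name STRING);
INSERT INTO test_spark VALUES (1, 'hello spark');
SELECT * FROM test_spark;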
