Part-4: How to Install Spark


In this tutorial we will discuss how to install Spark on an Ubuntu VM. Spark does not have a hard dependency on Hadoop or other tools, but if you are planning to use Spark with Hadoop, you should first follow my Part-1, Part-2, and Part-3 tutorials, which cover the installation of Hadoop and Hive.

Install Java and Scala

To install Spark, first ensure that Java is installed. Run the command “java -version” to check the installed version.

java -version
Output
------
openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~16.04-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)

If you don’t have Java installed, first install it using the command below.

sudo apt-get install default-jdk
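
Some Spark setups also expect JAVA_HOME to be set. If you later see a “JAVA_HOME is not set” error, you can export it in ~/.bashrc. The path below is an assumption for Ubuntu’s default-jdk package, so verify it on your own system first.

# Assumed path for Ubuntu’s default-jdk package;
# confirm with: readlink -f $(which java)
export JAVA_HOME=/usr/lib/jvm/default-java
export PATH=$PATH:$JAVA_HOME/bin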

For Spark to work, Scala also needs to be installed. Install it using the command below.

sudo apt-get install scala

Once installed, type “scala” in the terminal and you should see a Scala prompt like the one below.

Welcome to Scala version 2.11.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144).
Type in expressions to have them evaluated.
Type :help for more information.
scala>:q

Quit the Scala prompt with the “:q” command.
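
As an optional, non-interactive sanity check, you can also evaluate a one-line expression directly from the terminal.

scala -e 'println(21 * 2)'
Output
------
42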

Install Spark

Visit the Apache Spark link below to get the download link for a pre-built version of Spark. Depending on your needs, you may want to download a particular version. In my case, I want to install Spark version 2.3.0 because it’s compatible with my Hive version 3.1.4; I will be using Spark as the execution engine for Hive later.

https://spark.apache.org/downloads.html

Older versions of Spark (2.3.0 or earlier) can be downloaded from the link below.

https://archive.apache.org/dist/spark/

Go to the Downloads directory, download Spark, and extract the .tgz archive.

cd ~/Downloads
wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
tar -zxvf spark-2.3.0-bin-hadoop2.7.tgz
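
Optionally, you can compute the SHA-512 checksum of the downloaded archive and compare it manually against the value published alongside the release on the Apache archive site.

# Compare the printed hash with the checksum published on archive.apache.org
sha512sum spark-2.3.0-bin-hadoop2.7.tgz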

Move the extracted Spark folder to /usr/lib/spark/.

sudo mkdir /usr/lib/spark
sudo mv ~/Downloads/spark-2.3.0-bin-hadoop2.7 /usr/lib/spark
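
As a quick sanity check, list the directory to confirm the standard Spark layout (bin, sbin, conf, jars) is in place.

ls /usr/lib/spark/spark-2.3.0-bin-hadoop2.7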

Set the Spark environment variables in the ~/.bashrc file.

vi ~/.bashrc

Add the lines below at the end of the ~/.bashrc file.

#Set SPARK home
export SPARK_HOME=/usr/lib/spark/spark-2.3.0-bin-hadoop2.7
PATH=$PATH:$SPARK_HOME/bin
export PATH
export SPARK_MASTER_HOST=localhost
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=7180

Save and exit the ~/.bashrc file, then source it to reload the environment variables.

source ~/.bashrc
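
You can confirm that the variables were picked up and that the Spark binaries are on your PATH.

echo $SPARK_HOME
spark-submit --version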

The Spark installation is now complete.

Start Spark services

To start the Spark master and worker services, use the command below.

$SPARK_HOME/sbin/start-all.sh
Output
------
starting org.apache.spark.deploy.master.Master, logging to /usr/lib/spark/spark-2.3.0-bin-hadoop2.7/logs/spark-hadoop_user-org.apache.spark.deploy.master.Master-1-hadoop-master.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /usr/lib/spark/spark-2.3.0-bin-hadoop2.7/logs/spark-hadoop_user-org.apache.spark.deploy.worker.Worker-1-hadoop-master.out
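
To verify that both services are running, list the Java daemon processes with jps; you should see a Master and a Worker entry. (To stop the services later, run $SPARK_HOME/sbin/stop-all.sh.)

jps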

You can access the Spark Web UI using this URL -> http://[Server IP]:7180 (the port we set with SPARK_MASTER_WEBUI_PORT above).

You should see a webpage like the one below.

[Screenshot: Spark Web UI]
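
As a final smoke test, you can attach a Spark shell to the running master and evaluate a small job. The sum below is just an arbitrary example computation; the master URL matches the SPARK_MASTER_HOST and SPARK_MASTER_PORT values we set earlier.

$SPARK_HOME/bin/spark-shell --master spark://localhost:7077
scala> sc.parallelize(1 to 100).sum()
res0: Double = 5050.0
scala> :q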

If you are using a Google Cloud VM, you need to open port 7180 in order to access the Spark Web UI; by default, inbound ports are blocked on Google Cloud VMs. The firewall can be opened by going to Google Cloud Console Menu -> VPC Network -> Firewall -> Create Firewall Rule.
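
If you prefer the command line, an equivalent firewall rule can be created with the gcloud CLI. The rule name below is a hypothetical placeholder, and in practice you should restrict --source-ranges to your own IP instead of opening the port to the world.

# “allow-spark-webui” is a hypothetical rule name;
# replace YOUR_IP with your own public IP
gcloud compute firewall-rules create allow-spark-webui \
    --allow=tcp:7180 \
    --source-ranges=YOUR_IP/32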
