In this tutorial we will discuss how to install Spark on an Ubuntu VM. Spark does not have a hard dependency on Hadoop or other tools, but if you are planning to use Spark with Hadoop, you should first follow my Part-1, Part-2 and Part-3 tutorials, which cover the installation of Hadoop and Hive.
Install Java and Scala
To install Spark, first ensure you have Java installed. Run the command “java -version” to check the installed version.
java -version

Output
------
openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~16.04-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)
If you don’t have Java installed, install it first using the command below.
sudo apt-get install default-jdk
For Spark to work, Scala also needs to be installed. Install it using the command below.
sudo apt-get install scala
Once installed, type “scala” in the terminal and you should see a Scala prompt like the one below.
Welcome to Scala version 2.11.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144).
Type in expressions to have them evaluated.
Type :help for more information.

scala> :q
Quit the Scala prompt with the “:q” command.
Visit the Apache Spark link below and get the download link for a pre-built version of Spark. Depending on your needs, you may want to download a particular version. In my case, I want to install Spark version 2.3.0 because it is compatible with my Hive version 3.1.4; I will be using Spark as the execution engine for Hive later.
Older versions of Spark (2.3.0 or earlier) can be downloaded from the archive link below.
Go to the Downloads directory, download Spark, and extract the .tar.gz file.
cd ~/Downloads
wget https://archive.apache.org/dist/spark/spark-2.3.0/spark-2.3.0-bin-hadoop2.7.tgz
tar -zxvf spark-2.3.0-bin-hadoop2.7.tgz
Move the extracted Spark folder to /usr/lib/spark/.
sudo mkdir /usr/lib/spark
sudo mv ~/Downloads/spark-2.3.0-bin-hadoop2.7 /usr/lib/spark
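To confirm the move worked, you can list the new directory (the path below assumes the Spark 2.3.0 build used in this tutorial):

```shell
# List the Spark installation directory; you should see bin/, sbin/, conf/, jars/, etc.
ls /usr/lib/spark/spark-2.3.0-bin-hadoop2.7
```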
Set the Spark environment variables in the ~/.bashrc file.
Add the lines below at the end of ~/.bashrc.
#Set SPARK home
export SPARK_HOME=/usr/lib/spark/spark-2.3.0-bin-hadoop2.7
PATH=$PATH:$SPARK_HOME/bin
export PATH
export SPARK_MASTER_HOST=localhost
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=7180
Save and exit the ~/.bashrc file, then run “source ~/.bashrc” to reload the environment variables.
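After sourcing the file, a quick sanity check (assuming the paths set above) confirms the variables are in place and the Spark binaries are on the PATH:

```shell
# Reload the shell configuration
source ~/.bashrc

# Verify the environment variable is set; should print the Spark install path
echo $SPARK_HOME

# spark-submit is now resolvable from any directory; print the Spark version
spark-submit --version
```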
Spark installation is now complete.
Start Spark services
To start the Spark master and the other Spark services, use the command below.
$SPARK_HOME/sbin/start-all.sh

Output
------
starting org.apache.spark.deploy.master.Master, logging to /usr/lib/spark/spark-2.3.0-bin-hadoop2.7/logs/spark-hadoop_user-org.apache.spark.deploy.master.Master-1-hadoop-master.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /usr/lib/spark/spark-2.3.0-bin-hadoop2.7/logs/spark-hadoop_user-org.apache.spark.deploy.worker.Worker-1-hadoop-master.out
You can access the Spark Web UI using this URL -> http://[Server IP]:7180
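Besides the Web UI, you can verify from the terminal that the master and worker JVMs are running. jps ships with the JDK and lists running Java processes; the process IDs below are illustrative and will differ on your machine:

```shell
# List running JVMs; a healthy standalone cluster shows one Master and one Worker, e.g.:
#   4325 Master
#   4487 Worker
#   4612 Jps
jps
```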
You should see a webpage like the one below.
If you are using a Google Cloud VM, you need to open port 7180 in order to access the Spark Web UI, since all ports are blocked by default on Google Cloud VMs. The firewall can be opened by going to Google Cloud Console Menu -> VPC Network -> Firewall -> Create Firewall.
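If you prefer the command line, the same firewall rule can be created with the gcloud CLI. The rule name spark-webui and the open-to-everyone source range below are my own choices for illustration; tighten --source-ranges for anything beyond a throwaway test VM:

```shell
# Allow inbound TCP traffic on the Spark Web UI port (7180).
# NOTE: 0.0.0.0/0 exposes the UI to the whole internet; restrict the range in practice.
gcloud compute firewall-rules create spark-webui \
  --direction=INGRESS \
  --allow=tcp:7180 \
  --source-ranges=0.0.0.0/0
```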