Building Spark Application JAR using Scala and SBT

We normally create a Spark application JAR using Scala and SBT (Scala Build Tool). In my previous post on Creating a Multi-node Spark Cluster, we executed a word count example using the Spark shell. As an extension to that, in this post we'll learn how to create a Spark application JAR file with Scala and SBT, and how to execute it as a Spark job on a Spark cluster.

So Let’s Begin…..!!!!!

Step 1: Installing SBT on CentOS

Installing SBT on CentOS is very straightforward. First we'll download the SBT tarball on the master node, and then we'll add its bin directory to the $PATH environment variable. Before that, have you installed Scala on the master node yet? No? Then do that first. Here is a complete post on Installing Scala on CentOS.

Once you are done with Scala Installation, let’s move forward with SBT Installation steps.
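Before moving on, you can quickly confirm that Scala is on your PATH (the exact version string will depend on your installation):

## Quick check that Scala is available (your version may differ)
$ scala -version
Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL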

Download SBT
$ cd /opt/
$ wget https://dl.bintray.com/sbt/native-packages/sbt/0.13.9/sbt-0.13.9.tgz

Extract the tarball and set the $PATH variable

## Extracting the tarball
$ tar -xvf sbt-0.13.9.tgz

## Set PATH variable for all users
$ echo "" >> /etc/profile
$ echo "## Setting SBT for all USERS ##" >> /etc/profile
$ echo "export PATH=\$PATH:/opt/sbt/bin" >> /etc/profile
$ source /etc/profile
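To verify the installation, run SBT once. Note that the first run downloads SBT's own dependencies, so it can take a few minutes:

## Verify the SBT installation (the first run downloads SBT's dependencies)
$ sbt sbtVersion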

Step 2: Creating a Word Count Spark Application template

Now create the Scala WordCount project directory structure under the /data directory using the commands below.

$ mkdir -p /data/WordCount/src/main/scala/com/backtobazics/spark/wordcount/
$ cd /data/WordCount/src/main/scala/com/backtobazics/spark/wordcount
$ vi WordCount.scala

Then add the content below to the WordCount.scala file. This is the same Scala code that we executed using spark-shell in the Spark cluster setup post.

package com.backtobazics.spark.wordcount

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object ScalaWordCount {
  def main(args: Array[String]) {
    val logFile = "hdfs://master.backtobazics.com:9000/user/root/sample.txt"
    val sparkConf = new SparkConf().setAppName("Spark Word Count")
    val sc = new SparkContext(sparkConf)

    // Read the input file, split each line into words,
    // map every word to a (word, 1) pair and sum the counts per word
    val file = sc.textFile(logFile)
    val counts = file.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    // Write the (word, count) pairs back to HDFS
    counts.saveAsTextFile("hdfs://master.backtobazics.com:9000/user/root/output")

    sc.stop()
  }
}
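If you want to reuse the same JAR for different files, a small optional variation (a sketch, not part of the original flow of this post) is to take the input and output paths from the command line instead of hardcoding them:

package com.backtobazics.spark.wordcount

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

// Hypothetical variation: paths come from args, e.g.
// spark-submit ... scalawordcount_2.11-1.0.jar <input-path> <output-path>
object ScalaWordCountArgs {
  def main(args: Array[String]) {
    val Array(input, output) = args
    val sc = new SparkContext(new SparkConf().setAppName("Spark Word Count"))
    sc.textFile(input)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(output)
    sc.stop()
  }
}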

Now go to the /data/WordCount/ directory and verify the project structure. It should look like this:

$ find /data/WordCount
/data/WordCount
/data/WordCount/src
/data/WordCount/src/main
/data/WordCount/src/main/scala
/data/WordCount/src/main/scala/com
/data/WordCount/src/main/scala/com/backtobazics
/data/WordCount/src/main/scala/com/backtobazics/spark
/data/WordCount/src/main/scala/com/backtobazics/spark/wordcount
/data/WordCount/src/main/scala/com/backtobazics/spark/wordcount/WordCount.scala

Step 3: Building Spark Application JAR using SBT

In order to build the Spark application JAR file, create a WordCount.sbt file under the /data/WordCount/ directory with the content below. Make sure the Scala and Spark library versions match your cluster: Spark artifacts for Scala 2.11 are only published from Spark 1.2.0 onward, so pair scalaVersion 2.11.x with a matching Spark version (or use Scala 2.10.x with Spark 1.0.x).

name := "ScalaWordCount"
 
version := "1.0"
 
scalaVersion := "2.11.7"
 
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.1"
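Since spark-submit already provides Spark's classes on the cluster at runtime, it is common to mark the dependency as provided. With plain sbt package this makes no practical difference (package does not bundle dependencies), but it keeps the JAR lean if you later switch to sbt-assembly:

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.1" % "provided"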

Now it's time to build the JAR file of our word count program. Execute the command below in the /data/WordCount/ directory where you created the WordCount.sbt file.

$ cd /data/WordCount/
$ sbt package
[info] Set current project to ScalaWordCount (in build file:/data/WordCount/)
[info] Compiling 1 Scala source to /data/WordCount/target/scala-2.11/classes...
[info] Packaging /data/WordCount/target/scala-2.11/scalawordcount_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 25 s, completed Oct 21, 2015 8:59:41 AM

It will take some time to download the dependency JARs the first time. SBT keeps all of its downloaded dependencies in the ~/.ivy2/cache/ directory.
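You can also confirm that the compiled class actually landed in the JAR (listing abbreviated; you will also see some compiler-generated classes):

$ jar tf /data/WordCount/target/scala-2.11/scalawordcount_2.11-1.0.jar
com/backtobazics/spark/wordcount/ScalaWordCount.class
...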

Step 4: Submitting the Spark Job to the YARN Cluster

You can submit your Spark job to the YARN cluster using the syntax below.

$ /opt/spark-1.0.1/bin/spark-submit \
     --class [my.main.Class] \
     --master yarn-cluster [application jar] 

In our case, the actual command will be:

$ /opt/spark-1.0.1/bin/spark-submit \
      --class com.backtobazics.spark.wordcount.ScalaWordCount \
      --master yarn-cluster /data/WordCount/target/scala-2.11/scalawordcount_2.11-1.0.jar

Note: Before you execute the Spark job, make sure you have put the sample.txt file in the /user/root/ HDFS directory and deleted /user/root/output on HDFS (saveAsTextFile fails if the output directory already exists).
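Assuming the Hadoop client binaries are on your PATH, the preparation looks like this (using the same paths as in this post):

## Upload the input file and clear any previous output
$ hdfs dfs -put sample.txt /user/root/
$ hdfs dfs -rm -r /user/root/output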

Wait for the job to complete, and meanwhile monitor it from the YARN ResourceManager Web UI at http://master.backtobazics.com:8088/ (port 8088 is the ResourceManager UI; in yarn-cluster mode the Spark UI is reachable through the running application's ApplicationMaster link there).

After your job is completed, check your output in the /user/root/output HDFS directory.
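For example (assuming the same paths as above):

## Each line of the output is a (word, count) tuple, e.g. (spark,3)
$ hdfs dfs -cat /user/root/output/part-00000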

Write your comments below if you like this post…..!!!!!

References:

https://spark.apache.org/docs/1.1.0/submitting-applications.html
