Building Spark Application JAR using Scala and SBT

Normally we create a Spark application JAR using Scala and SBT (Simple Build Tool). In my previous post on Creating a Multi-Node Spark Cluster, we executed a word count example using the Spark shell. As an extension to that, we'll learn how to create a Spark application JAR file with Scala and SBT, and how to execute it as a Spark job on the Spark cluster.

So Let’s Begin…..!!!!!

Step 1: Installing SBT on CentOS

Installing SBT on CentOS is very straightforward. First we'll download the SBT zip on the master node, and then we'll add the SBT bin path to the $PATH environment variable. Before that, have you installed Scala on the master node yet? No? Then do it first. Here is a complete post on Installing Scala on CentOS.

Once you are done with the Scala installation, let's move forward with the SBT installation steps.

Download SBT
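A minimal sketch, assuming SBT 0.13.x (the exact version and download URL below are assumptions; pick the release that matches your setup):

    cd /opt
    wget https://github.com/sbt/sbt/releases/download/v0.13.18/sbt-0.13.18.zip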

Extract the zip file and set the $PATH variable
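For example, assuming the zip was downloaded to /opt (the archive extracts to an sbt/ directory):

    cd /opt
    unzip sbt-0.13.18.zip

    # make the sbt launcher available on the PATH
    echo 'export PATH=$PATH:/opt/sbt/bin' >> ~/.bashrc
    source ~/.bashrc

    # verify the installation
    sbt sbtVersion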

Step 2: Creating a Word Count Spark Application Template

Now create the Scala WordCount project directory structure under the /data directory using the commands below.
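SBT expects sources under src/main/scala, so a minimal layout looks like this:

    mkdir -p /data/WordCount/src/main/scala
    cd /data/WordCount/src/main/scala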

Then add the below content to the WordCount.scala file. This is the same Scala code which we executed using spark-shell in the Spark cluster setup post.
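The listing below is a sketch of that word count as a standalone application (the hard-coded HDFS paths are assumptions that match the note in Step 4; adjust them to your environment):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair RDD implicits, needed on Spark 1.x

    object WordCount {
      def main(args: Array[String]) {
        val conf = new SparkConf().setAppName("Spark Word Count")
        val sc = new SparkContext(conf)

        // read the input file from HDFS and split each line into words
        val words = sc.textFile("/user/root/sample.txt")
                      .flatMap(line => line.split(" "))

        // count the occurrences of each word
        val counts = words.map(word => (word, 1))
                          .reduceByKey(_ + _)

        // write the (word, count) pairs back to HDFS
        counts.saveAsTextFile("/user/root/output")

        sc.stop()
      }
    }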

Now go to the /data/WordCount/ directory and verify the project structure. It should be the same as below.
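At this point the tree looks like this (WordCount.sbt is added in the next step):

    /data/WordCount/
    └── src/
        └── main/
            └── scala/
                └── WordCount.scala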

Step 3: Building Spark Application JAR using SBT

In order to build the Spark application JAR file, create a WordCount.sbt file under the /data/WordCount/ directory with the below content. Make sure that you put the proper versions of the Scala and Spark libraries in this file.
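A minimal sketch, assuming Scala 2.10.4 and Spark 1.1.0 (the release this post's references point to; substitute the versions running on your cluster). Note that %% appends the Scala version suffix to the artifact name, and older SBT releases require the blank lines between settings:

    name := "Word Count"

    version := "1.0"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"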

Now it's time to build the JAR file of our word count program. Execute the below command under the /data/WordCount/ directory where you have created the WordCount.sbt file.
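For example:

    cd /data/WordCount
    sbt package

If the build succeeds, the JAR is written under target/scala-2.10/ with a name such as word-count_2.10-1.0.jar, derived from the name, Scala version, and version set in WordCount.sbt.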

It will take some time to download the dependency JARs the first time. SBT will put all of its dependency classes in the ~/.ivy2/cache/ directory.

Step 4: Submitting the Spark Job to the YARN Cluster

You can submit your Spark job to the YARN cluster using the syntax below.
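This is the general form given in the Spark documentation linked in the references:

    spark-submit \
      --class <main-class> \
      --master <master-url> \
      --deploy-mode <deploy-mode> \
      <application-jar> \
      [application-arguments]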

In our case, the actual command looks like the one below.
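This assumes the JAR name produced by the Step 3 build; on Spark 1.x, --master yarn-cluster runs the driver inside the YARN cluster:

    spark-submit \
      --class WordCount \
      --master yarn-cluster \
      /data/WordCount/target/scala-2.10/word-count_2.10-1.0.jar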

Note: Before you execute the Spark job, make sure you have put the sample.txt file in the /user/root/ HDFS directory and deleted /user/root/output on HDFS.
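For example:

    # upload the input file and clear any previous output
    hadoop fs -put sample.txt /user/root/
    hadoop fs -rm -r /user/root/output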

Wait for the job to complete; meanwhile you can monitor it from the YARN ResourceManager web UI at http://master.backtobazics.com:8088/

After your job is completed, check your output in the /user/root/output HDFS directory.
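You can list and print it with the standard HDFS commands:

    hadoop fs -ls /user/root/output
    hadoop fs -cat /user/root/output/part-*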

Write your comments below if you like this post…..!!!!!

References:

https://spark.apache.org/docs/1.1.0/submitting-applications.html

