6 Steps to Setup Apache Spark 1.0.1 (Multi Node Cluster) on CentOS

Before we move ahead, let's learn a bit about Apache Spark.

So, What is Apache Spark?

Apache Spark is a fast, real-time and extremely expressive computing system which executes jobs in a distributed (clustered) environment.

It is quite compatible with Apache Hadoop, almost 10x faster than Hadoop MapReduce for on-disk computing and up to 100x faster for in-memory computations. It provides rich APIs in Java, Scala and Python along with functional programming capabilities.

This post will give you a clear idea of how to set up a Spark multi-node cluster on CentOS with Hadoop and YARN.

Before moving forward, I assume that you already know how to install Java 7 and Apache Hadoop with YARN on a CentOS cluster.

Step 1. Download Apache Spark using the below commands

$ cd /home/
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.1.tgz
$ tar -xvf spark-1.0.1.tgz
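
Note: the later steps refer to /home/spark-1.0.1-bin-hadoop2, which is the directory you get from the prebuilt "Hadoop 2" package rather than the plain source tarball. If you want to keep the same paths, the download would look roughly like this (the exact archive name on the mirror is an assumption, so adjust it if yours differs):

$ cd /home/
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.1-bin-hadoop2.tgz    # assumed name of the prebuilt Hadoop 2 package
$ tar -xvf spark-1.0.1-bin-hadoop2.tgz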

Step 2. Configuration in spark-env.sh

Create /home/spark-1.0.1-bin-hadoop2/conf/spark-env.sh and add the below lines to the file:

SPARK_JAVA_OPTS=-Dspark.driver.port=53411
HADOOP_CONF_DIR=$HADOOP_HOME/conf
SPARK_MASTER_IP=master.backtobazics.com
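
spark-env.sh is a plain shell script that Spark sources when the daemons start, so you can also size the workers here. The values below are only an illustration and not something this setup requires; tune them to your own nodes:

# optional worker sizing (example values, adjust to your hardware)
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g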

Create /home/spark-1.0.1-bin-hadoop2/conf/spark-defaults.conf and add the below lines to the file:

spark.master            spark://master.backtobazics.com:7077
spark.serializer        org.apache.spark.serializer.KryoSerializer
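
Any other spark.* property you want applied to every job can go in the same file. For example, a modest executor memory setting could be added like this (the value here is only an assumption; size it to your cluster):

spark.executor.memory   1g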

Append the hostnames of all the slave nodes to the /home/spark-1.0.1-bin-hadoop2/conf/slaves file:

master.backtobazics.com
slave1.backtobazics.com

[Repeat steps 1 and 2 above on the other slave nodes (slave1.backtobazics.com in our case), or copy the configured directory across as shown below.]
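
A quicker alternative to repeating the steps by hand is to copy the already configured Spark directory to each slave. A minimal sketch, assuming passwordless SSH as root between the nodes:

$ scp -r /home/spark-1.0.1-bin-hadoop2 root@slave1.backtobazics.com:/home/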

Step 3. Start/Stop Spark using the below commands

$ sh /home/spark-1.0.1-bin-hadoop2/sbin/start-all.sh
$ sh /home/spark-1.0.1-bin-hadoop2/sbin/stop-all.sh
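
To verify that the daemons came up, jps (shipped with the JDK) should list a Master process on master.backtobazics.com and a Worker process on every slave:

$ jps    # expect "Master" on the master node and "Worker" on each slave node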

Step 4. Start Spark shell using YARN

$ cd /home/spark-1.0.1-bin-hadoop2
$ ./bin/spark-shell --master yarn-client

The above command will launch the Spark shell and drop you at a Scala prompt. Now you can write your Spark code command by command, and each line is executed as soon as you enter it.
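
Before touching HDFS, a quick smoke test from the prompt could be a tiny parallelized count (this snippet is just an illustration, not part of the original walkthrough):

scala> sc.parallelize(1 to 1000).count()
res0: Long = 1000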

Step 5. Create a sample text file on HDFS for the WordCount example

Create a simple text file sample.txt with the following content:

apache spark is a fast, real time and extremely expressive computing system which executes job in distributed (clustered) environment.

Put the above file on HDFS using the following command:

$ hdfs dfs -copyFromLocal ./sample.txt /user/root/
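
You can confirm that the file landed where the next step expects it:

$ hdfs dfs -ls /user/root/
$ hdfs dfs -cat /user/root/sample.txt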

Step 6. Execute the following steps of the word count example

After you put your sample text file on HDFS, execute the following set of commands, which will perform a word count on the Spark cluster.

scala> val logFile = "hdfs://master.backtobazics.com:9000/user/root/sample.txt"
logFile: String = hdfs://master.backtobazics.com:9000/user/root/sample.txt

scala> val file = sc.textFile(logFile)
file: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:23
scala> val counts = file.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[7] at reduceByKey at <console>:25

scala> counts.collect()
res1: Array[(String, Int)] = Array((Python,1), (is,4), ((Installation,1), (same,1), (with,5), (MapReduce,1), (Java,3), (we,1), (This,1), ((clustered),1), (using,1), (CentOS,3), (aware,1), (post,1), (What,1), (setting,1), (computing,1), (lets,1), (computations.,1), 
(executes,1), (learn,1), (are,1), (assume,1), (YARN.,1), (provides,1), (expressive,1), (real,1), (cluster,,1), (So,,1), (Java,,1), (moving,1), (Apache,6), (how,1), (will,1), (compatible,1), (YARN,2), (as,1), ("",16), (Spark?,1), (capabilities.,1), (cluster,1), (Scala,1), (almost,1), (quite,1), (fast,,1), (Computing,1), (rich,1), (Node,1), (Spark,2), (job,1),(environment.,1), (about,1), (than,1), (7,2), (APIs,1), (on,5), (10x,1), (in,3), (which,1), (100x,1), (Install,2), (extremely,1), (along,1), (install,1), (distributed,1), ...
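
If you also want the result persisted rather than just collected to the driver, the same RDD can be written back to HDFS. The output path below is only an assumption; pick any directory that does not exist yet:

scala> counts.saveAsTextFile("hdfs://master.backtobazics.com:9000/user/root/wordcount-output")   // assumed output path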

That’s it….. You are done. :)

You can access the Spark UI in a browser at the below URL:
Spark Master URL: http://master.backtobazics.com:8088/

Check my post on Building Spark Application JAR using Scala and SBT for more information on submitting a Spark job on a YARN cluster.

Thank you for reading this post…..!!!!! Stay tuned for more such posts…..

9 thoughts on “6 Steps to Setup Apache Spark 1.0.1 (Multi Node Cluster) on CentOS”

  1. Thanks for your tutorial.
    1. Do you think the settings are the same for Ubuntu 12.04?
    2. Do I have to install Spark on the other Hadoop slave nodes or just the master node?

    1. Thanks for reading Ahmad. Here are your answers…

      1. Yes, the steps and settings are the same on Ubuntu as well.
      2. Of course you have to install Spark on the slave nodes. The slave nodes are where your Spark workers will run. As I mentioned under Step 2, you need to repeat steps 1 and 2 on all of your slave nodes.

      1. My installation is complete. Thanks a lot.
        Now, when I run the following command

        ./bin/spark-submit --class my.main.Class --master yarn-cluster

        Error: Must specify a primary resource (JAR or Python or R file)

        FYI:
        I created this directory: sudo mkdir -p /data/WordCount/src/main/scala/com/backtobazics/spark/wordcount
        After that I changed the owner of the directory using the chown command.
        Then I created the following file at the above path:
        WordCount.scala

        package com.backtobazics.spark.wordcount

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.SparkContext._

        object ScalaWordCount {
          def main(args: Array[String]) {
            val logFile = "hdfs://maroof:9000/user/hduser/samples/pg20417.txt"
            val sparkConf = new SparkConf().setAppName("Spark Word Count")
            val sc = new SparkContext(sparkConf)
            val file = sc.textFile(logFile)
            val counts = file.flatMap(_.split("\\|")).map(word => (word, 1)).reduceByKey(_ + _)
            counts.saveAsTextFile("hdfs://maroof:9000/user/root/output")
          }
        }

          1. Thank you for your help. I will read them and, if you don't mind, consult you in case of any issues.

            Thanks Again!!!

  2. Thanks for your interesting article. I would like to make a real example of creating a multi-node (up to 10 nodes) Hadoop 2.6.0 cluster with YARN and Spark on Ubuntu, with a Java word count example.

    I have another question: I have to make a genetic algorithm run on Hadoop and Apache Spark. As you know, a genetic algorithm is iterative by nature. My genetic algorithm will do extensive reads/writes of many files that contain sentences. What is your idea for handling this?

  3. Hi Varun,

    I am trying to install Spark on my system, but I am getting "bash: spark-shell: command not found….". Could you please provide a solution for this? I am not able to resolve it.

    Thanks,
    Girish

    1. Hi Girish, can you tell me exactly at which point you are getting this error? It is possible that you are executing the spark-shell command directly without adding $SPARK_HOME/bin to your $PATH variable. Please try to execute the following commands:

      1) $SPARK_HOME/bin/spark-shell --master yarn-client

      OR

      2) Set the path variable and execute the command
      - export PATH=$PATH:$SPARK_HOME/bin/
      - spark-shell --master yarn-client

      where $SPARK_HOME = [spark installation directory]
