Before we move ahead, let's learn a bit about Apache Spark.
So, what is Apache Spark?
Apache Spark is a fast, real-time and extremely expressive computing system that executes jobs in a distributed (clustered) environment.
It is compatible with Apache Hadoop and can run up to 10x faster than Hadoop MapReduce for on-disk computation and up to 100x faster for in-memory computation. It provides rich APIs in Java, Scala and Python, along with functional programming capabilities.
This post will give you a clear idea of how to set up a Spark multi-node cluster on CentOS with Hadoop and YARN.
Before moving forward, I assume that you already know how to install Java 7 and Apache Hadoop with YARN on a CentOS cluster:
- Steps to Install Java 7 On CentOS (Installation process is same as Java 8)
- Steps to Install Apache Hadoop with YARN
Step 1. Download Apache Spark using the commands below
$ cd /home/
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.1-bin-hadoop2.tgz
$ tar -xvf spark-1.0.1-bin-hadoop2.tgz
Step 2. Configuration in spark-env.sh
Open /home/spark-1.0.1-bin-hadoop2/conf/spark-env.sh and add the lines below to the file.
Open /home/spark-1.0.1-bin-hadoop2/conf/spark-defaults.conf and add the lines below to the file.
Append hostnames of all the slave nodes in
[Repeat steps 1 and 2 above on the other slave nodes (slave1.backtobazics.com in our case)]
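Appending the slave hostnames can be scripted so that running it again does not create duplicate entries. The sketch below is only an illustration: `add_slaves` is a hypothetical helper, not part of Spark, and the path to the slaves file is passed in as an argument because it is not spelled out above.

```shell
# add_slaves FILE HOST...  (hypothetical helper, not part of Spark)
# Appends each HOST to FILE unless an identical line is already present.
add_slaves() {
  file="$1"
  shift
  touch "$file"
  for host in "$@"; do
    # -x matches the whole line, -F treats the hostname as a literal string
    grep -qxF -- "$host" "$file" || printf '%s\n' "$host" >> "$file"
  done
}

# Example (the file path is a placeholder you must supply):
#   add_slaves "$SLAVES_FILE" slave1.backtobazics.com
```

Because existing lines are checked first, the helper is safe to call on every node as part of repeating the setup.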
Step 3. Start/Stop Spark using the commands below
$ sh /home/spark-1.0.1-bin-hadoop2/sbin/start-all.sh
$ sh /home/spark-1.0.1-bin-hadoop2/sbin/stop-all.sh
Step 4. Start Spark shell using YARN
$ cd /home/spark-1.0.1-bin-hadoop2
$ ./bin/spark-shell --master yarn-client
Step 5. Create a sample text file on HDFS for the WordCount example
Create a simple text file named sample.txt with the following content.
apache spark is a fast, real time and extremely expressive computing system which executes job
in distributed (clustered) environment.
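One way to create this file from the shell is a here-document. This is a minimal sketch that writes `sample.txt` into the current directory, which is where the copy command in this step expects to find it.

```shell
# Write the two lines of sample text to ./sample.txt.
# The quoted 'EOF' keeps the shell from expanding anything inside the text.
cat > sample.txt <<'EOF'
apache spark is a fast, real time and extremely expressive computing system which executes job
in distributed (clustered) environment.
EOF
```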
Put the above file on HDFS using the following command.
$ hdfs dfs -copyFromLocal ./sample.txt /user/root/
Step 6. Execute the following steps of the word count example
After you put your sample text file on HDFS, execute the following commands, which perform a word count on the Spark cluster.
scala> val logFile = "hdfs://master.backtobazics.com:9000/user/root/sample.txt"
logFile: String = hdfs://master.backtobazics.com:9000/user/root/sample.txt
scala> val file = sc.textFile(logFile)
file: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD at textFile at <console>:23
scala> val counts = file.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD at reduceByKey at <console>:25
scala> counts.collect()
res1: Array[(String, Int)] = Array((Python,1), (is,4), ((Installation,1), (same,1), (with,5), (MapReduce,1), (Java,3), (we,1), (This,1), ((clustered),1), (using,1), (CentOS,3), (aware,1), (post,1), (What,1), (setting,1), (computing,1), (lets,1), (computations.,1),
(executes,1), (learn,1), (are,1), (assume,1), (YARN.,1), (provides,1), (expressive,1), (real,1), (cluster,,1), (So,,1), (Java,,1), (moving,1), (Apache,6), (how,1), (will,1), (compatible,1), (YARN,2), (as,1), ("",16), (Spark?,1), (capabilities.,1), (cluster,1), (Scala,1), (almost,1), (quite,1), (fast,,1), (Computing,1), (rich,1), (Node,1), (Spark,2), (job,1),(environment.,1), (about,1), (than,1), (7,2), (APIs,1), (on,5), (10x,1), (in,3), (which,1), (100x,1), (Install,2), (extremely,1), (along,1), (install,1), (distributed,1), ...
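To sanity-check the Spark result on a small file, the same count can be approximated locally with standard Unix tools. This is only a sketch: `word_count` is a hypothetical helper, and unlike `split(" ")` it splits on any whitespace and drops empty tokens, so it will not report an entry for the empty string.

```shell
# word_count FILE  (hypothetical helper for a local cross-check)
# Prints one "word count" pair per line.
word_count() {
  # 1. turn every run of whitespace into a single newline (one word per line)
  # 2. drop empty lines
  # 3. sort so that equal words are adjacent
  # 4. count adjacent duplicates
  # 5. print each result as "word count"
  tr -s '[:space:]' '\n' < "$1" | grep -v '^$' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Example: word_count sample.txt
```

Comparing a few words from this output with the array returned by the Spark shell is a quick way to confirm that the job read the file you intended.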
That's it. You are done. 🙂
You can access the Spark UI in a browser at the URL below.
Spark Master URL: http://master.backtobazics.com:8088/
Check my post on Building Spark Application JAR using Scala and SBT for more information on submitting a Spark job to a YARN cluster.
Thank you for reading this post! Stay tuned for more such posts.