Apache Spark's architecture enables you to write computation applications that are almost 10x faster than traditional Hadoop MapReduce applications. We have already discussed the features of Apache Spark in the introductory post.
Apache Spark doesn't provide any storage (like HDFS) or resource management capabilities. It is just a unified framework for processing large amounts of data in near real time. As shown in the figure below, the Apache Spark framework is organized into three major layers.
Spark Core Layer:
As you can see, Spark Core is the generalized layer of the framework. Spark Core defines all the basic functions; all other functionalities and extensions are built on top of it.
Other Language capabilities:
- Spark is written entirely in Scala (a functional as well as object-oriented programming language), which runs on top of the JVM
- Apart from Scala, Spark also supports languages like Java and Python
- Recently, Spark has added support for the statistical computing language R
Spark DataFrame API:
Spark also has a query engine that can query data in near real time. To access that engine, it provides the DataFrame APIs in Scala, Java, and Python.
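To give a feel for the style of such queries, here is a plain-Python analogy (not the actual Spark API) of a DataFrame-style filter-and-select over rows; the row data and the `select_where` helper are made up purely for illustration:

```python
# Plain-Python analogy of a DataFrame-style query (illustrative only;
# real Spark code would go through the SparkSession / DataFrame APIs).
rows = [
    {"name": "alice", "age": 34},
    {"name": "bob", "age": 21},
    {"name": "carol", "age": 45},
]

def select_where(rows, predicate, columns):
    """Mimic something like df.filter(predicate).select(columns)."""
    return [{c: r[c] for c in columns} for r in rows if predicate(r)]

adults = select_where(rows, lambda r: r["age"] >= 30, ["name"])
print(adults)  # [{'name': 'alice'}, {'name': 'carol'}]
```

The key idea is the same as in Spark's DataFrames: you describe *what* you want (a predicate and a projection) and the engine decides how to evaluate it over the data.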
Spark Ecosystems Layer:
Spark Ecosystem components are additional libraries operating on top of Spark Core and DataFrames. These components enrich the framework in areas such as SQL capabilities, machine learning, and real-time big data computation. The following are the main components of the Spark Ecosystem.
Spark SQL:
- A component on top of Spark Core with a new RDD abstraction called SchemaRDD (I'll explain RDDs in my post on Apache Spark RDDs)
- Exposes Spark DataFrames through JDBC APIs and supports structured and semi-structured data
- Provides a SQL-like interface over DataFrames to query data in CSV, JSON, Sequence, and Parquet file formats
Spark Machine Learning (MLlib):
- A common machine learning library for distributed, scalable, in-memory computation
- Considerably faster than Apache Mahout on Hadoop MapReduce
- Supports common learning algorithms like dimensionality reduction, clustering, classification, regression, and collaborative filtering
Spark Streaming:
- Adds the capability of processing streaming data in near real time
- Capable of ingesting data in micro batches (in the form of micro RDDs) and performing transformations on a series of micro batches (RDDs)
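The micro-batch idea above can be sketched in plain Python (this is not Spark Streaming code, just a toy illustration of grouping a stream into small batches and applying the same transformation to each batch):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an (in principle unbounded) event stream into fixed-size
    micro batches, the way Spark Streaming groups incoming data into
    small RDDs before processing."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# The same transformation (here: a word count) runs on every micro batch.
events = ["a", "b", "a", "c", "b", "a", "a"]
for batch in micro_batches(events, 3):
    counts = {}
    for w in batch:
        counts[w] = counts.get(w, 0) + 1
    print(counts)
```

In real Spark Streaming the batching interval is time-based rather than count-based, but the processing model is the same: one small RDD per interval, transformed like any other RDD.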
GraphX (Recently added):
- Provides distributed graph processing APIs on top of Spark Core
- Allows user-defined graph modeling with the Pregel abstraction API
BlinkDB (Recently added):
- An approximate query engine over large volumes of data
- Allows users to execute interactive SQL over large volumes of data, returning approximate results
- Capable of executing queries faster, with potential errors in aggregated values
- Useful for data insights where exact accuracy is not mandatory
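The speed-versus-accuracy trade-off behind such approximate engines is easy to demonstrate: aggregate over a random sample instead of the full dataset. The sketch below is plain Python, not BlinkDB's actual machinery, and `approx_mean` is a made-up helper name:

```python
import random

def approx_mean(data, sample_fraction=0.1, seed=42):
    """Estimate the mean from a random sample instead of scanning every
    row -- the trade-off an approximate query engine makes: a faster
    answer, with a small potential error in the aggregate."""
    rng = random.Random(seed)
    k = max(1, int(len(data) * sample_fraction))
    sample = rng.sample(data, k)
    return sum(sample) / len(sample)

data = list(range(100_000))  # exact mean is 49999.5
est = approx_mean(data, sample_fraction=0.05)
print(est)  # close to 49999.5, within a small sampling error
```

Systems like BlinkDB go further and also report error bounds with each result, so you know how far off the answer may be.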
Tachyon (Recently added):
- An in-memory distributed file system
- Enables faster file sharing across the cluster, as there is no disk IO overhead
- Caches frequently read files in memory, so that scheduled jobs can read shared files directly from the cache and execute faster
- Can be used for in-memory file sharing between MapReduce and Spark jobs
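In miniature, the caching behavior looks like the toy class below (plain Python, a single-machine stand-in for what Tachyon does across a cluster; the class name is invented for this example):

```python
import os
import tempfile

class InMemoryFileCache:
    """Toy single-node version of Tachyon's core idea: keep the bytes of
    frequently read files in memory so repeated reads skip disk IO."""
    def __init__(self):
        self._cache = {}

    def read(self, path):
        if path not in self._cache:            # first read: hit the disk
            with open(path, "rb") as f:
                self._cache[path] = f.read()   # ...and cache the bytes
        return self._cache[path]               # later reads: memory only

cache = InMemoryFileCache()
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"shared dataset")
    path = f.name

print(cache.read(path))  # first read goes to disk and populates the cache
os.remove(path)
print(cache.read(path))  # file is gone, but the cached copy still serves reads
```

Tachyon additionally distributes and replicates this cache across the cluster, which is what lets MapReduce and Spark jobs share files at memory speed.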
Spark Resource Manager Layer:
Unlike Apache Hadoop, Apache Spark doesn't come with a resource management module like YARN. It manages resources itself in standalone mode in a single-node cluster setup, but for distributed cluster mode it can be integrated with resource management modules like YARN or Mesos.
Apache Spark comes with these interesting capabilities. And now the question is…
How Apache Spark works in Distributed Cluster Mode?
Here is a high level view of Apache Spark Cluster.
So the major points to note in the above image are:
- A Spark cluster also follows a master-slave architecture, with two main processes
- Master Daemon (Driver/Master process), which executes on the Spark Master node
- The driver process manages executor processes via an external resource manager (like YARN or Mesos)
- Worker Daemon (Slave process), which executes on the Spark Worker (Slave) nodes
- Normally, to leverage data locality, this process runs on the Hadoop Data Node where the data is physically available
- Every cluster process spawns in a separate JVM
Let's have a look at the full picture of a distributed Apache Spark cluster.
The Spark cluster setup above runs in distributed mode with YARN.
- On the Master node, the Spark Master daemon is running, which creates the SparkContext object and shares it across the slave nodes (where tasks are distributed)
- On the Spark Slave nodes, the Worker daemon is running, which keeps track of the executor process(es) on that node
- The YARN Resource Manager daemon may reside on the Master node (in small clusters) or on a separate machine (in large clusters)
- The YARN Resource Manager keeps track of the resources of each Node Manager
- The YARN Node Manager manages the resources on its Slave node (the same node where the Executor and Data Node daemons are executed)
- Each slave node may have one or multiple executors based on resource availability, and each executor can have multiple task processes based on scheduling
- Each task is executed in a separate JVM under its executor
How Job is executed on Spark Cluster?
- When the driver process submits a job, it sends the request to the YARN Resource Manager first
- The YARN Resource Manager checks for data locality and finds the best available slave nodes for task scheduling
- Then the job is split into different stages, and each stage is split into tasks based on data locality and resources
- Prior to task execution, the driver daemon sends the necessary job details to each node
- The driver keeps track of the currently executing tasks and updates the job monitoring status on the master node (it can be checked with the Master node UI)
- Once the job is completed, all the nodes share the aggregated values with the master node
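The split-run-aggregate flow above can be sketched locally in plain Python; here local threads stand in for executors, and `run_job` is an invented helper, not Spark's scheduler:

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(data, num_tasks=4):
    """Rough local sketch of the flow above: split the job into tasks
    over partitions of the data, run each task on a worker, then
    aggregate the partial results back at the driver. (Real Spark
    schedules tasks on executors across the cluster, with data
    locality in mind; here workers are just threads on one machine.)"""
    chunk = max(1, len(data) // num_tasks)
    partitions = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        partial_sums = list(pool.map(sum, partitions))  # one task per partition
    return sum(partial_sums)  # the "driver" aggregates the partial results

print(run_job(list(range(10))))  # 45
```

The partitioning step mirrors how a Spark stage is split into tasks, and the final sum mirrors the nodes sharing their aggregated values back to the master.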
I hope you liked this article. Let me know your thoughts, and stay tuned for more such articles! 🙂