Introduction to Apache Spark

Before introducing Apache Spark, we need to understand why it is needed in the first place. So let’s rewind to the earlier architecture of distributed data processing for big data analytics. The best-known framework for large-scale data processing is Hadoop MapReduce. Hadoop MapReduce solves certain problems in distributed computation, but it has its own limitations when it comes to data scale and processing time.

Apache Spark

Eventually, big data experts around the world built specialized systems on top of Hadoop to solve particular problems such as graph processing, efficient implementation of iterative algorithms, real-time query engines, etc. As you can see in the following figure, components like Impala, Mahout, Tez, GraphLab, etc. grew up on or around Hadoop, each for a different purpose.

Specialized systems on top of Hadoop

What is Apache Spark?

Apache Spark is a general-purpose engine that combines the specialties of all the above components into a single component and gives you a common set of APIs for the problems those components solve. That way, you don’t have to learn each of them individually: just learn Spark and you are done.

Apache Spark is…

  • An in-memory computation engine (it doesn’t provide distributed storage or resource management capabilities itself)
  • Up to about 100x faster than Hadoop MapReduce for in-memory computations
  • Up to about 10x faster than Hadoop MapReduce for computations that involve disk IO

Designed for…

  • Faster batch processing
  • Systems that need to implement iterative algorithms (e.g. the PageRank algorithm)
  • Processing of streaming data
  • Applications that require near-real-time and interactive query processing
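To make "iterative algorithm" concrete, here is a minimal PageRank sketch in plain Python. It is not Spark code and not how Spark implements anything; the function name `pagerank`, the tiny `links` graph, and the damping value of 0.85 are all assumptions chosen for illustration. The point is only that every pass consumes the ranks produced by the previous pass.

```python
def pagerank(links, iterations=20, damping=0.85):
    """Toy PageRank. Assumes every link target also appears as a key in `links`
    and every page has at least one outgoing link."""
    pages = list(links)
    n = len(pages)
    # Start with an equal rank for every page.
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Each round begins with the "random jump" share for every page.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page splits its current rank evenly across its outgoing links.
            share = damping * ranks[page] / len(outgoing)
            for target in outgoing:
                new_ranks[target] += share
        # The output of this round is the input of the next one.
        ranks = new_ranks
    return ranks


# Hypothetical three-page graph: a links to b and c, b links to c, c links to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
```

Notice the last line inside the loop: the result of one round has to be available to the next round, which is exactly where the storage choice (disk or memory) starts to matter.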

Why move away from MapReduce and switch to Spark?

There are some very strong reasons to prefer Spark over the MapReduce framework.

  • Computation is considerably faster than with MapReduce, because MapReduce relies on disk IO while Spark works on in-memory data most of the time
  • MapReduce is very slow for graph processing and for iterative algorithms
  • Spark programs can be written in a functional style, which is modular and handy for programmers
  • Spark offers simpler APIs for streaming, batch processing, ad-hoc queries, machine learning, graph processing, etc., so there is no need to learn other specialized frameworks
  • Writing a Spark application is much simpler, as it takes fewer lines of code compared to MapReduce
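To give a feel for the functional style mentioned above, here is a word count written as a small chain of transformations. This is plain Python from the standard library, not the Spark API; it only mirrors the general shape of a flatten, map, and reduce-by-key pipeline. The sample `lines` and the helper `add_pair` are made up for illustration.

```python
from functools import reduce
from itertools import chain

# Hypothetical input: a few lines of text.
lines = ["spark is fast", "spark is general", "hadoop is batch"]

# Step 1: split every line into words and flatten the result into one stream.
words = chain.from_iterable(line.split() for line in lines)

# Step 2: turn each word into a (word, 1) pair.
pairs = map(lambda word: (word, 1), words)


# Step 3: sum the 1s per word.
def add_pair(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts


word_counts = reduce(add_pair, pairs, {})
```

Each step is a small function applied to the output of the previous step, which is what keeps this style compact and easy to rearrange.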

Why is iterative processing slow with Hadoop MapReduce?

With MapReduce, it is not always the case that you run a single job on HDFS data and you are done. In some use cases, one job takes HDFS data as input and writes its output to HDFS, and then another MapReduce job runs that uses the output of the previous job as its input.

Now think for a moment: every time a MapReduce job is executed, it reads its data from HDFS (ultimately from disk) and writes its output to HDFS (ultimately to disk). If your workload needs many such iterations, it will be very slow because of the disk IO at every iteration.

Apache Spark, by contrast, keeps the output of the previous stage in memory, so the next iteration can retrieve it from memory, which is much faster than disk IO.
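The difference can be sketched with a toy model in plain Python. This is neither MapReduce nor Spark; local JSON files stand in for HDFS, and the names `step`, `run_with_disk_round_trips` and `run_in_memory` are hypothetical. It just counts how many disk reads and writes a chain of jobs needs when each job hands its output to the next one through files.

```python
import json
import os
import tempfile


def step(values):
    """One 'job': a trivial transformation standing in for real work."""
    return [v + 1 for v in values]


def run_with_disk_round_trips(values, iterations, workdir):
    """Chained jobs where every job reads its input from disk and writes its output to disk."""
    path = os.path.join(workdir, "input.json")
    with open(path, "w") as f:  # the initial data set, already "on HDFS"
        json.dump(values, f)
    disk_ops = 0
    output = values
    for i in range(iterations):
        with open(path) as f:  # read the previous job's output from disk
            current = json.load(f)
        output = step(current)
        path = os.path.join(workdir, f"output_{i}.json")
        with open(path, "w") as f:  # write this job's output back to disk
            json.dump(output, f)
        disk_ops += 2  # one read plus one write per iteration
    return output, disk_ops


def run_in_memory(values, iterations):
    """Same chain of steps, but the intermediate result never leaves memory."""
    current = list(values)
    for _ in range(iterations):
        current = step(current)
    return current


with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_ops = run_with_disk_round_trips([1, 2, 3], 5, workdir)
memory_result = run_in_memory([1, 2, 3], 5)
```

Both versions compute the same result, but the first one pays for a read and a write on every iteration. With real data sizes and a distributed file system, that per-iteration cost is what makes long job chains slow.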

I hope you enjoyed this article. We’ll learn more about Spark in my upcoming posts. Stay tuned…

