Apache Spark RDD : The Bazics

RDD stands for Resilient Distributed Dataset. An Apache Spark RDD is an abstract representation of data that is divided into partitions and distributed across the cluster. If you are familiar with the collection framework in Java, you can think of an RDD as a Java collection object, except that it is divided into many small pieces (referred to as partitions) that are distributed across multiple nodes.

Let’s dive a bit deeper into this…

By definition, an RDD (Resilient Distributed Dataset) is a large collection of data/objects spread across the cluster. This collection is made up of data partitions, each of which is a smaller collection of data stored in RAM or on disk.

Now, the basic properties of an RDD are:

  • RDD is immutable in nature
  • RDD is lazily evaluated
  • RDD is cacheable

Just remember these properties for now; we'll discuss them later in this post.

Resilient Distributed Datasets (RDDs) are a very fundamental part of Apache Spark to understand, and most programmers who are new to Apache Spark are confused about the concept and the way RDDs work. So let's make it very simple to understand.

[Figure: Apache Spark RDD]

Characteristics of Resilient Distributed Datasets (RDD)

  • An RDD is an array of references to partition objects
  • A partition is the basic unit of parallelism, and each partition holds a reference to a subset of the data
  • Partitions are assigned to the nodes of the cluster with respect to data locality and/or with minimum data transfer
  • Before processing, each partition is loaded into memory (RAM), as the short sketch below illustrates
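For instance, you can inspect an RDD's partitions and ask Spark to keep them in memory. Here is a minimal sketch in the Spark shell, assuming sc is the SparkContext the shell provides and using a purely hypothetical HDFS path:

//RDD backed by a file on HDFS, with roughly one partition per HDFS block
val logRDD = sc.textFile("hdfs:///logs/app.log")

//the RDD holds references to its partition objects
println(logRDD.partitions.length)

//ask Spark to keep the partitions in memory (RAM) once they have been computed
logRDD.cache()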

Ways of creating an RDD

  1. By parallelizing a collection (see the short sketch after this list)
    • This creates a parallel collection on the driver node, splits it into partitions, and distributes the partitions across the cluster nodes (in memory)
  2. By creating an RDD from an external source such as HDFS
    • This creates one partition per HDFS data block, on the nodes where the data is physically available
  3. By applying an operation to an existing RDD
    • RDDs are immutable in Spark, so whenever you apply a method to an existing RDD it creates a new RDD
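The example below covers the second and third ways. For the first way, a minimal sketch of parallelizing a collection in the Spark shell could look like this (sc is the SparkContext the shell provides; the collection and the partition count of 4 are purely illustrative):

//parallel collection created on the driver and split into 4 partitions
val numbersRDD = sc.parallelize(List(1, 2, 3, 4, 5), 4)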

For example:

//fileRDD: the file's lines distributed across the cluster
//(logFile is assumed to hold the path to the input file, e.g. on HDFS)
val fileRDD = sc.textFile(logFile)

//filteredRDD: a new RDD holding only the lines that equal "FATAL"
val filteredRDD = fileRDD.filter(_.equals("FATAL"))

In the above example, you can see that fileRDD is an RDD created from an external source (a file on HDFS). When you apply the filter() method to fileRDD, it creates filteredRDD, which is a new RDD, because RDDs in Spark are immutable.

Secondly, the filter() method is a transformation, so filteredRDD will not return the filtered data as soon as we execute the second line in the Spark REPL. Because all transformations in Spark are lazy, Spark only builds a graph of instructions (called the lineage) and executes all of the queued-up instructions together when an action method (like count(), collect(), etc.) is executed on the last RDD in the graph.
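For instance, nothing is computed when filteredRDD is defined; the whole lineage runs only when an action is called on it. A small sketch of such actions, continuing the example above:

//count() is an action: it reads the file, applies the filter and counts the matching lines
val fatalCount = filteredRDD.count()

//collect() is also an action: it brings the filtered lines back to the driver
val fatalLines = filteredRDD.collect()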

For more details, refer to my post on Spark RDD Operations : Transformations & Actions.

I would like to have your input on this post. Kindly give me your feedback by posting a comment below.
