Apache Spark RDD Operations: Transformation and Action

We have already discussed about Spark RDD in my post Apache Spark RDD : The Bazics. In this post we’ll learn about Spark RDD Operations in detail. As we know Spark RDD is distributed collection of data and it supports two kind of operations on it Transformations and Actions.

Apache Spark RDD Operations
  • Transformations
  • Actions

Transformation Operations

Transformations are kind of operations which will transform your RDD data from one form to another. And when you apply this operation on any RDD, you will get a new RDD with transformed data (RDDs in Spark are immutable, Remember????). Operations like map, filter, flatMap are transformations.

Now there is a point to be noted here and that is when you apply the transformation on any RDD it will not perform the operation immediately. It will create a DAG(Directed Acyclic Graph) using the applied operation, source RDD and function used for transformation. And it will keep on building this graph using the references till you apply any action operation on the last lined up RDD. That is why the transformation in Spark are lazy.

Action Operations

This kind of operation will also give you another RDD but this operation will trigger all the lined up transformation on the base RDD (or in the DAG) and than execute the action operation on the last RDD. Operations like collect, count, first, saveAsTextFile are actions.

Let’s take one example for better understanding…..

val logFile = "hdfs://master.backtobazics.com:9000/user/root/sample.txt"
val lineRDD = sc.textFile(logFile)
//Transformation 1 -> DAG created
//{DAG: Start -> [sc.textFile(logFile)]}

val wordRDD = lineRDD.flatMap(_.split(" ")) 
//Transformation 2 -> wordRDD DAG updated
//{DAG: Start -> [sc.textFile(logFile)] 
//              -> [lineRDD.flatMap(_.split(" "))]}

val filteredWordRDD = wordRDD.filter(_.equalsIgnoreCase("the")) 
//Transformation 3 -> filteredWordRDD DAG updated
//{DAG: Start -> [sc.textFile(logFile)] 
//              -> [lineRDD.flatMap(_.split(" "))]
//                -> [wordRDD.filter(_.equalsIgnoreCase("the"))]}

//Action: collect
//Execute DAG & collect result to driver node

What is Lineage and how it is useful?

We have talked about dependency of RDDs in transformation operation. According to that one RDD may be dependent on none or more than one RDDs. So eventually it will create one directed acyclic graph(DAG) from start to end which is referred as lineage in spark.

Lineage is a very important aspect for fault tolerance in Apache Spark. Execution of any operation in spark distributed to various node and when any node goes down or executor process on any node is crashed than Spark automatically reschedules the killed task to another suitable node and recovers the intermediate state of the down node using this lineage. It will relaunches all operations from lineage and computes the intermediate data same as that was in dead node.

Narrow & Wide Operations

Spark RDD is the collection of references to the various partitions distributed across the cluster. Spark RDD operations can also be categorized in two categories narrow operations and wide operations based intermediate data shuffling between the partitions.

Narrow Operations

RDD operations like map, union, filter can operate on a single partition and map the data of that partition to resulting single partition. These kind of operations which maps data from one to one partition are referred as Narrow operations. Narrow operations doesn’t required to distribute the data across the partitions.

Spark RDD Narrow Operations
Spark RDD Narrow Operations without co-partitioned data
Spark RDD Narrow Operations Co-Partitioned
Spark RDD Narrow Operations with co-partitioned data

Let’s have a look at above figure for narrow RDD operations. You can see that the data subsets from the base RDDs partitions are mapped to only one partition of the new RDD.

 Wide Operations

RDD operations like groupByKey, distinct, join may require to map the data across the partitions in new RDD. These kind of operations which maps data from one to many partitions are referred as Wide operations. Narrow operations doesn’t required to distribute the data across the partitions. In most of the cases Wide operations distribute the data across the partitions. These considered to be more costly than narrow operations due to data shuffling.

Spark RDD Wide Operations
Spark RDD Wide Operations

Let’s have a look at above figure for wide RDD operations. You can see that the data subsets from the base RDDs are shuffled and divided to multiple partitions of the new RDD.

I’ll try to explain most common operations on Spark RDDs with some nice graphics in my upcoming post.

Thank you for your time and Stay tuned for such new posts…..!!!!! :)

One thought on “Apache Spark RDD Operations: Transformation and Action”

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>