Apache Spark reduce Example

In this Spark reduce example, we'll look at how the reduce operation works in Spark, with examples in Scala, Java, and Python. Spark's reduce operation is an action, so calling it triggers a full DAG execution of all the lazy transformations lined up before it.

The Spark RDD reduce function reduces the elements of an RDD using a specified commutative and associative binary operator.

Spark's reduce operation is very similar to the reduce method on Scala collections. It is an RDD action, which means it first triggers all the lined-up transformations on the base RDD (or in the DAG) that have not yet been executed, and then runs the action on the last RDD. It is also a wide operation, in the sense that executing it combines data spread across multiple partitions into a single result.

It accepts a function (which takes two arguments and returns a single value) that must be commutative and associative in the mathematical sense. Intuitively, this means the function produces the same result when applied repeatedly to the same RDD data, regardless of the order of the elements or how they are spread across partitions.
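For instance, addition has both properties while subtraction has neither, so reducing with subtraction can return different answers depending on how the data is partitioned. A quick spark-shell sketch (sc is the shell's built-in SparkContext):

  sc.parallelize(1 to 10, 2).reduce(_ + _)   // commutative and associative: always 55
  sc.parallelize(1 to 10, 2).reduce(_ - _)   // neither property holds: result can vary with partitioning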

[Figure: Apache Spark reduce example, summing the numbers 1 to 10 with reduce]

In the image above, we compute the cumulative sum of the numbers 1 to 10 using the reduce function. Here reduce accepts the function (accum, n) => (accum + n). Spark applies this function pairwise across the elements of RDD x: the accumulator accum starts from the first element, each remaining element is added to it in turn, and the final value is returned once all elements of x have been processed. Note that reduce returns this final value rather than another RDD.
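The same computation in the spark-shell looks like this (a minimal sketch; sc is the shell's built-in SparkContext):

  scala> val x = sc.parallelize(1 to 10, 2)
  scala> x.reduce((accum, n) => accum + n)
  res0: Int = 55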

Important points to note:

  • reduce is an action operation in Spark, so it triggers execution of the DAG and runs on the final RDD
  • It is a wide operation, as it combines data from multiple partitions and reduces it to a single value
  • It accepts a commutative and associative function as an argument
    • The parameter function must take two arguments of the same data type
    • The return type of the function must also be the same as the argument types (see the signature sketch below)
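In Scala, this constraint shows up directly in the method's signature: the function's two arguments and its return value all share the RDD's element type T.

  def reduce(f: (T, T) => T): T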

Let's see some examples.

Spark reduce Example Using Scala
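A minimal, self-contained sketch (assuming a local Spark setup; the app name and master URL are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  object SparkReduceExample {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("Spark reduce Example").setMaster("local[2]")
      val sc = new SparkContext(conf)

      // Distribute the numbers 1 to 10 across 2 partitions
      val x = sc.parallelize(1 to 10, 2)

      // Sum the elements; (accum, n) => accum + n is commutative and associative
      val sum = x.reduce((accum, n) => accum + n)
      println(s"Sum: $sum")  // Sum: 55

      sc.stop()
    }
  }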

Spark reduce Example Using Java 8
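A comparable sketch using Java 8 lambda syntax (again assuming a local setup; class and app names are placeholders):

  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class SparkReduceExample {
      public static void main(String[] args) {
          SparkConf conf = new SparkConf().setAppName("Spark reduce Example").setMaster("local[2]");
          JavaSparkContext sc = new JavaSparkContext(conf);

          // Distribute the numbers 1 to 10 across 2 partitions
          JavaRDD<Integer> x = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 2);

          // Sum the elements with a lambda; the function is commutative and associative
          Integer sum = x.reduce((accum, n) -> accum + n);
          System.out.println("Sum: " + sum);  // Sum: 55

          sc.close();
      }
  }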

Spark reduce Example Using Python
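And the PySpark version (a minimal sketch; app name and master URL are placeholders):

  from pyspark import SparkConf, SparkContext

  conf = SparkConf().setAppName("Spark reduce Example").setMaster("local[2]")
  sc = SparkContext(conf=conf)

  # Distribute the numbers 1 to 10 across 2 partitions
  x = sc.parallelize(range(1, 11), 2)

  # Sum the elements; the lambda is commutative and associative
  total = x.reduce(lambda accum, n: accum + n)
  print("Sum: %d" % total)  # Sum: 55

  sc.stop()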


4 comments

  1. Hi, can you please explain what a wide operation is?

    1. Hi Chandresh,

      I have explained Narrow and Wide Operations in this post….

    2. Hi,
      Why is the number 2 mentioned in the line below? I changed it to 3 and then ran the job, but I didn’t find any difference in the output.

      scala> val x = sc.parallelize(1 to 10, 2)

      Thanks

      1. The 2 stands for the level of parallelism, i.e. the number of partitions, and hence the number of independent tasks. Your RDD will be processed with 2 tasks.
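         For illustration, you can confirm the partition count in the spark-shell (a quick sketch):

         scala> val x = sc.parallelize(1 to 10, 2)
         scala> x.partitions.size
         res0: Int = 2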
