Apache Spark reduce Example

Here in spark reduce example, we’ll understand how reduce operation works in Spark with examples in languages like Scala, Java and Python. Spark reduce operation is an action kind of operation and it triggers a full DAG execution for all lined up lazy instructions.

Spark RDD reduce function reduces the elements of this RDD using the specified commutative and associative binary operator.

Spark reduce operation is almost similar as reduce method in Scala. It is an action operation of RDD which means it will trigger all the lined up transformation on the base RDD (or in the DAG) which are not executed and than execute the action operation on the last RDD. This operation is also a wide operation. In the sense the execution of this operation results in distributing the data across the multiple partitions.

It accepts a function with (which accepts two arguments and returns a single element) which should be Commutative and Associative in mathematical nature. That intuitively means, this function produces same result when repetitively applied on same set of RDD data with multiple partitions irrespective of element’s order.

Apache Spark reduce example
Apache Spark reduce example

In above image you can see that are doing cumulative sum of numbers from 1 to 10 using reduce function. Here reduce method accepts a function (accum, n) => (accum + n). This function initialize accum variable with default integer value 0, adds up an element every when reduce method is called and returns final value when all elements of RDD X are processed. It returns the final value rather than another RDD.

Important points to note are,

  • reduce is an action operation in Spark hence it triggers execution of DAG and gets execute on final RDD
  • It is a wide operation as it is shuffling data from multiple partitions and reduces to a single value
  • It accepts a Commutative and Associative function as an argument
    • The parameter function should have two arguments of the same data type
    • The return type of the function also must be same as argument types

Let’s see some examples,

Spark reduce Example Using Scala
// reduce numbers 1 to 10 by adding them up
scala> val x = sc.parallelize(1 to 10, 2)
scala> val y = x.reduce((accum,n) => (accum + n)) 
y: Int = 55

// shorter syntax
scala> val y = x.reduce(_ + _) 
y: Int = 55

// same thing for multiplication
scala> val y = x.reduce(_ * _) 
y: Int = 3628800
Spark reduce Example Using Java 8
package com.backtobazics.sparkexamples;

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;

public class ReduceExample {
    public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext();
        //Reduce Function for cumulative sum
        Function2<Integer, Integer, Integer> reduceSumFunc = (accum, n) -> (accum + n);
        //Reduce Function for cumulative multiplication
        Function2<Integer, Integer, Integer> reduceMulFunc = (accum, n) -> (accum * n);
        // Parallelized with 2 partitions
        JavaRDD<Integer> rddX = sc.parallelize(
                Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
        // cumulative sum
        Integer cSum = rddX.reduce(reduceSumFunc);
        // another way to write
        Integer cSumInline = rddX.reduce((accum, n) -> (accum + n));
        // cumulative multiplication
        Integer cMul = rddX.reduce(reduceMulFunc);
        // another way to write
        Integer cMulInline = rddX.reduce((accum, n) -> (accum * n));
        System.out.println("cSum: " + cSum + ", cSumInline: " + cSumInline + 
                "\ncMul: " + cMul + ", cMulInline: " + cMulInline);

// Output:
// cSum: 55, cSumInline: 55
// cMul: 3628800, cMulInline: 3628800

Spark reduce Example Using Python
# reduce numbers 1 to 10 by adding them up
>>> x = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
>>> cSum = x.reduce(lambda accum, n: accum + n)
>>> print(cSum)

# reduce numbers 1 to 10 by multiplying them
>>> cMul = x.reduce(lambda accum, n: accum * n)
>>> print(cMul)

# by defining a lambda reduce function 
>>> def cumulativeSum(accum, n):
...     return accum + n
>>> cSum = x.reduce(cumulativeSum)
>>> print(cSum)


2 thoughts on “Apache Spark reduce Example”

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>