Apache Spark reduceByKey Example

Looking at the Spark reduceByKey example, we can say that reduceByKey is one step beyond the reduce function in Spark, with the key difference that it is a transformation operation rather than an action. Let’s understand this operation through some examples in the Scala, Java and Python languages.

The Spark RDD reduceByKey function merges the values for each key using an associative and commutative reduce function.

Basically, the reduceByKey function works only on RDDs whose elements are key-value pairs (i.e. RDDs of tuples). It is a transformation operation, which means it is lazily evaluated. We need to pass one associative function as a parameter; that function is applied to the source RDD and creates a new RDD with the resulting values (i.e. key-value pairs). This is a wide operation, as data shuffling may happen across the partitions.

The reduce function (which accepts two arguments and returns a single element) should be commutative and associative in the mathematical sense. Intuitively, this means the function produces the same result when applied repeatedly to the same set of RDD data spread over multiple partitions, irrespective of the order of the elements. Additionally, reduceByKey performs the merging locally on each partition using this function and then sends the partially reduced records across the partitions to prepare the final result.
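This requirement can be illustrated with a plain-Python sketch (it mimics the combine-then-shuffle strategy, not Spark’s actual implementation): a commutative and associative function such as addition gives the same total regardless of how the data is partitioned, while a non-associative function such as subtraction does not.

```python
from functools import reduce

def two_stage_reduce(partitions, f):
    """Reduce each partition locally, then reduce the partial results,
    mimicking reduceByKey's local-combine-then-shuffle strategy."""
    partials = [reduce(f, part) for part in partitions]
    return reduce(f, partials)

# Two different partitionings of the same data.
split_a = [[1, 2, 3], [4, 5, 6]]
split_b = [[1, 2], [3, 4, 5], [6]]

add = lambda x, y: x + y   # commutative and associative
sub = lambda x, y: x - y   # neither commutative nor associative

print(two_stage_reduce(split_a, add))  # 21
print(two_stage_reduce(split_b, add))  # 21 -- same result, safe to use
print(two_stage_reduce(split_a, sub))  # 3
print(two_stage_reduce(split_b, sub))  # -1 -- result depends on partitioning
```

Because subtraction is not associative, its result changes with the partition boundaries, which is exactly why Spark requires the function passed to reduceByKey to be commutative and associative.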

[Figure: Apache Spark reduceByKey Example]

In the above image you can see that RDD X holds a set of paired elements such as (a, 1) and (b, 1), spread across 3 partitions. reduceByKey accepts a function (accum, n) => (accum + n), which sums up the values for each key, and returns the final RDD Y with the total count paired with each key. Before shuffling the data across the partitions, it performs the same aggregation locally within each partition.
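The behaviour shown in the image can be emulated in plain Python (a sketch of the semantics only, not Spark’s implementation): each partition is combined locally first, then the partial results are merged with the same function.

```python
def local_combine(partition, f):
    """Map-side combine: reduce the values per key within one partition."""
    acc = {}
    for key, value in partition:
        acc[key] = f(acc[key], value) if key in acc else value
    return acc

def reduce_by_key(partitions, f):
    """Merge the locally combined results across partitions."""
    result = {}
    for part in partitions:
        for key, value in local_combine(part, f).items():
            result[key] = f(result[key], value) if key in result else value
    return result

# RDD X: pairs like (a, 1) and (b, 1) spread over 3 partitions
x = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("b", 1)],
    [("a", 1), ("b", 1)],
]

y = reduce_by_key(x, lambda accum, n: accum + n)
print(sorted(y.items()))  # [('a', 3), ('b', 4)]
```

Note that only one partially reduced record per key leaves each partition, which is what makes the shuffle cheaper than grouping the raw pairs first.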

Important points to note:

  • reduceByKey is a transformation operation in Spark, hence it is lazily evaluated
  • It is a wide operation, as it shuffles data from multiple partitions and creates another RDD
  • Before sending data across the partitions, it also merges the data locally using the same associative function, which optimizes the data shuffle
  • It can only be used with RDDs whose elements are key-value pairs
  • It accepts a commutative and associative function as an argument
    • The parameter function should take two arguments of the same data type
    • The return type of the function must also be the same as the argument type

This function has three variants

  1. reduceByKey(function)
  2. reduceByKey(function, [numPartition])
  3. reduceByKey(partitioner, function)
  • Variant 1 generates output partitioned with the existing (or default hash) partitioner
  • Variant 2 generates hash-partitioned output with the number of partitions given by numPartition
  • Variant 3 generates output partitioned using the Partitioner object referenced by partitioner

All of the above variants return an RDD of key-value pairs.
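The effect of hash partitioning on the output can be sketched in plain Python (illustrative only; Spark’s actual HashPartitioner also handles details such as negative hash codes): each reduced pair is assigned to a partition by hashing its key modulo the partition count.

```python
def hash_partition(pairs, num_partitions):
    """Assign each key-value pair to an output partition by hashing its key,
    mirroring what a hash partitioner does with reduceByKey's output."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

# Reduced output, e.g. from a word count
reduced = [("a", 3), ("b", 4), ("c", 1)]
out = hash_partition(reduced, 2)
# Every pair lands in exactly one of the 2 partitions, and all pairs
# with the same key always land in the same partition.
print(out)
```

This is why all three variants co-locate identical keys: the partition index depends only on the key, so every value for a given key ends up on the same partition.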

Let’s look at some examples.

Spark reduceByKey Example Using Scala

Spark reduceByKey Example Using Java 8

PySpark reduceByKey Example

