Apache Spark combineByKey Example

The Spark combineByKey RDD transformation is very similar to the combiner in Hadoop MapReduce. In this post, we'll walk through a Spark combineByKey example in depth and try to understand why this function is important.

Spark combineByKey is a transformation operation on a PairRDD (i.e. an RDD of key/value pairs). It is a wide operation, as it requires a shuffle in the last stage. As we saw earlier in the reduceByKey example, that transformation internally combines elements per partition; the same combiner-like behavior is present in the combineByKey function.

Spark combineByKey is a generic function that combines the elements for each key using a custom set of aggregation functions.

Internally, the combineByKey function efficiently combines the values of each PairRDD partition by applying the aggregation functions. The main objective of the combineByKey transformation is to transform any PairRDD[(K, V)] into an RDD[(K, C)], where C is the result of aggregating all values under key K.

Let’s understand this transformation step by step.

Spark combineByKey argument functions

The Spark combineByKey function takes the following three functions as arguments:

  1. createCombiner
  2. mergeValue
  3. mergeCombiners
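
For reference, the simplest form of combineByKey on a pair RDD has roughly the following shape (simplified from the Spark Scala API, where V is the value type and C is the combiner/result type):

  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): RDD[(K, C)]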

To start with, let's say we have an RDD of (studentName, subjectName, marks) and we want to compute the percentage for each student. The following are the steps to solve it with the combineByKey transformation.


createCombiner function of combineByKey

  • This function is the first argument of the combineByKey function
  • It is the first aggregation step for each key
  • It is executed when a new key is encountered in a partition
  • Execution of this lambda function is local to a partition on a node and operates on an individual value

In our case, since we are calculating a percentage, we need both a sum and a count as the aggregate. So our createCombiner function should initialize the accumulator as a tuple (sum, count); for the first value of a key this should be (value, 1), as sketched below.
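
A minimal sketch of such a createCombiner (assuming marks are stored as Double):

  // the first value seen for a key becomes the initial (sum, count) accumulator
  val createCombiner = (marks: Double) => (marks, 1)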

This function is similar to the first argument (i.e. zeroVal) of the aggregateByKey transformation.

mergeValue function of combineByKey

  • This second function executes when the next value for an existing key is passed to the combiner
  • It also executes locally on each partition of a node and combines all values
  • Its arguments are an accumulator and a new value
  • It merges the new value into the existing accumulator

In our case the accumulator of mergeValue is the tuple (sum, count). Whenever we get a new value, the marks are added to the first element and the second element (i.e. the counter) is incremented by 1, as shown below.
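
A minimal sketch of such a mergeValue function (same assumptions as above):

  // add the new marks to the running sum and bump the count
  val mergeValue = (acc: (Double, Int), marks: Double) => (acc._1 + marks, acc._2 + 1)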

This function is similar to the second argument (i.e. seqOp) of the aggregateByKey transformation.

mergeCombiners function of combineByKey

  • The final function is used to merge two accumulators (i.e. combiners) of a single key across partitions and generate the final expected result
  • Its arguments are two accumulators (i.e. combiners)
  • It merges the results of a single key coming from different partitions (see the sketch after this list)
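
A minimal sketch of such a mergeCombiners function (same assumptions as above):

  // merge two partial (sum, count) accumulators of the same key coming from different partitions
  val mergeCombiners = (acc1: (Double, Int), acc2: (Double, Int)) =>
    (acc1._1 + acc2._1, acc1._2 + acc2._2)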

This function is similar to the third argument (i.e. combOp) of the aggregateByKey transformation.

Important Points

  • Apache Spark combineByKey is a transformation operation, hence its evaluation is lazy
  • It is a wide operation, as it shuffles data in the last stage of aggregation and creates another RDD
  • It is recommended when you need to do further aggregation on grouped data
  • Use combineByKey when the return type differs from the source value type (i.e. when you cannot use reduceByKey), as the examples below show


Spark combineByKey Example in Scala
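
The following is a minimal, self-contained sketch of the student percentage example. The dataset values and variable names are assumptions, and sc is an existing SparkContext (e.g. in spark-shell):

  // (studentName, subjectName, marks) -- hypothetical sample data
  val studentMarks = sc.parallelize(Seq(
    ("Joseph", "Maths", 83.0), ("Joseph", "Physics", 74.0), ("Joseph", "Chemistry", 91.0),
    ("Jimmy",  "Maths", 69.0), ("Jimmy",  "Physics", 62.0), ("Jimmy",  "Chemistry", 97.0)))

  // key by student and keep only the marks as the value
  val marksByStudent = studentMarks.map { case (student, _, marks) => (student, marks) }

  // build a (sum, count) accumulator per student
  val sumAndCount = marksByStudent.combineByKey(
    (marks: Double) => (marks, 1),                                        // createCombiner
    (acc: (Double, Int), marks: Double) => (acc._1 + marks, acc._2 + 1),  // mergeValue
    (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2))   // mergeCombiners

  // percentage = total marks / number of subjects (each subject assumed to be out of 100)
  val percentages = sumAndCount.mapValues { case (sum, count) => sum / count }
  percentages.collect().foreach(println)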

PySpark combineByKey Example
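
And a minimal sketch of the same aggregation in PySpark (again, the dataset values and names are assumptions, and sc is an existing SparkContext, e.g. in the pyspark shell):

  # (studentName, subjectName, marks) -- hypothetical sample data
  student_marks = sc.parallelize([
      ("Joseph", "Maths", 83.0), ("Joseph", "Physics", 74.0), ("Joseph", "Chemistry", 91.0),
      ("Jimmy",  "Maths", 69.0), ("Jimmy",  "Physics", 62.0), ("Jimmy",  "Chemistry", 97.0)])

  # key by student and keep only the marks as the value
  marks_by_student = student_marks.map(lambda rec: (rec[0], rec[2]))

  # build a (sum, count) accumulator per student
  sum_and_count = marks_by_student.combineByKey(
      lambda marks: (marks, 1),                         # createCombiner
      lambda acc, marks: (acc[0] + marks, acc[1] + 1),  # mergeValue
      lambda a, b: (a[0] + b[0], a[1] + b[1]))          # mergeCombiners

  # percentage = total marks / number of subjects (each subject assumed to be out of 100)
  percentages = sum_and_count.mapValues(lambda acc: acc[0] / acc[1])
  print(percentages.collect())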

After looking at the examples, one may wonder what the main difference is between the aggregateByKey and combineByKey transformations. Let's find out.

Difference between aggregateByKey and combineByKey transformations

Spark combineByKey is the general transformation; the internal implementations of the groupByKey, reduceByKey and aggregateByKey transformations all use combineByKey.

aggregateByKey

  • Aggregations such as sum, avg, etc. are performed for each key in order to reduce the amount of shuffled data
  • If your aggregation function supports that kind of reduction, you can safely use the aggregateByKey operation
  • This scenario, where you perform the aggregation for a key locally within a partition, is referred to as a map-side combine
  • Spark aggregateByKey is similar to reduceByKey, but the return RDD type can differ for aggregateByKey and you can provide an initial value when performing the aggregation

combineByKey

  • The Spark combineByKey transformation is flexible in performing either a map-side or a reduce-side combine
  • Usage of the combineByKey transformation is more complex
  • You always need to implement three functions: createCombiner, mergeValue, mergeCombiners (the sketch after this list shows the same aggregation written with both transformations)
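
To make the comparison concrete, the following sketch writes the same (sum, count) aggregation both ways, assuming marksByStudent: RDD[(String, Double)] from the Scala example above:

  // aggregateByKey: the initial accumulator is supplied as a value (zeroVal)
  val viaAggregate = marksByStudent.aggregateByKey((0.0, 0))(
    (acc, marks) => (acc._1 + marks, acc._2 + 1),  // seqOp
    (a, b) => (a._1 + b._1, a._2 + b._2))          // combOp

  // combineByKey: the initial accumulator is built from the first value by a function (createCombiner)
  val viaCombine = marksByStudent.combineByKey(
    (marks: Double) => (marks, 1),
    (acc: (Double, Int), marks: Double) => (acc._1 + marks, acc._2 + 1),
    (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2))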

Hope you like this post. Stay tuned for more posts on other Spark transformations and actions.
