Apache Spark groupByKey Example

The Apache Spark groupByKey example is quite similar to reduceByKey. It is again a transformation operation, and also a wide operation, because it demands a data shuffle. The Spark groupByKey function takes key-value pairs (K, V) as input and produces an RDD of keys paired with iterables of their values. Let’s try to understand the function in detail. At the end of this post we’ll also compare it with reduceByKey from an optimization point of view.

The Spark RDD groupByKey function collects the values for each key in the form of an iterator

As the name suggests, the groupByKey function in Apache Spark just groups all values belonging to a single key. Unlike reduceByKey, it doesn’t perform any operation on the final output. It simply groups the data and returns it in the form of an iterator. It is a transformation operation, which means its evaluation is lazy.

Because values for the same key can sit in any partition of the source RDD, this function has to shuffle all values with the same key to a single partition, unless the source RDD is already partitioned by key. This shuffling makes groupByKey a wide transformation. The sketch below illustrates the pre-partitioned case.
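Here is a minimal Scala sketch of that special case (the data is made up, and `sc` is assumed to be an existing SparkContext, as in spark-shell): pre-partitioning the source RDD by key lets groupByKey reuse that partitioning instead of shuffling again.

```scala
import org.apache.spark.HashPartitioner

// Made-up sample data; `sc` is an existing SparkContext.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// Shuffle once up front and cache the partitioned RDD.
val prePartitioned = pairs.partitionBy(new HashPartitioner(4)).cache()

// groupByKey picks up the existing HashPartitioner, so no
// further shuffle is needed here.
val grouped = prePartitioned.groupByKey()
```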

It is slightly different from the groupBy() transformation: groupByKey() requires the source RDD to consist of key-value pairs, whereas with groupBy() the source RDD may or may not have keys. The groupBy() transformation also needs a function to derive the key, which is not needed in the case of the Spark groupByKey() function.
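The contrast in a minimal Scala sketch (sample data is made up):

```scala
// groupBy: no keys in the source RDD; a function derives the key.
val words = sc.parallelize(Seq("apple", "banana", "avocado", "blueberry"))
val byFirstLetter = words.groupBy(word => word.charAt(0))
// => RDD[(Char, Iterable[String])]

// groupByKey: the source RDD must already be key-value pairs.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val grouped = pairs.groupByKey()
// => RDD[(String, Iterable[Int])]
```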


Important Points

  • Apache Spark groupByKey is a transformation operation, hence its evaluation is lazy
  • It is a wide operation, as it shuffles data from multiple partitions and creates another RDD
  • This operation is costly, as it doesn’t use a combiner local to a partition to reduce the data transfer
  • It is not recommended when you need to do further aggregation on grouped data; prefer reduceByKey in that case (see the sketch after this list)
  • By default, groupByKey results in hash-partitioned RDDs
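To make the cost point concrete, here is a short Scala sketch (made-up data) comparing a per-key sum done via groupByKey with the same sum done via reduceByKey:

```scala
val sales = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 2), ("b", 5)))

// Costly: every individual value is shuffled across the network,
// and the sum only happens after grouping.
val sumsViaGroup = sales.groupByKey().mapValues(_.sum)

// Cheaper: reduceByKey combines values within each partition first
// (map-side combine), so far less data crosses the network.
val sumsViaReduce = sales.reduceByKey(_ + _)
```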

This function has three variants:

  1. groupByKey()
  2. groupByKey(numPartitions)
  3. groupByKey(partitioner)
  • The first variant groups the values for each key in the RDD into a single sequence
  • The second variant accepts the number of partitions for the resulting RDD
  • The third variant uses the given partitioner to create the partitions of the resulting RDD (all three are shown in the sketch after this list)
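A quick Scala sketch of the three variants (made-up data, existing SparkContext `sc` assumed):

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// 1. Default variant: uses the default partitioner / parallelism.
val g1 = pairs.groupByKey()

// 2. Explicit number of partitions in the result RDD.
val g2 = pairs.groupByKey(8)

// 3. Custom partitioner for the result RDD.
val g3 = pairs.groupByKey(new HashPartitioner(8))
```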

Spark groupByKey Example Using Scala
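A minimal, self-contained Scala sketch (the app name, object name, and sales data are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

object GroupByKeyExample {
  def main(args: Array[String]): Unit = {
    // Local Spark session for the example.
    val spark = SparkSession.builder()
      .appName("GroupByKeyExample")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // (product, quantity) pairs, made up for illustration.
    val sales = sc.parallelize(Seq(
      ("apple", 2), ("orange", 1), ("apple", 5),
      ("banana", 3), ("orange", 4)
    ))

    // Group all quantities per product into an Iterable.
    val grouped = sales.groupByKey()

    grouped.collect().foreach { case (product, quantities) =>
      println(s"$product -> ${quantities.mkString(", ")}")
    }
    // Example output (order of keys may vary):
    //   apple -> 2, 5
    //   orange -> 1, 4
    //   banana -> 3

    spark.stop()
  }
}
```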

PySpark groupByKey Example
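And the equivalent sketch in PySpark, again with made-up data:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("GroupByKeyExample")
         .master("local[*]")
         .getOrCreate())
sc = spark.sparkContext

# (product, quantity) pairs, made up for illustration.
sales = sc.parallelize([("apple", 2), ("orange", 1), ("apple", 5),
                        ("banana", 3), ("orange", 4)])

# groupByKey returns (key, iterable-of-values) pairs;
# mapValues(list) materializes each iterable for printing.
for product, quantities in sales.groupByKey().mapValues(list).collect():
    print(product, "->", quantities)

spark.stop()
```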

Stay tuned for more interesting posts!
