Apache Spark groupBy Example

Spark's groupBy can be compared with the GROUP BY clause of SQL. In Spark, groupBy is a transformation operation. Let's get an overview first, and then we'll understand this operation through some examples in Scala, Java and Python.

The Spark RDD groupBy function returns an RDD of grouped items.

The groupBy function is defined in Spark's RDD class. It is a transformation operation, which means it follows lazy evaluation. We need to pass a function (which defines the group key for an element); it is applied to the source RDD and creates a new RDD containing the individual groups and the list of items in each group. This is a wide operation, as data shuffling may happen across partitions.

This operation returns a new RDD made up of a KEY (the group) and the list of items in that group (in the form of an Iterable). The order of elements within a group may not be the same when you apply the same operation on the same RDD over and over. Additionally, this operation can be computationally costly when you perform aggregation on the grouped items.

Apache Spark groupBy Example

In the above image you can see that RDD X contains different words, spread across 2 partitions. groupBy accepts the function word => word.charAt(0), which returns the first character of each word; this character is used as the group key. The output RDD Y contains the group (the first character of the word) as the key and an Iterable of all words belonging to that group (all words starting with that character).

Important points

  • groupBy is a transformation operation in Spark, hence its evaluation is lazy
  • It is a wide operation, as it shuffles data from multiple partitions and creates another RDD
  • This operation is costly, as it doesn't use a combiner local to a partition to reduce the data transfer
  • It is not recommended when you need to do further aggregation on grouped data; prefer reduceByKey or aggregateByKey in that case

This function has three variants:

  1. groupBy(function)
  2. groupBy(function, numPartitions)
  3. groupBy(function, partitioner)
  • The first variant generates hash-partitioned output using the existing number of partitions
  • The second variant generates hash-partitioned output with the number of partitions given by numPartitions
  • The third variant generates output partitioned by the Partitioner object passed as partitioner
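The three call forms can be sketched on an RDD of strings as follows. This is a minimal sketch, assuming a SparkContext named sc (for example, the one provided by spark-shell); the variable names are illustrative:

```scala
import org.apache.spark.HashPartitioner

// Assumes `sc` is an existing SparkContext, e.g. from spark-shell
val words = sc.parallelize(Seq("Joseph", "Jimmy", "Tina", "Thomas"))

// 1. groupBy(function): default partitioning
val byInitial   = words.groupBy(word => word.charAt(0))

// 2. groupBy(function, numPartitions): hash-partitioned into 4 partitions
val byInitial4  = words.groupBy((word: String) => word.charAt(0), 4)

// 3. groupBy(function, partitioner): explicit Partitioner object
val byInitialHP = words.groupBy((word: String) => word.charAt(0),
                                new HashPartitioner(4))
```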

Let's look at some examples.

Spark groupBy Example Using Scala
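A minimal Scala sketch of the grouping shown in the image above, assuming a SparkContext named sc (e.g. from spark-shell). The word list is illustrative; group order in the output may vary between runs:

```scala
// Create an RDD of words with 3 partitions
val x = sc.parallelize(
  Seq("Joseph", "Jimmy", "Tina", "Thomas", "James",
      "Cory", "Christine", "Jackeline", "Juan"), 3)

// Group the words by their first character
val y = x.groupBy(word => word.charAt(0))

// Each element is a pair: (first character, Iterable of words)
y.collect().foreach(println)
// prints pairs such as (T,CompactBuffer(Tina, Thomas)); group order may vary
```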

Spark groupBy Example Using Java 8
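The same example in Java 8, using a lambda as the grouping function. This is a sketch assuming a local Spark installation (the app name and master URL are illustrative); JavaRDD.groupBy returns a JavaPairRDD of key and Iterable:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class GroupByExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("groupBy-example")
                .setMaster("local[*]");   // illustrative: run locally on all cores
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Create an RDD of words with 2 partitions
            JavaRDD<String> x = sc.parallelize(
                    Arrays.asList("Joseph", "Jimmy", "Tina", "Thomas"), 2);

            // Group the words by their first character
            JavaPairRDD<Character, Iterable<String>> y =
                    x.groupBy(word -> word.charAt(0));

            // Each element is a (first character, Iterable of words) pair
            y.collect().forEach(System.out::println);
        }
    }
}
```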

PySpark groupBy Example
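And the same grouping in Python. This sketch assumes a working PySpark installation; it uses a local SparkContext, and the word list is illustrative:

```python
from pyspark import SparkContext

# Illustrative local context; in spark-submit or pyspark shell, `sc` already exists
sc = SparkContext("local[*]", "groupBy-example")

# Create an RDD of words with 2 partitions
x = sc.parallelize(["Joseph", "Jimmy", "Tina", "Thomas"], 2)

# Group the words by their first character
y = x.groupBy(lambda word: word[0])

# Each element is a (first character, iterable of words) pair;
# sort the words for stable display, since order within a group may vary
for key, words in y.collect():
    print(key, sorted(words))

sc.stop()
```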
