The Apache Spark groupByKey example is quite similar to reduceByKey. It is again a transformation operation and also a wide operation, because it demands a data shuffle. The groupByKey function takes key-value pairs (K, V) as input and produces an RDD of keys paired with the collection of their values. Let's try to understand the function in detail. At the end of this post we'll also compare it with reduceByKey with respect to optimization technique.
The Spark RDD groupByKey function collects the values for each key in the form of an iterable.
As the name suggests, the groupByKey function in Apache Spark just groups all values belonging to a single key. Unlike reduceByKey, it doesn't perform any operation on the final output; it simply groups the data and returns it as an iterable per key. It is a transformation operation, which means its evaluation is lazy.
Because the values for a single key can sit in any partition of the source RDD, this function requires shuffling all data for the same key to a single partition, unless your source RDD is already partitioned by key (see the sketch below). This shuffling is what makes the transformation a wide transformation.
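As a minimal sketch of that exception (assuming a spark-shell session where `sc` is in scope, with illustrative data), pre-partitioning the pair RDD by key lets groupByKey reuse the existing partitioner instead of shuffling again:

```scala
import org.apache.spark.HashPartitioner

// Pre-partition the pair RDD by key (this step shuffles once)
val pairs = sc.parallelize(Seq(("USA", 1), ("India", 2), ("USA", 3)))
val prePartitioned = pairs.partitionBy(new HashPartitioner(4))

// groupByKey picks up the existing HashPartitioner, so the grouped RDD
// is built without a second shuffle
val grouped = prePartitioned.groupByKey()
```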
It is slightly different from the groupBy() transformation: groupByKey() requires a key-value pair RDD, whereas with groupBy() the source RDD may or may not have keys. The groupBy() transformation also needs a function to derive the key from each element, which is not needed in the case of the Spark groupByKey() function. The short sketch below contrasts the two.
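A minimal sketch of the difference, assuming a spark-shell session where `sc` is in scope (the sample data is illustrative):

```scala
// groupBy needs a function that derives the key from each element
val words = sc.parallelize(Seq("spark", "scala", "shark", "kafka"))
val byFirstChar = words.groupBy(word => word.charAt(0))
// byFirstChar: RDD[(Char, Iterable[String])]

// groupByKey works only on an RDD of (K, V) pairs; the key is already present
val pairs = sc.parallelize(Seq(("s", 1), ("k", 2), ("s", 3)))
val byKey = pairs.groupByKey()
// byKey: RDD[(String, Iterable[Int])]
```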

Important Points
- Apache Spark groupByKey is a transformation operation, hence its evaluation is lazy
- It is a wide operation, as it shuffles data from multiple partitions and creates another RDD
- This operation is costly, as it doesn't use a combiner local to each partition to reduce the data transferred
- It is not recommended when you need to do further aggregation on the grouped data; reduceByKey is usually the better choice there (see the sketch after this list)
- By default, groupByKey results in hash-partitioned RDDs
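As a rough sketch of that recommendation (assuming a spark-shell session; the data is illustrative): summing values per key via groupByKey ships every single value across the network, while reduceByKey combines values inside each partition first.

```scala
val sales = sc.parallelize(Seq(("USA", 1), ("USA", 2), ("India", 4), ("India", 9)))

// groupByKey: all values for a key are shuffled, then summed on the reducer side
val sumViaGroup = sales.groupByKey().mapValues(_.sum)

// reduceByKey: partial sums are computed per partition (map-side combine),
// so much less data is shuffled for the same result
val sumViaReduce = sales.reduceByKey(_ + _)
```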
This function has three variants:
- groupByKey()
- groupByKey(numPartitions)
- groupByKey(partitioner)
- The first variant groups the values for each key in the RDD into a single sequence
- The second variant accepts the number of partitions for the resulting RDD
- The third variant uses a custom partitioner to create the partitions of the resulting RDD (see the sketch after the Scala example below)
Spark groupByKey Example Using Scala
```scala
// Basic groupByKey example in Scala
val x = sc.parallelize(Array(("USA", 1), ("USA", 2), ("India", 1),
                             ("UK", 1), ("India", 4), ("India", 9),
                             ("USA", 8), ("USA", 3), ("India", 4),
                             ("UK", 6), ("UK", 9), ("UK", 5)), 3)
// x: org.apache.spark.rdd.RDD[(String, Int)] =
//   ParallelCollectionRDD[0] at parallelize at <console>:24

// groupByKey with default partitions
val y = x.groupByKey
// y: org.apache.spark.rdd.RDD[(String, Iterable[Int])] =
//   ShuffledRDD[1] at groupByKey at <console>:26

// Check partitions
y.getNumPartitions
// res0: Int = 3

// With predefined partitions
val y = x.groupByKey(2)
// y: org.apache.spark.rdd.RDD[(String, Iterable[Int])] =
//   ShuffledRDD[3] at groupByKey at <console>:26

y.getNumPartitions
// res1: Int = 2

y.collect
// res2: Array[(String, Iterable[Int])] =
//   Array((UK,CompactBuffer(1, 6, 9, 5)),
//         (USA,CompactBuffer(1, 2, 8, 3)),
//         (India,CompactBuffer(1, 4, 9, 4)))
```
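The example above covers the first two variants. As a minimal sketch of the third variant, continuing the same spark-shell session and using a HashPartitioner purely as an illustrative choice:

```scala
import org.apache.spark.HashPartitioner

// groupByKey with an explicit partitioner
val z = x.groupByKey(new HashPartitioner(2))
// z: org.apache.spark.rdd.RDD[(String, Iterable[Int])]

z.getNumPartitions
// expected: Int = 2
```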
PySpark groupByKey Example
```python
## Basic groupByKey example in Python
x = sc.parallelize([("USA", 1), ("USA", 2), ("India", 1),
                    ("UK", 1), ("India", 4), ("India", 9),
                    ("USA", 8), ("USA", 3), ("India", 4),
                    ("UK", 6), ("UK", 9), ("UK", 5)], 3)

## groupByKey with default partitions
y = x.groupByKey()

## Check partitions
print('Output: ', y.getNumPartitions())
## Output: 3

## With predefined partitions
y = x.groupByKey(2)
print('Output: ', y.getNumPartitions())
## Output: 2

## Print output
for t in y.collect():
    print(t[0], [v for v in t[1]])
## USA [1, 2, 8, 3]
## India [1, 4, 9, 4]
## UK [1, 6, 9, 5]
```