Apache Spark flatMap Example

Spark flatMap example is mostly similar operation with RDD map operation. It is also defined in RDD abstract class of spark core library and same as map it also is a transformation kind of operation hence it is lazily evaluated.

Spark RDD flatMap function returns a new RDD by first applying a function to all elements of this RDD, and then flattening the results.

Spark flatMap is a transformation operation of RDD which accepts a function as an argument. Same as flatMap, this function will be applied to the source RDD and eventually each elements of the source RDD and will create a new RDD as a resulting values. One step more than RDD map operation, it accepts the argument function which returns array, list or sequence of elements instead of a single element. As a final result it flattens all the elements of the resulting RDD in case individual elements are in form of list, array, sequence or any such collection. Let’s check it’s behavior from following image.

Apache Spark flatMap Example
Apache Spark flatMap Example

As you can see in above image RDD X is the source RDD and RDD Y is a resulting RDD. As per our typical word count example in Spark, RDD X is made up of individual lines/sentences which is distributed in various partitions, with the flatMap transformation we are extracting separate array of words from sentence. But instead of array flatMap function will return the RDD with individual words rather than RDD with array of words.

Important points to note are,

  • flatMap is a transformation operation in Spark hence it is lazily evaluated
  • It is a narrow operation as it is not shuffling data from one partition to multiple partitions
  • Output of flatMap is flatten
  • flatMap parameter function should return array, list or sequence (any subtype of scala.TraversableOnce)

Let’s take some examples,

Spark flatMap Example Using Scala

Spark flatMap Example Using Java 8

PySpark flatMap Example

References:

Leave a Reply

Your email address will not be published. Required fields are marked *