Apache Spark map Example

In this Apache Spark map example, we’ll learn the ins and outs of the map function. map is defined in the abstract class RDD in Spark, and it is a transformation, which means it is a lazy operation. Let’s explore it in detail.

Spark’s RDD map function returns a new RDD by applying a function to every element of the source RDD.

Spark’s map is a transformation that accepts a function as an argument. That function is applied to each element of the source RDD, and the resulting values form a new RDD. Let’s have a look at the following image to understand it better.

[Figure: Apache Spark map example — each element of RDD X is mapped by the supplied function to an element of RDD Y]

As you can see in the image above, RDD X is the source RDD and RDD Y is the resulting RDD. Recalling the word count example in Spark: RDD X holds the distributed array of words, and with the map transformation we pair each element with the integer 1, creating a tuple like (word, 1).

Important points to note:

  • map is a transformation operation in Spark, hence it is lazily evaluated
  • It is a narrow operation, because it does not shuffle data from one partition to multiple partitions
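The laziness of map can be illustrated with a toy model in plain Python. This is only a simplified sketch of the idea, not Spark’s actual implementation: a transformation merely records the function to apply, and nothing runs until an action such as collect is called.

```python
# Toy model of Spark's lazy map (a sketch, NOT real Spark):
# map() only records the function; collect() triggers the computation.

class ToyRDD:
    def __init__(self, data, pending=None):
        self.data = data
        self.pending = pending or []   # recorded (not yet executed) transformations

    def map(self, f):
        # Transformation: record f and return a new "RDD"; nothing is computed here
        return ToyRDD(self.data, self.pending + [f])

    def collect(self):
        # Action: only now are the recorded functions applied, element by element
        out = self.data
        for f in self.pending:
            out = [f(e) for e in out]
        return out

x = ToyRDD(["spark", "rdd", "example"])
y = x.map(lambda w: (w, 1))   # lazy: no work done yet
print(y.collect())            # [('spark', 1), ('rdd', 1), ('example', 1)]
```

Because each output element depends only on one input element, real Spark can apply map within each partition independently, which is why it is a narrow operation.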

Let’s take some examples,

Spark map Example Using Scala
// Basic map example in scala
scala> val x = sc.parallelize(List("spark", "rdd", "example",  "sample", "example"), 3)
scala> val y = x.map(x => (x, 1))
scala> y.collect
res0: Array[(String, Int)] = Array((spark,1), (rdd,1), (example,1), (sample,1), (example,1))

// rdd y can be rewritten with shorter syntax in scala as
scala> val y = x.map((_, 1))
scala> y.collect
res0: Array[(String, Int)] = Array((spark,1), (rdd,1), (example,1), (sample,1), (example,1))

// Another example: making a tuple with a string and its length
scala> val y = x.map(x => (x, x.length))
scala> y.collect
res0: Array[(String, Int)] = Array((spark,5), (rdd,3), (example,7), (sample,6), (example,7))
Spark map Example Using Java 8
// Basic map example in Java 8
package com.backtobazics.sparkexamples;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class MapExample {
    public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext();
        // Parallelized with 2 partitions
        JavaRDD<String> x = sc.parallelize(
                Arrays.asList("spark", "rdd", "example", "sample", "example"), 2);

        // Word Count Map Example
        JavaRDD<Tuple2<String, Integer>> y1 = x.map(e -> new Tuple2<>(e, 1));
        List<Tuple2<String, Integer>> list1 = y1.collect();
        // Another example: making a tuple with a string and its length
        JavaRDD<Tuple2<String, Integer>> y2 = x.map(e -> new Tuple2<>(e, e.length()));
        List<Tuple2<String, Integer>> list2 = y2.collect();

        sc.close();
    }
}

The above example is a full Java class, since Java does not have a REPL for Spark.

Spark map Example Using Python
# Basic map example in python
>>> x = sc.parallelize(["spark", "rdd", "example", "sample", "example"], 2)
>>> y = x.map(lambda x: (x,1))
>>> y.collect()
[('spark', 1), ('rdd', 1), ('example', 1), ('sample', 1), ('example', 1)]

# Another example: making a tuple with a string and its length
>>> y = x.map(lambda x: (x,len(x)))
>>> y.collect()
[('spark', 5), ('rdd', 3), ('example', 7), ('sample', 6), ('example', 7)]

These are very basic examples; we’ll see more such examples in upcoming posts.

