Spark's groupBy can be compared with the GROUP BY clause of SQL. In Spark, groupBy is a transformation operation. Let's start with an overview, then we'll work through this operation with examples in Scala, Java and Python.
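As a rough illustration of the grouping idea, here is a minimal plain-Python sketch that needs no Spark installation. The helper `group_by` and the sample words are made up for this example; Spark's own groupBy works on a distributed RDD and returns pairs of key and grouped values, which this single-machine version only imitates.

```python
from collections import defaultdict

def group_by(items, key_func):
    """Hypothetical helper: collect items into lists keyed by key_func(item)."""
    groups = defaultdict(list)
    for item in items:
        groups[key_func(item)].append(item)
    return dict(groups)

# Sample data invented for the illustration.
words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# Group the words by their first letter, much like GROUP BY on a derived column.
grouped = group_by(words, lambda w: w[0])
print(grouped)
```

The key function plays the same role as the column named in a SQL GROUP BY: every element that produces the same key ends up in the same group.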
Spark's flatMap is very similar to the RDD map operation. It is also defined in the RDD abstract class of the Spark core library and, like map, it is a transformation, so it is lazily evaluated.
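The difference between the two is easiest to see side by side. The following is a plain-Python sketch, not Spark code: the sample lines are invented, and `itertools.chain.from_iterable` stands in for the flattening step that flatMap performs.

```python
from itertools import chain

# Sample input invented for the illustration.
lines = ["hello world", "spark flatMap example"]

# map-like behaviour: one output element per input element (here, a list of words).
mapped = [line.split(" ") for line in lines]

# flatMap-like behaviour: the per-element results are flattened into one sequence.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

With map the result has as many elements as the input, while with flatMap each input element may contribute zero, one or many output elements.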
We have already discussed Spark RDDs in my post Apache Spark RDD : The Bazics. In this post we'll learn about Spark RDD operations in detail. As we know, a Spark RDD is a distributed collection of data, and it supports two kinds of operations: transformations and actions.
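The split between transformations and actions can be imitated with Python's lazy `map`. This is only an analogy under the assumption that a transformation describes work without doing it and an action forces the work to run; the function `double` and the `calls` list are hypothetical names used for the demonstration.

```python
calls = []  # records every element that actually gets processed

def double(x):
    calls.append(x)
    return x * 2

numbers = [1, 2, 3]

# "Transformation": builds a lazy pipeline, nothing is computed yet.
doubled = map(double, numbers)
calls_before = len(calls)

# "Action": consuming the pipeline triggers the computation.
result = list(doubled)
calls_after = len(calls)

print(calls_before, result, calls_after)
```

Until the result is requested, the pipeline holds only a description of the computation, which is the intuition behind lazily evaluated transformations.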
RDD stands for Resilient Distributed Dataset. An Apache Spark RDD is an abstract representation of data that is divided into partitions and distributed across the cluster. If you are familiar with the collections framework in Java, then you can think of an RDD as a Java collection object, except that it is divided into many small pieces (referred to as partitions) and distributed across multiple nodes.
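To make the idea of partitions concrete, here is a small plain-Python sketch that splits one collection into contiguous pieces and computes a total from per-piece results. The function `split_into_partitions` is a hypothetical helper, the chunking rule is an assumption chosen for simplicity, and everything runs on one machine, so this shows only the shape of the idea, not how Spark actually places data on nodes.

```python
def split_into_partitions(data, num_partitions):
    """Hypothetical helper: cut data into num_partitions contiguous chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions

data = list(range(10))
partitions = split_into_partitions(data, 3)

# Each partition can be processed independently; the partial results are then combined.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)

print(partitions, partial_sums, total)
```

Because each piece is processed on its own, the per-partition work could in principle run on different nodes, with only the small partial results brought together at the end.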
The Apache Spark architecture makes it possible to write computation applications that are almost 10x faster than traditional Hadoop MapReduce applications. We have already discussed the features of Apache Spark in the introductory post.
Before an introduction to Apache Spark, it is necessary to understand why Apache Spark is needed in the first place. So let's rewind to the earlier architecture of distributed data processing for big data analytics. The most famous model for large-scale data processing is Hadoop MapReduce. Hadoop MapReduce solves certain problems of distributed computation, but it has its own limitations when it comes to data scale and processing time.