Spark’s groupBy can be compared with the GROUP BY clause of SQL. In Spark, groupBy is a transformation operation. Let’s start with an overview, then we’ll work through this operation with examples in Scala, Java and Python.
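To make the SQL comparison concrete, here is a minimal sketch of the grouping idea in plain Python. It is only an analogy and does not use Spark; the helper name `group_by` and the sample words are made up for illustration.

```python
from collections import defaultdict


def group_by(items, key_fn):
    """Collect items under the key returned by key_fn (hypothetical helper)."""
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    return dict(groups)


# Sample data, assumed for illustration only.
words = ["spark", "scala", "java", "python", "sql"]

# Group the words by their first letter.
grouped = group_by(words, lambda w: w[0])
print(grouped)
```

In PySpark the analogous call on an RDD would be `rdd.groupBy(lambda w: w[0])`, which yields pairs of a key and an iterable of the matching elements.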
Spark’s flatMap is similar to the RDD map operation. It is also defined in the RDD abstract class of the Spark core library and, like map, it is a transformation, so it is lazily evaluated.
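The difference between map and flatMap is easiest to see side by side. The sketch below imitates both in plain Python, without Spark, on sample lines assumed for illustration: map keeps one output per input, while flatMap flattens the per-input results into a single sequence.

```python
from itertools import chain

# Sample input, assumed for illustration only.
lines = ["hello spark", "flat map"]

# map-like: one output element (a list of words) per input line.
mapped = [line.split(" ") for line in lines]

# flatMap-like: the per-line lists are flattened into one list of words.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)
print(flat_mapped)
```

On an RDD in PySpark, the corresponding calls would be `rdd.map(lambda l: l.split(" "))` and `rdd.flatMap(lambda l: l.split(" "))`.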
We covered a high-level view of YARN Architecture in my post on Understanding Hadoop 2.x Architecture, but YARN itself is a wider subject. Keeping that in mind, we’ll discuss YARN Architecture, its components and its advantages in this post.
Before learning the concepts of Hadoop 2.x Architecture, I strongly recommend that you read my post on Hadoop Core Components, the internals of Hadoop 1.x Architecture and its limitations. It will give you an idea of why the Hadoop 2 architecture was needed. We have already learnt about the basic Hadoop components such as Name Node, Secondary Name Node, Data Node, Job Tracker and Task Tracker.
Before configuring Hive with a MySQL metastore, let’s cover some important points about Apache Hive and its metastore. By default, the Apache Hive metastore is configured with a Derby database, but that setting is recommended only for testing or ad-hoc development. When Hive is used in production, its metastore should be configured in a database such as MySQL or Postgres.
We have already discussed Spark RDDs in my post Apache Spark RDD : The Bazics. In this post we’ll learn about Spark RDD operations in detail. As we know, a Spark RDD is a distributed collection of data, and it supports two kinds of operations: transformations and actions.
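The split between transformations and actions can be imitated in plain Python. This is only an analogy, not Spark code: Python's built-in lazy `map` stands in for a transformation, `list(...)` stands in for an action, and the `calls` log and `double` function are hypothetical names used to show when work actually happens.

```python
# Log of every element that has actually been processed.
calls = []


def double(x):
    """Record the call, then double the value (hypothetical example function)."""
    calls.append(x)
    return x * 2


data = [1, 2, 3]

# "Transformation": builds a lazy pipeline, nothing is computed yet.
pipeline = map(double, data)
calls_before = list(calls)  # snapshot taken before any action runs

# "Action": forces evaluation and returns a concrete result.
result = list(pipeline)

print(calls_before, result, calls)
```

Spark RDDs behave in a similar spirit: transformations such as `map` only describe the computation, and an action such as `collect()` triggers it.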