We have already discussed Spark RDD in the post Apache Spark RDD: The Basics. In this post we'll learn about Spark RDD operations in detail. As we know, a Spark RDD is a distributed collection of data, and it supports two kinds of operations on it: transformations and actions.
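To make the distinction concrete, here is a minimal sketch, assuming an existing SparkContext named sc (as provided by the spark-shell):

```scala
// Create an RDD from a local collection.
val numbers = sc.parallelize(1 to 10)

// Transformations are lazy: they only describe a new RDD.
val doubled = numbers.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// Actions trigger the actual computation and return a result.
val total = evens.reduce(_ + _)   // runs the whole lineage
println(total)
```

Nothing is computed until the action (reduce) is called; the two transformations merely build up the lineage that the action then executes.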
RDD stands for Resilient Distributed Dataset. An Apache Spark RDD is an abstract representation of data that is divided into partitions and distributed across the cluster. If you are familiar with the collections framework in Java, you can think of an RDD as being like a Java collection object, except that it is divided into small pieces (referred to as partitions) that are distributed across multiple nodes.
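A small sketch can make the partitioning visible; this again assumes a SparkContext named sc is already available:

```scala
// Ask Spark to split the data into 4 partitions.
val data = sc.parallelize(1 to 100, numSlices = 4)

println(data.getNumPartitions)   // => 4

// glom() collects the elements of each partition into an array,
// exposing how the dataset is physically split.
data.glom().collect().foreach(part => println(part.mkString(", ")))
```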
The Apache Spark architecture enables you to write computation applications that run nearly 10x faster than traditional Hadoop MapReduce applications. We have already discussed the features of Apache Spark in the introductory post.
Before introducing Apache Spark, it is necessary to understand why Apache Spark is actually needed. So let's rewind to the earlier architecture of distributed data processing for big data analytics. The most famous framework for large-scale data processing is Hadoop MapReduce. Hadoop MapReduce solves certain problems of distributed computation, but it has its own limitations when it comes to data scale and processing time.
Before we learn how to install Apache Hive on CentOS, let me give you a short introduction to it. Hive is essentially a data warehouse tool for storing and processing structured data residing on HDFS. Hive was originally developed at Facebook, was later moved to the Apache Software Foundation, and became the open source Apache Hive project.
Being a purely object-oriented language, Scala treats each and every value as an object; even all of the primitive data types of Scala are objects. And as I mentioned in my earlier post on Scala Methods, there are no native operators in Scala. Instead, Scala uses corresponding methods that look like operators.
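A quick illustration: the familiar arithmetic syntax is just sugar for method calls, and the same mechanism works for user-defined types (the Meters class below is purely illustrative):

```scala
val sum1 = 1 + 2        // infix operator notation
val sum2 = (1).+(2)     // exactly the same call, written explicitly

println(sum1 == sum2)   // true

// User-defined types can provide operator-like methods the same way.
case class Meters(value: Double) {
  def +(other: Meters): Meters = Meters(value + other.value)
}
println(Meters(1.5) + Meters(2.5))   // Meters(4.0)
```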
The word “Scala” is a short form of the term “Scalable Language”. Technically, Scala programming is a combination of two programming approaches: “Object Oriented Programming” and “Functional Programming”. The object-oriented approach makes it easy to build large applications with substantial structure and architecture, while the functional approach keeps the design modular and pluggable. This is how the Scala language is empowered by the fusion of both of these approaches.
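A tiny sketch of that fusion (the class and the sample values here are invented for illustration):

```scala
// Object-oriented side: a typed, named structure.
case class Employee(name: String, salary: Double)

val staff = List(Employee("Asha", 50000), Employee("Ravi", 65000))

// Functional side: pure functions passed to higher-order methods.
val raised  = staff.map(e => e.copy(salary = e.salary * 1.1))
val payroll = raised.map(_.salary).sum

println(payroll)
```

The data is modeled with classes, while the behavior is composed from functions; neither style gets in the other's way.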
In the previous post we discussed what Hadoop is and what kinds of issues it solves. Now let's dive deep into the various components of Hadoop. Hadoop as a distribution provides only two core components: HDFS (the Hadoop Distributed File System) and MapReduce (a distributed batch processing framework). A complete set of machines running HDFS and MapReduce together is known as a Hadoop cluster.
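As a small taste of the HDFS side, here is a minimal sketch in Scala using the standard Hadoop FileSystem API, assuming the hadoop-client library is on the classpath and core-site.xml points at your cluster:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsList {
  def main(args: Array[String]): Unit = {
    // Picks up fs.defaultFS from the core-site.xml on the classpath.
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    // List the entries directly under the HDFS root directory.
    fs.listStatus(new Path("/")).foreach { status =>
      println(s"${if (status.isDirectory) "dir " else "file"} ${status.getPath}")
    }
  }
}
```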