Category Archives: Big Data

Apache Spark RDD : The Basics

RDD stands for Resilient Distributed Dataset. An Apache Spark RDD is an abstract representation of data that is divided into partitions and distributed across the cluster. If you are familiar with the collections framework in Java, you can think of an RDD as similar to a Java collection object, except that it is divided into small pieces (referred to as partitions) that are distributed across multiple nodes.
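To make the idea of partitions more concrete, here is a minimal plain-Python sketch. It does not use Spark or its API. The helper names (`split_into_partitions`, `map_partitions`, `collect`) are hypothetical and only imitate the concept: a collection is cut into pieces, each piece can be processed independently (on a real cluster, on different nodes), and the results are gathered back.

```python
from typing import Callable, List


def split_into_partitions(data: List[int], num_partitions: int) -> List[List[int]]:
    """Cut a list into num_partitions contiguous, roughly equal chunks."""
    size, remainder = divmod(len(data), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        # The first `remainder` partitions each take one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def map_partitions(partitions: List[List[int]],
                   func: Callable[[int], int]) -> List[List[int]]:
    """Apply func to every element, one partition at a time.

    Each partition is handled independently of the others; in a real
    cluster this is the work that different nodes would do in parallel.
    """
    return [[func(x) for x in part] for part in partitions]


def collect(partitions: List[List[int]]) -> List[int]:
    """Gather all partitions back into a single local list."""
    return [x for part in partitions for x in part]


numbers = list(range(1, 11))                     # the "collection"
partitions = split_into_partitions(numbers, 3)   # pretend: 3 nodes
squared = map_partitions(partitions, lambda x: x * x)
result = collect(squared)

print(partitions)
print(result)
```

In this sketch everything runs in one process; the point is only that no partition needs data from another partition to do its share of the work, which is what makes distributing them across nodes possible.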

Introduction to Apache Spark

Before introducing Apache Spark, it is necessary to understand why Apache Spark is needed. So let’s rewind to the earlier architecture of distributed data processing for big data analytics. The best-known framework for large-scale data processing is Hadoop MapReduce. Hadoop MapReduce solves certain problems in distributed computation, but it has its own limitations when it comes to data scale and processing time.

Simple explanation of Hadoop Core Components : HDFS and MapReduce

In earlier posts we discussed what Hadoop is and what kinds of problems it solves. Now let’s dive deep into the various components of Hadoop. The Hadoop distribution as a whole provides only two core components: HDFS (the Hadoop Distributed File System) and MapReduce (a distributed batch processing framework). A group of machines running HDFS and MapReduce is known as a Hadoop cluster.
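The division of labor between the two components can be illustrated with a small word-count sketch in plain Python. It does not use Hadoop's API: the nested list `blocks` only stands in for file blocks that HDFS would store on different nodes, and the function names (`map_phase`, `shuffle`, `reduce_phase`) are hypothetical labels for the stages of a MapReduce job.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


# Stand-in for a file split into blocks stored on different nodes.
blocks = [
    ["hadoop stores data in hdfs", "hdfs splits files into blocks"],
    ["mapreduce processes data in batches"],
]

# Each block is mapped independently; here we simply loop over them.
mapped = [pair for block in blocks for line in block for pair in map_phase(line)]
grouped = shuffle(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(word_counts)
```

The storage side (which block lives where) corresponds to HDFS, while the map, shuffle, and reduce steps correspond to the MapReduce side.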

As you add more nodes to a Hadoop cluster, its storage and processing capacity increases, which means that Hadoop is horizontally scalable.