Apache Spark RDD Operations: Transformation and Action

We have already discussed about Spark RDD in my post Apache Spark RDD : The Bazics. In this post we’ll learn about Spark RDD Operations in detail. As we know Spark RDD is distributed collection of data and it supports two kind of operations on it Transformations and Actions. Continue reading “Apache Spark RDD Operations: Transformation and Action”

Apache Spark RDD : The Bazics

RDD stands for Resilient Distributed Dataset. Apache Spark RDD is an abstract representation of the data which is divided into the partitions and distributed across the cluster. If you are aware about collection framework in Java than you can consider an RDD same as the Java collection object but here it is divided into various small pieces (referred as partitions) and is distributed across multiple nodes. Continue reading “Apache Spark RDD : The Bazics”

Introduction to Apache Spark

Prior to Introduction to Apache Spark, it is necessary that we understand the actual requirement of Apache Spark. So let’s rewind to the earlier architecture of distributed data processing for big data analytics. And the most famous algorithm for large scale data processing is Hadoop MapReduce. Hadoop MapRecuce solves certain problems for distributed computation but it has it’s own limitations when it comes to data scale and processing time. Continue reading “Introduction to Apache Spark”

Building Spark Application JAR using Scala and SBT

Normally we create Spark Application JAR using Scala and SBT (Scala Build Tool). In my previous post on Creating Multi-node Spark Cluster we have executed a word count example using spark shell. As an extension to that, we’ll learn about How to create Spark Application JAR file with Scala and SBT? and How to execute it as a Spark Job on Spark Cluster? Continue reading “Building Spark Application JAR using Scala and SBT”

6 Steps to Setup Apache Spark 1.0.1 (Multi Node Cluster) on CentOS

Before we move ahead lets learn a bit on Setup Apache Spark,

So, What is Apache Spark?

Apache Spark is a fast, real time and extremely expressive computing system which executes job in distributed (clustered) environment.

It is quite compatible with Apache Hadoop and more almost 10x faster than Hadoop MapReduce on Disk Computing and 100x faster using in memory computations. It provides rich APIs in Java, Scala and Python along with Functional Programming capabilities. Continue reading “6 Steps to Setup Apache Spark 1.0.1 (Multi Node Cluster) on CentOS”