Whenever we hear about Java 8 Streams the first thing which strike in mind would be Java I/O stream and classes like
OutputStream. But here in Java 8 Stream is quite different concept than Java I/O Streams. Java 8 Stream best implementation of functional programming which can be used easily with Java Collection Framework.
Java 8 Functional Interface concept was introduced by Java to support Lambda Expressions which are very similar to Scala with the fact that you can assign Lambda Expression to a Functional Interface. Let’s explore the details.
Prior to the configuration of Hive with MySQL metastore, let’s know some important things about Apache Hive and it’s metastore. Apache Hive Metastore is normally configured with Derby Database. But that setting is recommended just for the testing or ad-hoc development purpose. When hive is used in production, its metastore should be configured in databases like MySQL or Postgres.
We have already discussed about Spark RDD in my post Apache Spark RDD : The Bazics. In this post we’ll learn about Spark RDD Operations in detail. As we know Spark RDD is distributed collection of data and it supports two kind of operations on it Transformations and Actions.
RDD stands for Resilient Distributed Dataset. Apache Spark RDD is an abstract representation of the data which is divided into the partitions and distributed across the cluster. If you are aware about collection framework in Java than you can consider an RDD same as the Java collection object but here it is divided into various small pieces (referred as partitions) and is distributed across multiple nodes.
Apache Spark architecture enables to write computation application which are almost 10x faster than traditional Hadoop MapReuce applications. We have already discussed about features of Apache Spark in the introductory post.
Prior to Introduction to Apache Spark, it is necessary that we understand the actual requirement of Apache Spark. So let’s rewind to the earlier architecture of distributed data processing for big data analytics. And the most famous algorithm for large scale data processing is Hadoop MapReduce. Hadoop MapRecuce solves certain problems for distributed computation but it has it’s own limitations when it comes to data scale and processing time.
Before we learn to install Apache Hive on CentOS let me give you the introduction of it. Hive is basically a data warehouse tool to store and process the structured data residing on HDFS. Hive was developed by Facebook and than after it is shifted to Apache Software Foundation and became an open source Apache Hive.
Normally we create Spark Application JAR using Scala and SBT (Scala Build Tool). In my previous post on Creating Multi-node Spark Cluster we have executed a word count example using spark shell. As an extension to that, we’ll learn about How to create Spark Application JAR file with Scala and SBT? and How to execute it as a Spark Job on Spark Cluster?
Hadoop 1.x Architecture is a history now because in most of the Hadoop applications are using Hadoop 2.x Architecture. But still understanding of Hadoop 1.x Architecture will provide us the insights of how hadoop has evolved over the time.