In any data science or data analysis work, the first step is often to read a CSV file with the pandas library. The pandas read_csv function is the usual way to load a CSV file into a dataframe. In this post we’ll explore the various options of the pandas read_csv function.
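As a small taste of those options, here is a minimal sketch. The CSV content, column names and the "missing" marker are made up for illustration, and an in-memory string stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a path such as "data.csv".
csv_text = "id;name;score\n1;Ann;9.5\n2;Bob;missing\n"

df = pd.read_csv(
    io.StringIO(csv_text),   # any file path or file-like object works here
    sep=";",                 # field delimiter (the default is a comma)
    index_col="id",          # use the "id" column as the row index
    na_values=["missing"],   # extra strings to treat as NaN
)

print(df)
```

Because "missing" is mapped to NaN, the score column is parsed as floats rather than strings.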
The combineByKey RDD transformation is very similar to the combiner in Hadoop MapReduce programming. In this post, we’ll discuss a Spark combineByKey example in depth and try to understand the importance of this function in detail.
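To make the combiner analogy concrete, here is a plain-Python sketch of what combineByKey does. It is not Spark code: combine_by_key is a hypothetical helper, and the nested lists only imitate RDD partitions. The three function arguments mirror the createCombiner, mergeValue and mergeCombiners arguments of the real transformation.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Imitate combineByKey on a list of partitions (hypothetical helper, not Spark)."""
    # Step 1: combine values inside each partition, like a map-side combiner.
    per_partition = []
    for part in partitions:
        combined = {}
        for key, value in part:
            if key in combined:
                combined[key] = merge_value(combined[key], value)
            else:
                combined[key] = create_combiner(value)
        per_partition.append(combined)

    # Step 2: merge the per-partition combiners for each key.
    result = {}
    for combined in per_partition:
        for key, comb in combined.items():
            if key in result:
                result[key] = merge_combiners(result[key], comb)
            else:
                result[key] = comb
    return result


# Made-up data: two "partitions" of (key, value) pairs.
partitions = [[("a", 1), ("b", 4), ("a", 3)], [("a", 5), ("b", 6)]]

# Build a (sum, count) pair per key, then derive the average.
sum_count = combine_by_key(
    partitions,
    lambda v: (v, 1),                         # first value seen for a key
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # fold another value into the pair
    lambda x, y: (x[0] + y[0], x[1] + y[1]),  # merge pairs across partitions
)
averages = {key: total / count for key, (total, count) in sum_count.items()}
print(averages)
```

With a real RDD of pairs, the same three functions would be passed to rdd.combineByKey(...).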
Transposing a numpy array is extremely simple using the np.transpose function. Fundamentally, transposing a numpy array only makes sense when the array has 2 or more dimensions.
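A short sketch with made-up arrays shows both points: a 2-D array swaps its axes, a 1-D array comes back unchanged, and for higher dimensions the optional axes argument controls the new order.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
transposed = np.transpose(matrix)           # shape (3, 2), same as matrix.T

vector = np.array([1, 2, 3])
same_vector = np.transpose(vector)          # a 1-D array is returned unchanged

tensor = np.arange(24).reshape(2, 3, 4)
reversed_axes = np.transpose(tensor)                 # axes reversed: (4, 3, 2)
swapped = np.transpose(tensor, axes=(1, 0, 2))       # only first two swapped: (3, 2, 4)

print(transposed.shape, same_vector.shape, reversed_axes.shape, swapped.shape)
```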
In this post, we’ll learn to create a pandas dataframe from Python lists and dictionary objects. Creating a pandas dataframe is a fairly simple and basic step in data analysis. There are also other ways to create a dataframe (e.g. from CSV or Excel files, or even from database queries), but we’ll cover those in other posts.
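The three most common shapes of input can be sketched as follows. The names and ages are made-up sample data.

```python
import pandas as pd

# From a list of rows; column names are supplied separately.
rows = [["Ann", 28], ["Bob", 35]]
df_from_lists = pd.DataFrame(rows, columns=["name", "age"])

# From a dictionary of columns; keys become column names.
columns = {"name": ["Ann", "Bob"], "age": [28, 35]}
df_from_dict = pd.DataFrame(columns)

# From a list of dictionaries; each dictionary is one row.
records = [{"name": "Ann", "age": 28}, {"name": "Bob", "age": 35}]
df_from_records = pd.DataFrame(records)

print(df_from_lists)
```

All three calls build the same two-row dataframe, so the choice mostly depends on how the data is already laid out.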
In Python, reshaping a numpy array can be critical when creating a matrix or tensor from vectors. To reshape a one-dimensional numpy array into n dimensions, one can use the np.reshape() function. Let’s check out some simple examples.
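For instance, a 12-element vector (made up for illustration) can be turned into a matrix or a 3-D tensor, as long as the total number of elements stays the same:

```python
import numpy as np

vector = np.arange(12)                 # 0, 1, ..., 11

matrix = np.reshape(vector, (3, 4))    # 3 rows, 4 columns
tensor = vector.reshape(2, 3, 2)       # the array method form does the same job
auto = vector.reshape(-1, 6)           # -1 lets numpy infer that dimension (2 here)

print(matrix)
print(tensor.shape, auto.shape)
```

Asking for a shape whose size does not match, such as (5, 3) here, raises a ValueError.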
This post will give you hands-on practice with creating numpy arrays. By the end of the post, you will have clarity on the different ways of creating numpy arrays, with helpful visualizations. If you are a beginner in the data analytics or data science field, you need an in-depth understanding of Python’s numpy package.
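As a quick preview, here is a sketch of a few common constructors. The values are arbitrary and the list is not exhaustive.

```python
import numpy as np

from_list = np.array([[1, 2], [3, 4]])   # from a (nested) Python list
zeros = np.zeros((2, 3))                 # 2x3 array filled with 0.0
ones = np.ones(4)                        # 1-D array of four 1.0 values
steps = np.arange(0, 10, 2)              # start, stop (exclusive), step
evenly = np.linspace(0, 1, 5)            # 5 evenly spaced points from 0 to 1
identity = np.eye(3)                     # 3x3 identity matrix

print(steps, evenly)
```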
In this Spark aggregateByKey example post, we will discover how aggregateByKey can be a better alternative to the groupByKey transformation when an aggregation operation is involved. The most common problem when working with key-value pairs is grouping values and aggregating them with respect to a common key. The Spark aggregateByKey transformation addresses this problem in a very intuitive way.
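The idea can be sketched in plain Python. This is not Spark code: aggregate_by_key is a hypothetical helper, and the nested lists only imitate RDD partitions. Its arguments mirror the zero value, sequence function and combine function that aggregateByKey expects.

```python
def aggregate_by_key(partitions, zero_value, seq_op, comb_op):
    """Imitate aggregateByKey on a list of partitions (hypothetical helper, not Spark)."""
    # Within each partition, fold every value into an accumulator
    # that starts from zero_value for each key.
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = seq_op(acc.get(key, zero_value), value)
        partials.append(acc)

    # Across partitions, merge the accumulators that share a key.
    merged = {}
    for acc in partials:
        for key, partial in acc.items():
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


# Made-up data: two "partitions" of (key, value) pairs.
partitions = [[("a", 3), ("b", 7), ("a", 9)], [("a", 1), ("b", 2)]]

# Track (min, max) per key without ever materialising the full list of values.
min_max = aggregate_by_key(
    partitions,
    (float("inf"), float("-inf")),
    lambda acc, v: (min(acc[0], v), max(acc[1], v)),
    lambda x, y: (min(x[0], y[0]), max(x[1], y[1])),
)
print(min_max)
```

Only one small accumulator per key leaves each partition, which is why this pattern tends to be cheaper than grouping all values first and aggregating afterwards.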
The Apache Spark groupByKey example is quite similar to the reduceByKey one. It is again a transformation operation, and also a wide operation because it requires a data shuffle. The Spark groupByKey function takes key-value pairs (K, V) as input and produces an RDD of keys, each paired with its values. Let’s try to understand the function in detail. At the end of this post, we’ll also compare it with reduceByKey with respect to optimization.
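The optimization difference can be illustrated with a plain-Python sketch. This is not Spark code: both helpers are hypothetical, the nested lists imitate partitions, and the "shuffled" count is a simplified stand-in for the records that would cross the network.

```python
from collections import defaultdict

# Made-up data: two "partitions" of (word, 1) pairs.
partitions = [[("a", 1), ("b", 1), ("a", 1)], [("a", 1), ("b", 1), ("b", 1)]]


def group_by_key(partitions):
    """Imitate groupByKey: every record is shuffled, then grouped by key."""
    shuffled = [pair for part in partitions for pair in part]
    groups = defaultdict(list)
    for key, value in shuffled:
        groups[key].append(value)
    return dict(groups), len(shuffled)


def reduce_by_key(partitions, func):
    """Imitate reduceByKey: reduce inside each partition first, then shuffle."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())
    result = {}
    for key, value in shuffled:
        result[key] = func(result[key], value) if key in result else value
    return result, len(shuffled)


groups, grouped_shuffle = group_by_key(partitions)
counts, reduced_shuffle = reduce_by_key(partitions, lambda x, y: x + y)
print(groups, grouped_shuffle)
print(counts, reduced_shuffle)
```

In this toy run the grouped version moves all six records, while the reduced version moves only one record per key per partition.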
A Spark groupBy example can also be compared with the GROUP BY clause of SQL. In Spark, groupBy is a transformation operation. Let’s start with an overview, and then we’ll understand this operation through some examples in Scala, Java and Python.