Before this post we have discussed about what is Hadoop and what kind of issues are solved by Hadoop. Now Let’s deep dive in to various components of Hadoop. Hadoop as a whole distribution provides only two core components and HDFS (which is Hadoop Distributed File System) and MapReduce (which is a distributed batch processing framework). And a complete bunch of machines which are running HDFS and MapReduce are known as Hadoop Cluster.
As you add more nodes in Hadoop Cluster the performance of your cluster will increase which means that Hadoop is Horizontally Scalable.
HDFS – Hadoop Distributed File System (Storage Component)
HDFS is a distributed file system which stores the data in distributed manner. Rather than storing a complete file it divides a file into small blocks (of 64 or 128 MB size) and distributes them across the cluster. Each blocks is replicated(3 times as per default configuration) multiple times and is stored on different nodes to ensure data availability. Normally HDFS can be installed on native file systems like xfs, ext3 or ext4 (Similar to Unix/Linux file systems).
It cannot be installed on Windows directly. To run HDFS on your Windows machine You need to use tools like Cygwin.
You can write file and read file from HDFS. You cannot updated any file on HDFS. Recently Hadoop has added the support of appending content to the file which was not there in previous releases.
Here are some examples of HDFS commands.
Get list of all HDFS directories under
$ hdfs dfs -ls /user/root
Create a directory on HDFS under
$ hdfs dfs -mkdir /user/root/backtobazics
Copy file from current local directory to HDFS directory
$ hdfs dfs -copyFromLocal ./readme.txt /user/root/backtobazics/
View content of file from HDFS directory
$ hdfs dfs -cat /user/root/backtobazics/readme.txt
Delete a file or directory from HDFS directory
$ hdfs dfs -rm /user/root/backtobazics/readme.txt
MapReduce (Data Processing Component)
MapReduce is the algorithm of executing any task on distributed system. Using MapReduce one can process a large file in parallel manner. MapReduce framework executes any task on different nodes as full file is distributed across the cluster in a form of various blocks.
It has two phases, Map(Mapper Task) and Reduce (Reducer Task)
- Each of these tasks would run on individual blocks of the data
- First mapper task would take each line of elements as an input and generates intermediate key value pairs
- Each mapper task is executed on a single block of data
- Than reducer task will take list of key value pairs for same keys, process the data and generates the final output
- A phase called shuffle and sort will take place between mapper and reducer task will send the data to proper reducer tasks
- Shuffle process maps the mapper output with the same key to the collection of values as a value
- For example (key1, val1) and (key1, val2) will be converted to (key1, [val1, val2])
- The mapper and reducer tasks would in parallel
- The reducer tasks can start their word as soon as mapper tasks are completed
Lets understand MapReduce algorithm using a simple example of word count,
We have a file with below content,
Map Reduce is easy to learn.
Map Reduce is simple.
So Input to mapper task would be below (key, value) pair,
(2103, "Map Reduce is easy to learn")
(2130, "Map Reduce is simple")
Note: Here key is file offset, most of the time it is not used
Now our mapper task will extract individual words from the line and will generate list of (key , value) pairs,
So Output of mapper tasks will be,
Now this data is shuffled and sorted by key. So output after shuffle and sort process will be,
Above output will be the input of reducer tasks which will aggregate the values with the same keys and generate the below output,
Finally these output will be written into a file.
I hope you find this article useful in refer my next post for understanding Hadoop Daemons.