Hadoop 1.x Architecture is history now, because most Hadoop applications are using the Hadoop 2.x Architecture. Still, understanding the Hadoop 1.x Architecture gives us insight into how Hadoop has evolved over time. As explained in my post related to Hadoop Core Concepts, Hadoop mainly provides distributed storage (HDFS) and a distributed computation engine (MapReduce) to solve certain problems of the Big Data world. Both of these core components have their own set of processes (daemons).
HDFS – Hadoop Distributed File System
HDFS in Hadoop 1.x mainly has three daemons: Name Node, Secondary Name Node and Data Node.
Name Node
- Only a single instance of this process runs on a cluster, and it runs on a master node
- It is responsible for managing the metadata about files distributed across the cluster
- It manages information like the location of file blocks across the cluster and their permissions
- This process reads all the metadata from a file named fsimage and keeps it in memory
- After this process is started, it updates metadata for newly added or removed files in RAM
- It periodically writes the changes to a file called edits, as edit logs
- This process is the heart of HDFS; if it is down, HDFS is no longer accessible (a small client-side sketch of querying this metadata follows below)
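To make the Name Node's role concrete, here is a minimal client-side sketch (not part of the original post) using the Hadoop 1.x FileSystem API. The Name Node host, port and file path are hypothetical; the point is that permissions, replication and block locations are all answered from the Name Node's in-memory metadata.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataExample {
  public static void main(String[] args) throws Exception {
    // fs.default.name is the Hadoop 1.x property that points the client at the Name Node.
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode-host:8020"); // hypothetical host

    FileSystem fs = FileSystem.get(conf);

    // Every call below is answered from the Name Node's metadata;
    // no Data Node is contacted for this information.
    FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt")); // hypothetical path
    System.out.println("Owner: " + status.getOwner()
        + ", permissions: " + status.getPermission()
        + ", replication: " + status.getReplication());

    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Block at offset " + block.getOffset()
          + " stored on " + Arrays.toString(block.getHosts()));
    }
  }
}
```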
Secondary Name Node
- For this daemon too, only a single instance of this process runs on a cluster
- This process can run on the master node (for smaller clusters) or on a separate node (in larger clusters), depending on the size of the cluster
- One misinterpretation of the name is that this is a backup Name Node, but IT IS NOT!
- It manages the metadata for the Name Node, in the sense that it reads the information written in the edit logs (by the Name Node) and creates an updated file of the current cluster metadata
- Then it transfers that file back to the Name Node so that the fsimage file can be updated
- So, whenever the Name Node daemon is restarted, it can always find updated information in fsimage (the sketch below shows the settings that control this checkpoint cycle)
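As a small illustration of how this checkpoint cycle is controlled, the sketch below reads the Hadoop 1.x checkpoint properties from the cluster configuration. The property names and defaults are the standard Hadoop 1.x ones (fs.checkpoint.period, fs.checkpoint.size), but treat the exact values as assumptions about your own hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
  public static void main(String[] args) {
    // Loads core-site.xml / hdfs-site.xml from the classpath if present.
    Configuration conf = new Configuration();

    // The Secondary Name Node merges the edits log into a new fsimage either
    // every fs.checkpoint.period seconds or once the edits log grows beyond
    // fs.checkpoint.size bytes, whichever comes first.
    long periodSeconds = conf.getLong("fs.checkpoint.period", 3600);
    long sizeBytes = conf.getLong("fs.checkpoint.size", 67108864L);

    System.out.println("Checkpoint every " + periodSeconds + " seconds");
    System.out.println("...or when the edits log reaches " + sizeBytes + " bytes");
  }
}
```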
Data Node
- There are many instances of this process running on various slave nodes (referred to as Data Nodes)
- It is responsible for storing the individual file blocks on the slave nodes in the Hadoop cluster
- Based on the replication factor, a single block is replicated to multiple slave nodes (only if the replication factor is > 1) to prevent data loss (see the sketch after this list)
- Whenever required, this process handles access to a data block by communicating with the Name Node
- This process periodically sends heartbeats to the Name Node to make it aware that the slave process is running
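To illustrate the replication factor mentioned above, here is a minimal sketch that sets the default replication for new files and changes it for an existing file. The Name Node address and the file path are hypothetical; the API calls are the standard Hadoop 1.x FileSystem ones.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode-host:8020"); // hypothetical host

    // dfs.replication is the default replication factor applied to new files;
    // each block of such a file ends up on that many different Data Nodes.
    conf.setInt("dfs.replication", 3);

    FileSystem fs = FileSystem.get(conf);

    // Replication can also be changed per file after it has been written.
    Path existing = new Path("/user/demo/sample.txt"); // hypothetical path
    fs.setReplication(existing, (short) 2);
  }
}
```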
Hadoop 1.x has a Single Point Of Failure
As per the above description, HDFS has only one Name Node, so if that process or machine goes down, the complete cluster goes down. That is why the Name Node in Hadoop 1.x is considered a Single Point of Failure.
MapReduce – Distributed Computation
As explained in my post Simple explanation of Hadoop Core Components, the MapReduce component is responsible for distributed computing on a Hadoop cluster. It uses two daemons named Job Tracker and Task Tracker. The Hadoop MapReduce algorithm has two phases: Map and Reduce.
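To make the Map and Reduce phases concrete, below is a minimal word-count sketch written against the old org.apache.hadoop.mapred API that Hadoop 1.x uses. It is an illustration added here, not code from the original post; word count is just the conventional example.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Map phase: runs in parallel on the slave nodes, one task per input split.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      // Emit (word, 1) for every word in the input line.
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for one word and sums them up.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}
```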
Job Tracker
- Only one instance of this process runs on a master node, same as the Name Node
- Any MapReduce job is submitted to the Job Tracker first (see the driver sketch after this list)
- The Job Tracker checks the locations of the various file blocks used in the MapReduce processing
- Then it initiates separate tasks on the various Data Nodes (where the blocks are present) by communicating with the Task Tracker daemons
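Continuing the word-count illustration, a driver like the one below is what actually submits a job to the Job Tracker. JobClient picks up the Job Tracker address from the cluster configuration (the mapred.job.tracker property); WordCount.Map and WordCount.Reduce are the hypothetical classes from the earlier sketch.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(WordCount.Map.class);
    conf.setReducerClass(WordCount.Reduce.class);

    // Input and output HDFS paths are passed on the command line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submits the job to the Job Tracker and waits for it to complete.
    JobClient.runJob(conf);
  }
}
```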
Task Tracker
- This process has multiple instances running on the slave nodes (typically on the slave nodes where the Data Node process is running)
- It receives the job information from the Job Tracker daemon and initiates a task on that slave node
- In most cases, the Task Tracker initiates the task on the same node where the physical data block is present
- Same as the Data Node daemon, this process also periodically sends heartbeats to the Job Tracker to make it aware that the slave process is running
Full picture of Hadoop 1.x Architecture
So, after this small overview, if we look at the Hadoop 1.x architecture as a whole, it will look like the image below.
Limitations of Hadoop 1.x Architecture
- As we have seen in the above description, the major drawback of the Hadoop 1.x Architecture is the Single Point of Failure, as there is no backup Name Node.
- Job scheduling, resource management and job monitoring are all done by the Job Tracker, which is tightly coupled with Hadoop MapReduce. So the Job Tracker is not able to manage resources outside Hadoop.
- The Job Tracker has to coordinate with all Task Trackers, so in a very big cluster it becomes difficult to manage a huge number of Task Trackers altogether.
- Due to the single Name Node, there is no support for multiple namespaces in Hadoop 1.x. So everything is managed under a single namespace.
- Using the Hadoop 1.x architecture, a Hadoop cluster can be scaled up to ~4000 nodes. Scaling beyond that may cause performance degradation and an increasing task failure ratio.
I hope you liked this information. Stay tuned for more interesting posts, and please let me know your thoughts by writing comments below.