Apache Hadoop solves very different kind of problems in Big Data world. So before we get an introduction of Hadoop it becomes necessary to understand the core problems in large scale computation. Than after we’ll try to understand how Hadoop solves these problems.
So let’s discuss the pain points first…..
What is the requirement of distributed systems?
In answer to above question, I have listed couple of points below,
Growing data size
- Today many organizations and companies are generating far more data than earlier
- Organizations like Twitter and Facebook are generating data at a rate of terabytes per day
So Problem is: How to process a data with the rate of TB/day?
Problems in processing of huge amount of data
- Our traditional model of programming and computation is designed to process relatively small amount of data
- Computation time is also highly dependent on CPU processing power and RAM
- Whenever we require more processing power we have to upgrade CPU and RAM and improve the scalability (Vertical Scalability)
So Problem is: How to deal with vertical scalability issue?
Complexity in distributed processing
In order to deal with vertical scalability issue we can use distributed system for computation which allows us to use multiple machines to solve any computation problem. But ordinary distributed processing also has below issues,
- Data synchronization between machines
- Data needs to be transferred to all compute nodes from storage node(SAN : Storage Area Network) for computation purpose so more bandwidth is required
- Failure of one or more components
So Problem is: How to over come above limitations?
So many question…..!!!!! But Don’t worry….
We have following point which will give us the answer of all above questions.
New Model for Distributed Computing
To handle all complexities and solve problems described in previous point we require a new architecture for distributed computing. Let me lead you towards the primary requirements of such distributed system architecture.
It means, this architecture should be capable enough to dynamically add a new computation node in order to scale the system for more computations and/or for more computation performance without system failure. Adding more nodes means more scalability.
Partial Failure instead of Full System Failure
The architecture should be capable to handle the failure of one or more nodes from cluster. It would decrease computational capabilities of the overall system but should not result in full system failure.
Recovery of Data
Any node failure in system should not result in data loss. That means, architecture should be capable to replicate the data and recover it in case of partial system failure. At the same time it should also be capable enough to shift the computation of the dead node to any living node at runtime if failure occures while any computation is in progress.
Recovery of Node
If any node is failed and recovered after sometime, it should also be able to add that recovered component without full system restart.
Output of your job should be consistent in normal execution as well as in case of partial system failure.
I think now you have enough knowledge in order to know what is Hadoop and how it works. So let’s get started on that…..
What is Hadoop?
Hadoop is the open source project which takes care of all the above points for distributed computing. It is completely based on the concept of Google File System and MapReduce. The core concept of Hadoop is to divide data into tiny chunks and distribute them to all cluster nodes. And when computation process is executed on the cluster it process the data chunk which resides on that node. This approach is called “Data Locality” which reduces the consumption of cluster bandwidth by reducing the data transfers across the nodes which was done in our traditional distributed processing model.
Thanks for your time. And Stay tuned for next article on Hadoop Architecture and its core elements…..!!!!!