In my earlier post about Brief Introduction of Hadoop, we have understood “What is Hadoop and What kind of problems it solves”. The next step is to understand Hadoop Core Concepts which talks more about,
- Distributed system design
- How data is distributed across multiple systems
- What are the different components involved and how they communicate with each others
Now let’s deep dive and learn about core concepts of Hadoop and it’s architecture.
- Hadoop Architecture distributes data across the cluster nodes by splitting it into small blocks (64 MB or 128 MB depending upon the configurations). Every node tries to work on individual block stored locally. Hence no data transfer is required during the processing most of the time.
- Every time when Hadoop process the data, each node connects to other node as much less as possible. So most of the time node only deals with the data which is stored locally. This concept is known as “Data Locality”.
- As we know data is distributed across the nodes, so in order to increase data availability each block of the data is replicated (as per the configuration) over different nodes. This method would help Hadoop to handle partial failures in the cluster.
- Whenever MapReduce job (consist to typically two tasks Map Task and Reduce Task) is executed, Map tasks are executed on individual data blocks on each node (in most of the cases) and leverage “Data Locality”. This is how multiple nodes process data in parallel manner. It makes processing faster than traditional distributed systems. I’ll Explain you MapReduce in my upcoming posts.
- If any node fails in between, the master will detect this failure and assign the same task to another node where the replica of the same data block is available.
- If any failed node restarts, it automatically joins back to the cluster and than after master can assign it a new task whenever is required.
- In the execution process of any job, if master detects any task running slowly on any node compare to other nodes, it will allocate the same redundant task on other node to make overall execution faster. This process is known as “Speculative Execution“
- The Hadoop jobs are written in a high level language like Java and Python, and the developer do not have any control over network programming or component level programming. You just have to focus on core logic other things would be taken care by Hadoop framework itself.
That’s it for now. In my next post related Hadoop Core Components we’ll learn more details.