Setup Multi Node Hadoop 2.6.0 Cluster with YARN

Today is the era of parallel computation, and whenever we talk about processing very large chunks of data, the first word that comes to everyone’s mind is HADOOP. Apache Hadoop sits at the top of the Apache project list. In this post I’ll walk you through all the steps of setting up a basic multi node Hadoop cluster (we’ll set up a two-node cluster).

Here I have used two machines for the cluster setup; you can repeat the slave node steps on more machines to build a bigger Hadoop cluster.

Before starting, I assume you have gone through the checklist below. If any of these points are new to you, I recommend reading up on them first.

  • Prepare new Machines or VMs with CentOS installed (I have used CentOS 6.4)
  • Set up static IPs and proper FQDNs
  • Make sure all machines have proper IP and HOSTNAME entries in /etc/hosts
  • Set up passwordless SSH from the master node to the slave nodes
  • Make sure that IPv6 is disabled on all nodes

Step 1 : Disable IPv6 on CentOS nodes (if your network supports IPv6)

If your node supports IPv6 then I would recommend disabling IPv6 by editing the /etc/sysctl.conf file, as Hadoop is not supported on IPv6 networks. Append the following to the end of the file.
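For reference, the commonly used entries look like this (key names can vary slightly between kernel versions):

    # disable IPv6 for all interfaces
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1

Run sysctl -p afterwards (or reboot) so the change takes effect.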

Read more about this on the Hadoop IPv6 wiki.

So let’s get started…..

Step 2 : Download Hadoop 2.6.0 and extract it to /opt/ directory on Master Node

I have used the following machines.

Master: 192.168.1.10 – master.backtobazics.com
Slave 1: 192.168.1.11 – slave1.backtobazics.com

Below are the commands.
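A typical download-and-extract sequence on the master node would be something like this (the URL follows the standard Apache archive layout; pick a closer mirror if you prefer):

    cd /opt
    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
    tar -xzf hadoop-2.6.0.tar.gz

This leaves the distribution under /opt/hadoop-2.6.0, which is the path used throughout the rest of this post.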

Step 3 : Configure Variables and Reload the Configuration

Set the environment variables used by Hadoop by editing the /etc/profile file and appending the following values at the end of the file.
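A minimal sketch of the entries, assuming Hadoop lives under /opt/hadoop-2.6.0 and the JDK path mentioned in the comment thread below (adjust both to your installation):

    export JAVA_HOME=/usr/java/jdk1.8.0_60/
    export HADOOP_HOME=/opt/hadoop-2.6.0
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin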

Reload the configuration using the command below.
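The usual way is to source the file in the current shell:

    source /etc/profile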

Step 4 : Setting up Hadoop Environment

Create the Hadoop data directories.
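The paths below are only an example (they must match what you put in hdfs-site.xml in the next step):

    mkdir -p /opt/hadoop-2.6.0/data/namenode
    mkdir -p /opt/hadoop-2.6.0/data/datanode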

Now edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set the JAVA_HOME environment variable to your JDK base directory path.
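For example, with the JDK path that also appears in the comments below:

    export JAVA_HOME=/usr/java/jdk1.8.0_60/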

Step 5 : Edit Hadoop XML Configuration files

Edit the configuration files located in the $HADOOP_HOME/etc/hadoop/ directory with the following basic configurations.

hdfs-site.xml
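A minimal version, assuming the example data directories from Step 4 and a replication factor of 2 for this two-node cluster:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///opt/hadoop-2.6.0/data/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///opt/hadoop-2.6.0/data/datanode</value>
      </property>
    </configuration>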


core-site.xml
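A minimal version pointing the default filesystem at the NameNode on the master (port 9000 is a common choice, not a requirement):

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master.backtobazics.com:9000</value>
      </property>
    </configuration>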


mapred-site.xml
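This file normally has to be created first by copying mapred-site.xml.template; a minimal version simply tells MapReduce to run on YARN:

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>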

yarn-site.xml
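A minimal version that points the NodeManagers at the ResourceManager on the master and enables the MapReduce shuffle service:

    <configuration>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master.backtobazics.com</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>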

Append the hostnames of all the slave nodes to the $HADOOP_HOME/etc/hadoop/slaves file. In my case it would be as shown below.
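In this setup both machines run a DataNode (see the Web UI list at the end of this post), so the slaves file would contain both hostnames:

    master.backtobazics.com
    slave1.backtobazics.com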

Step 6 : Setting up Slave nodes

To set up the slave node, repeat steps 2 to 5, or copy the /opt/hadoop-2.6.0 directory to the slave node (see the example below) and repeat steps 3 and 4 there, keeping the directory structure the same.
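For example, copying over the passwordless SSH set up earlier (assuming the same user on both nodes):

    scp -r /opt/hadoop-2.6.0 slave1.backtobazics.com:/opt/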

That’s it…..!!!!! We are done with the installation of Hadoop (distributed mode) with YARN on multiple nodes. 🙂

We need to format the Hadoop NameNode using the command below before starting the Hadoop cluster.
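On the master node (run this only once, before the first start):

    $HADOOP_HOME/bin/hdfs namenode -format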

Step 7 : Commands for starting and stopping Hadoop Cluster

Start/stop HDFS using the commands below.
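From the master node:

    $HADOOP_HOME/sbin/start-dfs.sh   # starts the NameNode, SecondaryNameNode and DataNodes
    $HADOOP_HOME/sbin/stop-dfs.sh    # stops them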

Start/stop the YARN services using the commands below.
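Also from the master node:

    $HADOOP_HOME/sbin/start-yarn.sh   # starts the ResourceManager and NodeManagers
    $HADOOP_HOME/sbin/stop-yarn.sh    # stops them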

Step 8 : Port Filtering in the Firewall by updating entries in the /etc/sysconfig/iptables file

Append the lines below to the /etc/sysconfig/iptables file.
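For example, rules that open the Web UI ports listed at the end of this post (place them above the final REJECT rule; the internal RPC ports, such as the NameNode port from core-site.xml, may also need to be opened between the nodes):

    -A INPUT -m state --state NEW -m tcp -p tcp --dport 50070 -j ACCEPT
    -A INPUT -m state --state NEW -m tcp -p tcp --dport 50075 -j ACCEPT
    -A INPUT -m state --state NEW -m tcp -p tcp --dport 50090 -j ACCEPT
    -A INPUT -m state --state NEW -m tcp -p tcp --dport 8088 -j ACCEPT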

Then restart iptables.
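On CentOS 6 that is:

    service iptables restart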

Performing the above step on both servers will open up HTTP access to the Web UIs of the Hadoop processes.

Note : Instead of performing step 8, you can also disable the iptables service on all machines using the following commands.
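For CentOS 6 the commands would be (the second one keeps the service disabled across reboots):

    service iptables stop
    chkconfig iptables off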

Now you can access the Hadoop services in a browser.

Name Node: http://master.backtobazics.com:50070/
YARN Services: http://master.backtobazics.com:8088/
Secondary Name Node: http://master.backtobazics.com:50090/
Data Node 1: http://master.backtobazics.com:50075/
Data Node 2: http://slave1.backtobazics.com:50075

10 comments

  1. Nice Blog Varun !!

  2. Hi,

    Thank you for the steps. The details are very useful and clear.

    I had a problem with the installation as it was giving me the error “JAVA_HOME not set could not be found”.

    I resolved this by modifying JAVA_HOME in the /opt/hadoop-2.6.0/etc/hadoop/hadoop-env.sh file.

    Below is my entry, which worked:

    export JAVA_HOME=/usr/java/jdk1.8.0_60/

    1. Thank you for reading Srinath..

  3. Hi Varun,

    Great tutorial! Helped me a lot to understand the Hadoop environment.
    Will this setup be compatible with running spark-submit tasks?
    Specifically, pyspark programs I have written in Python.

    Thanks a lot

    1. Thanks Eli… And yes, this setup will definitely be compatible with executing spark-submit with any version of Spark, provided that you have used the proper Spark binaries pre-built for the specific Hadoop version. You can check my posts “6 Steps to Setup Apache Spark 1.0.1 (Multi Node Cluster) on CentOS” and “Building Spark Application JAR using Scala and SBT” for more details….

  4. Hi Varun Vyas,

    Thank you. I have seen the 8 pages about Hadoop from the basics, everything is very clear, and I have configured a master with 1 slave. In case I want to add another slave, how do I configure the hdfs-site.xml file?

    1. Hi, thanks for reading this article. In case you want to add a new slave node to your cluster, you need to copy the same slave configuration (along with hdfs-site.xml) mentioned in this post to your new slave node and add one entry with the hostname of your new slave node to the “slaves” file on the master node.

  5. Hi Varun,

    I am a newbie with Spark. I need to submit Spark jobs to a specific slave node. Do you know if I can do that with a YARN or Mesos cluster? I read the Spark options but I could not see how to allocate jobs to a specific machine.

    Thank you.

    1. Hi arwing,

      Can you please elaborate more on your problem? From your comment, what I understood is that you need to know the active slave nodes in your Spark cluster, and you want to launch your job on some specific nodes. Is that correct?

  6. I get strange errors when starting HDFS:

    /opt/hadoop-2.6.0/sbin/start-dfs.sh
    Exception in thread “main” java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
    at org.apache.hadoop.security.Groups.<init>(Groups.java:77)
    at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
    at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
    at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
    at org.apache.hadoop.hdfs.tools.GetConf.run(GetConf.java:314)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.hdfs.tools.GetConf.main(GetConf.java:331)
    Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
    … 12 more
    Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V
    at org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native Method)
    at org.apache.hadoop.security.JniBasedUnixGroupsMapping.<clinit>(JniBasedUnixGroupsMapping.java:49)
    at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.<init>(JniBasedUnixGroupsMappingWithFallback.java:39)
    … 17 more
    Starting namenodes on []
    localhost: starting namenode, logging to /var/log/hadoop/root/hadoop-root-namenode-itprod-dba-test-sa-02.out
    localhost: log4j:ERROR Could not find value for key log4j.appender.RFA
    localhost: log4j:ERROR Could not instantiate appender named “RFA”.
    localhost: log4j:ERROR Could not find value for key log4j.appender.RFA
    localhost: log4j:ERROR Could not instantiate appender named “RFA”.
    localhost: starting datanode, logging to /var/log/hadoop/root/hadoop-root-datanode-itprod-dba-test-sa-02.out
    localhost: log4j:ERROR Could not find value for key log4j.appender.RFA
    localhost: log4j:ERROR Could not instantiate appender named “RFA”.
    localhost: log4j:ERROR Could not find value for key log4j.appender.RFA
    localhost: log4j:ERROR Could not instantiate appender named “RFA”.
    Exception in thread “main” java.lang.RuntimeException: java.lang.reflect.InvocationTargetExceptio
