Setup Multi Node Hadoop 2.6.0 Cluster with YARN

Today is the era of parallel computation, and whenever we talk about processing very large datasets the first word that comes to everyone's mind is HADOOP. Apache Hadoop sits at the top of the Apache project list. In this post I'll walk you through all the steps of setting up a basic multi node Hadoop cluster (we'll set up a two node cluster).

Here I have used two machines for the cluster setup; you can repeat the slave node steps on more machines to build a bigger Hadoop cluster.

Before we start, I assume you have gone through the checklist below; if any of these points are new to you, I recommend reading up on them first.

  • Prepare new Machines or VMs with CentOS installed (I have used CentOS 6.4)
  • Setup with Static IP and proper FQDN
  • Make sure all machines have proper IP and HOSTNAME entries in /etc/hosts
  • Setup passwordless SSH from the master node to all slave nodes (see the example after this list)
  • Make sure that IPv6 is disabled on all nodes
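
For reference, here is a minimal sketch of the /etc/hosts entries and passwordless SSH setup using the hostnames from this post; the short aliases and the root user are my assumptions (adjust to your own network and user).

# /etc/hosts (same entries on every node)
192.168.1.10    master.backtobazics.com    master
192.168.1.11    slave1.backtobazics.com    slave1

# On the master node: generate a key and copy it to every node (including the master itself)
$ ssh-keygen -t rsa
$ ssh-copy-id root@master.backtobazics.com
$ ssh-copy-id root@slave1.backtobazics.com

# Verify login works without a password prompt
$ ssh slave1.backtobazics.com hostname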

Step 1 : Disable IPv6 on CentOS nodes (if your network supports IPv6)

If your nodes support IPv6 then I would recommend disabling it, as Hadoop is not supported on IPv6 networks. Edit the /etc/sysctl.conf file and append the following lines to the end of the file:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
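
After saving the file, reload the kernel parameters and confirm IPv6 is off (the check below should print 1):

$ sysctl -p
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1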

Read more about this on Hadoop IPv6 WIKI.

So let’s get started…..

Step 2 : Download Hadoop 2.6.0 and extract it to /opt/ directory on Master Node

I have used the following machines.

Master: 192.168.1.10 – master.backtobazics.com
Slave 1: 192.168.1.11 – slave1.backtobazics.com

Below are the commands,

$ cd /opt/
$ wget http://apache.bytenet.in/hadoop/common/stable/hadoop-2.6.0.tar.gz
$ tar -xvf hadoop-2.6.0.tar.gz
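
The mirror URL above points at whatever release is currently marked stable; if it no longer serves 2.6.0, the Apache archive keeps every release, for example:

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz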

Step 3 : Configure Variables and Reload the Configuration

Set the environment variables used by Hadoop by appending the following values to the end of the /etc/profile file, using the commands below.

$ echo "" >> /etc/profile
$ echo "### HADOOP Variables ###" >> /etc/profile
$ echo "export HADOOP_HOME=/opt/hadoop-2.6.0" >> /etc/profile
$ echo "export HADOOP_INSTALL=\$HADOOP_HOME" >> /etc/profile
$ echo "export HADOOP_MAPRED_HOME=\$HADOOP_HOME" >> /etc/profile
$ echo "export HADOOP_COMMON_HOME=\$HADOOP_HOME" >> /etc/profile
$ echo "export HADOOP_HDFS_HOME=\$HADOOP_HOME" >> /etc/profile
$ echo "export YARN_HOME=\$HADOOP_HOME" >> /etc/profile
$ echo "export HADOOP_COMMON_LIB_NATIVE_DIR=\$HADOOP_HOME/lib/native" >> /etc/profile
$ echo "export PATH=\$PATH:\$HADOOP_HOME/sbin:\$HADOOP_HOME/bin" >> /etc/profile

Reload the configuration using the command below.

$ source /etc/profile
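
A quick sanity check that the variables took effect and the Hadoop binaries are on the PATH:

$ echo $HADOOP_HOME
/opt/hadoop-2.6.0
$ which hadoop
/opt/hadoop-2.6.0/bin/hadoop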

Step 4 : Setting up Hadoop Environment

Create Hadoop data directories

$ mkdir -p /data/hadoop-data/nn 
$ mkdir -p /data/hadoop-data/snn 
$ mkdir -p /data/hadoop-data/dn 
$ mkdir -p /data/hadoop-data/mapred/system 
$ mkdir -p /data/hadoop-data/mapred/local
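
This post runs everything as root; if you instead run Hadoop as a dedicated user (say, a hadoop user, which is purely an assumption here), give that user ownership of the data directories:

$ chown -R hadoop:hadoop /data/hadoop-data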

Now edit the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file and set the JAVA_HOME environment variable to your JDK base directory path.

export JAVA_HOME=/usr/java/jdk1.8.0_40/
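
If you are not sure where your JDK lives, one way to find it is to resolve the java binary and drop the trailing /bin/java (or /jre/bin/java); the path shown below is just the example used in this post:

$ readlink -f $(which java)
/usr/java/jdk1.8.0_40/bin/java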

Step 5 : Edit Hadoop XML Configuration files

Edit the configuration files located in the $HADOOP_HOME/etc/hadoop/ directory with these very basic configurations.

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///data/hadoop-data/nn</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///data/hadoop-data/dn</value>
    </property>
    <property>
        <name>dfs.namenode.checkpoint.dir</name>
        <value>file:///data/hadoop-data/snn</value>
    </property>
</configuration>

core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master.backtobazics.com:9000</value>
    </property>
</configuration>

mapred-site.xml
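
The stock 2.6.0 tarball ships only mapred-site.xml.template, so create the file from the template first if it does not exist yet:

$ cd $HADOOP_HOME/etc/hadoop
$ cp mapred-site.xml.template mapred-site.xml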

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

yarn-site.xml

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master.backtobazics.com</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>

Append the host names of all the slave nodes to the $HADOOP_HOME/etc/hadoop/slaves file. In my case it would be,

master.backtobazics.com
slave1.backtobazics.com

Step 6 : Setting up Slave nodes

For setting up the slave node, repeat steps 2 to 5, or copy the /opt/hadoop-2.6.0 directory to the slave node (as sketched below) and repeat steps 3 and 4 there, keeping the directory structure the same.
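
A quick way to mirror the install instead of repeating the steps by hand, assuming root SSH access as set up earlier (the remote mkdir recreates the data directories from step 4):

$ scp -r /opt/hadoop-2.6.0 root@slave1.backtobazics.com:/opt/
$ ssh root@slave1.backtobazics.com "mkdir -p /data/hadoop-data/{nn,snn,dn,mapred/system,mapred/local}"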

That's it! We are done with the installation of Hadoop (distributed mode) with YARN on multiple nodes. :)

Prior to starting the Hadoop cluster, we need to format the Hadoop NameNode on the master node using the command below.

$ hdfs namenode -format

Step 7 : Commands for starting and stopping Hadoop Cluster

Start/stop HDFS using the commands below (run them on the master node).

sh $HADOOP_HOME/sbin/start-dfs.sh
sh $HADOOP_HOME/sbin/stop-dfs.sh

Start/stop YARN services using the commands below.

sh $HADOOP_HOME/sbin/start-yarn.sh
sh $HADOOP_HOME/sbin/stop-yarn.sh
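
Once both start scripts finish, running jps on each node should list the expected daemons (process IDs omitted below); the master also runs a DataNode and NodeManager here because it is listed in the slaves file:

$ jps    # on master
NameNode
SecondaryNameNode
DataNode
ResourceManager
NodeManager

$ jps    # on slave1
DataNode
NodeManager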

Step 8 : Open Hadoop Ports in the Firewall

Append the lines below to the /etc/sysconfig/iptables file, placing them above the final REJECT rule so that they take effect.

-A INPUT -m state --state NEW -m tcp -p tcp --dport 50090 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 50105 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 50075 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 50070 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 50475 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 50470 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 8032 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 8030 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 8088 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 8090 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 8031 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 8033 -j ACCEPT

And restart iptables

$ sudo service iptables restart
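
You can confirm that the new rules are active with something like:

$ sudo iptables -L -n | grep -E '50070|8088'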

Performing the above step on both servers will open up HTTP access to the Web UIs of the Hadoop processes.

Note : Instead of performing step 8 you can also disable the iptables service on all machines using the following command.

$ sudo service iptables stop
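
Note that service iptables stop only lasts until the next reboot; on CentOS 6 you would also disable it at boot to keep it off permanently:

$ sudo chkconfig iptables off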

Now you can access the Hadoop services in a browser:

Name Node: http://master.backtobazics.com:50070/
YARN Services: http://master.backtobazics.com:8088/
Secondary Name Node: http://master.backtobazics.com:50090/
Data Node 1: http://master.backtobazics.com:50075/
Data Node 2: http://slave1.backtobazics.com:50075/
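
From the command line, a quick sanity check that both DataNodes registered with HDFS and both NodeManagers registered with YARN (the exact report wording may vary slightly by version):

$ hdfs dfsadmin -report | grep -i "live datanodes"
$ yarn node -list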

10 thoughts on “Setup Multi Node Hadoop 2.6.0 Cluster with YARN”

  1. Hi,

    Thank you for the steps. The details are very useful and clear.

    I had a problem with the installation as it was giving me the error “JAVA_HOME not set could not be found”.

    I resolved this by modifying JAVA_HOME in the /opt/hadoop-2.6.0/etc/hadoop/hadoop-env.sh file.

    Below is my entry, which worked:

    export JAVA_HOME=/usr/java/jdk1.8.0_60/

  2. Hi Varun,

    Great tutorial! It helped me a lot to understand the Hadoop environment.
    Will this setup be compatible with running spark-submit tasks?
    Specifically PySpark programs I have written in Python.

    Thanks a lot

  3. Hi Varun Vyas,

    Thank you, I have gone through 8 pages about Hadoop from the basics and everything is very clear; I have configured a master with 1 slave. In case I want to add another slave, how do I configure the hdfs-site.xml file?

    1. Hi, thanks for reading this article. In case you want to add a new slave node to your cluster, you need to copy the same slave configuration (along with hdfs-site.xml) mentioned in this post to the new slave node and add an entry with the hostname of the new slave node to the “slaves” file on the master node.

  4. Hi Varun,

    I am a newbie with Spark. I need to submit Spark jobs to a specific slave node. Do you know if I can do that with a YARN or Mesos cluster? I read through the Spark options but I could not see how to allocate jobs to a particular machine.

    Thank you.

    1. Hi arwing,

      Can you please elaborate more on your problem? From your comment, what I understood is that you need to know the active slave nodes in your Spark cluster, and you want to launch your job on some specific nodes. Is that correct?

  5. I get strange errors when starting HDFS:

    /opt/hadoop-2.6.0/sbin/start-dfs.sh
    Exception in thread “main” java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
    at org.apache.hadoop.security.Groups.(Groups.java:77)
    at org.apache.hadoop.security.Groups.(Groups.java:73)
    at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
    at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
    at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
    at org.apache.hadoop.hdfs.tools.GetConf.run(GetConf.java:314)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.hdfs.tools.GetConf.main(GetConf.java:331)
    Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:129)
    … 12 more
    Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative()V
    at org.apache.hadoop.security.JniBasedUnixGroupsMapping.anchorNative(Native Method)
    at org.apache.hadoop.security.JniBasedUnixGroupsMapping.(JniBasedUnixGroupsMapping.java:49)
    at org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback.(JniBasedUnixGroupsMappingWithFallback.java:39)
    … 17 more
    Starting namenodes on []
    localhost: starting namenode, logging to /var/log/hadoop/root/hadoop-root-namenode-itprod-dba-test-sa-02.out
    localhost: log4j:ERROR Could not find value for key log4j.appender.RFA
    localhost: log4j:ERROR Could not instantiate appender named “RFA”.
    localhost: log4j:ERROR Could not find value for key log4j.appender.RFA
    localhost: log4j:ERROR Could not instantiate appender named “RFA”.
    localhost: starting datanode, logging to /var/log/hadoop/root/hadoop-root-datanode-itprod-dba-test-sa-02.out
    localhost: log4j:ERROR Could not find value for key log4j.appender.RFA
    localhost: log4j:ERROR Could not instantiate appender named “RFA”.
    localhost: log4j:ERROR Could not find value for key log4j.appender.RFA
    localhost: log4j:ERROR Could not instantiate appender named “RFA”.
    Exception in thread “main” java.lang.RuntimeException: java.lang.reflect.InvocationTargetExceptio
