November 18, 2013

How to Configure Apache Hadoop in Pseudo Distributed Mode

My first tutorial briefly introduced Apache Hadoop, talked about the different modes in which you can execute Hadoop, outlined the prerequisites for setting up Hadoop correctly and explained in detail how to set up Apache Hadoop in Standalone Mode. In this second tutorial I will illustrate the steps required to set up Apache Hadoop in Pseudo Distributed Mode.

Installing & Configuring Hadoop in Pseudo Distributed Mode

Step-1: Configuring master & slave nodes

We will be using two machines, one as the master and the other as a slave. I have used Ubuntu 11.10 in this demonstration and named the two virtual machines Ubuntu1 and Ubuntu2.

  • Change the hostname of each machine using the command:
    $ sudo gedit /etc/hostname
  • Set the hostname of Ubuntu1 to master and of Ubuntu2 to slave. You can verify the hostname by executing the command: $ hostname
  • If the updated name fails to appear, restart the hostname service using the command:
    $ sudo service hostname start
  • Now edit the host entries on both machines, using either command:
    $ sudo gedit /etc/hosts
    $ sudo vi /etc/hosts
  • Add the master and slave machine IPs and names to /etc/hosts:
    192.168.118.149 master
    192.168.118.151 slave


Step-2: Configuring SSH on all nodes (master & slaves)

  • Install SSH on all nodes using the command:
    sudo apt-get install ssh

    (On Ubuntu this installs both the SSH client and server; it is required to connect to the other machines.)

  • Generate an SSH key pair:
    ssh-keygen -t rsa -P ""
    (press Enter when asked for the file name; this generates a passwordless key pair)
  • Now copy the public key of the current machine to authorized_keys. Executing the following command appends the generated public key to the .ssh/authorized_keys file:
    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
  • Verify the SSH configuration using the command:
    ssh localhost

Answering yes will add localhost to the list of known hosts.

Once SSH is successfully configured on all nodes, enable passwordless SSH between the machines by appending each node's public key to the other node's ~/.ssh/authorized_keys (for example with ssh-copy-id girish@slave, substituting your own user name), then confirm connectivity from master to slave and vice versa:

ssh master

Step-3: Install Java on all nodes (master & slaves)

  • First check whether Java is already installed by running the command:
    java -version
  • If Java is not installed or the version is not appropriate, install Java using the command:
    sudo apt-get install openjdk-6-jdk
  • Alternatively, to use Sun Java, first install the software-properties utility and add the PPA that provides it:
    sudo apt-get install python-software-properties
    sudo add-apt-repository ppa:ferramroberto/java
    sudo apt-get update
  • Then install Sun Java and make it the default:
    sudo apt-get install sun-java6-jdk
    sudo update-java-alternatives -s java-6-sun

Step-4: Installing & Configuring Hadoop on all nodes (master & slaves)

  • Download the Hadoop 1.2.1 tarball (hadoop-1.2.1.tar.gz) into your home folder, either with wget or through a browser, from an Apache mirror carrying the 1.2.1 release.

  • Go to your home folder and extract the downloaded Hadoop tar file using the command:
    tar -xzvf hadoop-1.2.1.tar.gz
  • Set HADOOP_HOME to the extracted directory and add its bin directory to your PATH by appending these lines to ~/.bashrc:
    gedit ~/.bashrc
    export HADOOP_HOME=/home/girish/hadoop-1.2.1   (this is just an example; use your own path)
    export PATH=$PATH:$HADOOP_HOME/bin
  • Set JAVA_HOME in conf/hadoop-env.sh. Java is usually installed under /usr/lib/jvm; for the Sun JDK the location should look like /usr/lib/jvm/java-6-sun:
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
  • Change the ownership and permissions of the Hadoop home directory (prefer typing the commands over pasting them, to avoid stray characters):
    chown -R girish hadoop-1.2.1
    chmod -R 755 hadoop-1.2.1

Step-5: Modify Hadoop configuration files (master & slaves)

  • Create a directory hdfs with subdirectories data, name and temp
  • Create a directory tempdir under the home directory of the user you want to run Hadoop as
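The directory layout described above can be created in one go; this sketch assumes the directories live under the Hadoop user's home:

```shell
# Create the HDFS storage layout (data, name, temp) and the temp directory
mkdir -p ~/hdfs/data ~/hdfs/name ~/hdfs/temp
mkdir -p ~/tempdir
```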
  • Update conf/core-site.xml as below:
  • Change the value of <name>hadoop.tmp.dir</name> to /home/girish/tempdir, like below:
    <value>/home/girish/tempdir</value>
  • Change the value of <name>fs.default.name</name> from the default hdfs://localhost:54310 to point at the master, like below:
    <value>hdfs://master:9000</value>
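Putting these changes together, conf/core-site.xml would look roughly like this (fs.default.name is the Hadoop 1.x property name for the default filesystem; the paths follow the values above):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Base directory for Hadoop's temporary files -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/girish/tempdir</value>
  </property>
  <!-- Default filesystem URI, pointing at the NameNode on the master -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```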


Update conf/mapred-site.xml as below:
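For reference, the key Hadoop 1.x setting in conf/mapred-site.xml is mapred.job.tracker, pointing at the JobTracker on the master; the port shown here (9001) is a common choice and an assumption, not necessarily the one from the original setup:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Host and port of the JobTracker (port 9001 is an assumed value) -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>
```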


Update conf/hdfs-site.xml as below:
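A typical Hadoop 1.x conf/hdfs-site.xml for this setup sets the replication factor and points the name and data directories at the hdfs directories created earlier; the exact paths and the replication value below are assumptions matching this tutorial's user and two-node layout:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- One replica per node in this two-node cluster (assumed value) -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- NameNode metadata directory (assumed path) -->
  <property>
    <name>dfs.name.dir</name>
    <value>/home/girish/hdfs/name</value>
  </property>
  <!-- DataNode block storage directory (assumed path) -->
  <property>
    <name>dfs.data.dir</name>
    <value>/home/girish/hdfs/data</value>
  </property>
</configuration>
```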


Update the masters file under the conf directory.

Replace localhost with your master machine's name (in our case the master machine's name is simply master).


Similarly update the slaves file under the conf directory. Here add the names of the nodes you want to act as slave nodes. You can add the master node's name as well; this will run a DataNode on the master too.
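With the hostnames used in this tutorial, and running a DataNode on the master as well, the two files would read as follows. conf/masters:

```text
master
```

conf/slaves:

```text
master
slave
```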


Now copy the configured Hadoop directory from the master node to the other nodes.

To copy from Ubuntu1 to Ubuntu2, use the command below:

scp -r hadoop-1.2.1 girish@slave:/home/girish


Step-6: Format Hadoop NameNode


  • Format the name node on the master machine by executing the below command from the Hadoop home directory:
    $ ~/hadoop-1.2.1/bin/hadoop namenode -format

Step-7: Start Hadoop daemons

Now start the Hadoop daemons using the command below:

$ ~/hadoop-1.2.1/bin/start-all.sh


Step-8: Verify the daemons are running

$ jps  (if jps is not in path, try  /usr/java/latest/bin/jps)

The output will look similar to this:

9316 SecondaryNameNode

9203 DataNode

9521 TaskTracker

9403 JobTracker

9089 NameNode

This shows we have all the daemons running.

Step-9: Verify the NameNode & JobTracker web UIs

Verify the web UIs served by the NameNode and JobTracker at the following URLs:

namenode UI:   http://machine_host_name:50070

job tracker UI:   http://machine_host_name:50030

Substitute machine_host_name with the public IP or hostname of your node, which in our case gives the following:

namenode UI:   http://master:50070

job tracker UI:   http://master:50030

Now you have successfully installed and configured Hadoop in Pseudo Distributed mode.