How to Configure Apache Hadoop in Pseudo Distributed Mode

My first tutorial briefly introduced Apache Hadoop, talked about the different modes in which you can run Hadoop, outlined the prerequisites for setting up Hadoop correctly, and explained in detail how to set up Apache Hadoop in Standalone Mode. In this second tutorial I will illustrate the steps required to set up Apache Hadoop in Pseudo Distributed Mode.

Installing & Configuring Hadoop in Pseudo Distributed Mode

Step-1: Configuring master & slave nodes

We will be using two machines, one as the master and the other as a slave. I have used Ubuntu 11.10 in this demonstration and named one virtual machine Ubuntu1 and the other Ubuntu2.

  • Change the hostname of these machines using the command:
    $ sudo gedit /etc/hostname
  • Give Ubuntu1 the hostname master and Ubuntu2 the hostname slave. You can verify the hostname by executing the command: $ hostname
  • If the updated name does not appear, restart the hostname service using the command:
    $ sudo service hostname start
  • Now edit the host entries on both machines using the command:
    $ sudo gedit /etc/hosts

    or

    $ sudo vi /etc/hosts
  • Add the master and slave machine IPs and hostnames to these files:
    192.168.118.149 master
    192.168.118.151 slave

Step-2: Configuring SSH on all nodes (master & slaves)

  • Install SSH on all nodes using the commands:
    sudo apt-get install ssh
    sudo apt-get install openssh-server

    (this step is required to connect to other machines)

  • Generate an SSH key:
    ssh-keygen -t rsa -P ""
    (press Enter when asked for the file name; this will generate a passwordless SSH key)
  • Now copy the public key (id_rsa.pub) of the current machine to authorized_keys. Executing the following command appends the generated public key to the .ssh/authorized_keys file:
    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
  • Verify the SSH configuration using the command:
    ssh localhost

Typing yes at the prompt will add localhost to the list of known hosts.

Once SSH is successfully configured on all nodes, confirm passwordless SSH connectivity from the master to the slave nodes and vice versa, for example by running this command from the slave:

ssh master
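Note that the cat command above only authorizes a machine to log in to itself. For passwordless login between the two machines, each node's public key must also be appended to the other node's authorized_keys file. A minimal sketch of one way to do this, assuming the user girish and the hostnames configured in Step-1 (ssh-copy-id ships with the OpenSSH client; the key can also be copied manually with scp and cat):

    ssh-copy-id girish@slave     (run on the master: adds the master's key to the slave)
    ssh slave                    (verify passwordless login from master to slave)
    ssh-copy-id girish@master    (run on the slave: adds the slave's key to the master)
    ssh master                   (verify passwordless login from slave to master)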

Step-3: Install Java on all nodes (master & slaves)

  • First check whether Java is already installed by running the command:
    $ java -version
  • If Java is not installed or the version is not appropriate, install Java using the command:
    sudo apt-get install openjdk-6-jdk
  • Install the python-software-properties utility (needed for the add-apt-repository command) and add the Java PPA by running these commands:
    sudo apt-get install python-software-properties
    sudo add-apt-repository ppa:ferramroberto/java
    sudo apt-get update
  • Install Sun Java and select it as the default Java:
    sudo apt-get install sun-java6-jdk
    sudo update-java-alternatives -s java-6-sun
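Once the JDK is installed, you can confirm the installation and locate its directory; the /usr/lib/jvm path is the one used in the next step when setting JAVA_HOME:

    java -version
    ls /usr/lib/jvm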
    
    

Step-4: Installing & Configuring Hadoop on all nodes (master & slaves)

Download the Hadoop 1.2.1 release into your home folder, either through the browser from the Apache Hadoop releases page or, alternatively, from the terminal; a sample command is shown below.

Hadoop will now get downloaded in the home folder.
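A minimal sketch of the terminal download, assuming the 1.2.1 tarball is still available from the Apache archive at this path:

    cd ~
    wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz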

  • Go to your home folder and extract the downloaded Hadoop tar file using the command:
    tar -xzvf hadoop-1.2.1.tar.gz
  • Set HADOOP_HOME to point at the extracted Hadoop directory and add its bin directory to your PATH by editing ~/.bashrc (use the path of whichever Hadoop version you installed):
    gedit ~/.bashrc   (you normally own this file, so sudo is not needed; if it is not writable, fix its permissions first, e.g. chmod u+w ~/.bashrc)
    export HADOOP_HOME=/home/girish/hadoop-1.2.1   (this is just an example; use your own path)
    export PATH=$PATH:$HADOOP_HOME/bin
  • Set JAVA_HOME in the conf/hadoop-env.sh file (a sample entry is shown after this list). Usually Java is installed under /usr/lib/jvm; pick the Sun Java directory there. To confirm the exact path, right-click on THIRDPARTYLICENSEREADME.txt and check the Location value, which should look like /usr/lib/jvm/java-6-sun.
  • Change the ownership and permissions of the Hadoop home directory (prefer typing these commands rather than pasting them, so copied characters do not get mangled):
    chown -R girish hadoop-1.2.1
    chmod -R 755 hadoop-1.2.1
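A minimal sketch of the JAVA_HOME entry in conf/hadoop-env.sh, assuming Sun Java 6 is installed at the location found above:

    # conf/hadoop-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-6-sun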

Step-5: Modify Hadoop configuration files (master & slaves)

  • Create a directory hdfs with subdirectories data, name and temp.
  • Create a directory tempdir under the home directory of the user you want to use for Hadoop.
  • Update conf/core-site.xml as below (a full sample file is shown after this list):
  • Change the <name>hadoop.tmp.dir</name> value to /home/girish/tempdir, i.e. <value>/home/girish/tempdir</value>
  • Change the <name>fs.default.name</name> value from <value>hdfs://localhost:54310</value> to <value>hdfs://master:9000</value>
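The original screenshot of the file is not reproduced here; a minimal sketch of the resulting conf/core-site.xml, with the two properties described above placed in the standard Hadoop configuration skeleton, looks like this:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/girish/tempdir</value>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
      </property>
    </configuration>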

Update conf/mapred-site.xml as below:
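The screenshot with the exact values is missing; in a typical Hadoop 1.x setup for this layout, mapred-site.xml points the JobTracker at the master node. The port 9001 below is an assumption (a commonly used value), not taken from the original:

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>master:9001</value>
      </property>
    </configuration>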


Update conf/hdfs-site.xml as below:
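Again, the original screenshot is missing; a typical hdfs-site.xml matching the hdfs/name and hdfs/data directories created earlier might look like the following. The exact paths and the replication factor of 2 are assumptions:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
      <property>
        <name>dfs.name.dir</name>
        <value>/home/girish/hdfs/name</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/home/girish/hdfs/data</value>
      </property>
    </configuration>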


Update the masters file under the conf directory.

Replace localhost with your master machine's name (in our case the master machine's name is master itself).
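The conf/masters file then contains just the master's hostname:

    master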


Similarly, update the slaves file under the conf directory. Here, add the names of the nodes that you want to act as slave nodes. You can add the master node's name as well as the slave's; this will run a DataNode on the master as well.
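With both machines acting as DataNodes, conf/slaves contains:

    master
    slave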


Now copy the already-configured Hadoop directory from the master node to the other nodes.

To copy from Ubuntu1 (master) to Ubuntu2 (slave), use the command below:

scp -r hadoop-1.2.1 girish@slave:/home/girish

Step-6: Format the Hadoop NameNode

(Figure: HDFS architecture)

  • Format the NameNode on the master machine using the command below. Go to the hadoop-1.2.1/bin directory and then run:
    hadoop namenode -format


or

  • Execute the command below from the Hadoop home directory:
    $ ~/hadoop-1.2.1/bin/hadoop namenode -format

Step-7: Start the Hadoop daemons

Now start Hadoop using the command below:

$ ~/hadoop-1.2.1/bin/start-all.sh
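start-all.sh starts both the HDFS daemons (NameNode, SecondaryNameNode, DataNode) and the MapReduce daemons (JobTracker, TaskTracker). To shut everything down later, use the matching stop script from the same directory:

    $ ~/hadoop-1.2.1/bin/stop-all.sh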

Step-8: Verify that the daemons are running

$ jps  (if jps is not in path, try  /usr/java/latest/bin/jps)

The output will look similar to this:

9316 SecondaryNameNode

9203 DataNode

9521 TaskTracker

9403 JobTracker

9089 NameNode

This shows we have all the daemons running.

Step-9: Verify the NameNode and JobTracker web UIs

Verify the NameNode and JobTracker web UIs using the following URLs:

namenode UI:   http://machine_host_name:50070

job tracker UI:   http://machine_host_name:50030

Substitute 'machine_host_name' with either the public IP or the hostname of your node, e.g. http://ec2……com:50070, which in our case will be as below.

namenode UI:   http://master:50070

job tracker UI:   http://master:50030

Now you have successfully installed and configured Hadoop in Pseudo Distributed mode.

Girish Kumar


Technical Lead

Girish Kumar is a Technical Lead at 3Pillar Global and the head of our Java Competency Center in India. He has been working in the Java domain for over 8 years and has gained rich expertise in a wide array of Java technologies including Spring, Hibernate and Web Services. In addition, he has good exposure to implementing the complete SDLC using Agile and TDD methodologies. Prior to joining 3Pillar Global, Girish worked with Cognizant Technology Solutions for more than 5 years, where he served some of the biggest names in the Banking and Finance verticals in the U.S. and U.K.

Girish’s current challenges at 3Pillar include getting the best out of Apache Hadoop, NoSQL and distributed systems. He provides day-to-day leadership to the members of the Java Competency Center in India by enforcing best practices and providing technical guidance in key projects.
