My first tutorial briefly introduced Apache Hadoop, talked about the different modes in which you can run Hadoop, outlined the prerequisites for setting up Hadoop correctly, and explained in detail how to set up Apache Hadoop in Standalone Mode. In this second tutorial I will illustrate the steps required to set up Apache Hadoop in Pseudo Distributed Mode.
Installing & Configuring Hadoop in Pseudo Distributed Mode
Step-1: Configuring master & slave nodes
We will be using two machines, one as the master and the other as a slave. I have used Ubuntu 11.10 in this demonstration and named one virtual machine Ubuntu1 and the other Ubuntu2.
$ sudo gedit /etc/hostname
$ sudo service hostname start
$ sudo gedit /etc/hosts
$ sudo vi /etc/hosts (vi can be used in place of gedit)
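In /etc/hosts, map each node's name to its IP address so the machines can resolve one another. A minimal sketch (the IP addresses below are placeholders; substitute your machines' actual addresses):
192.168.0.1    master
192.168.0.2    slave
The same two entries should be present on both machines.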
Step-2: Configuring SSH on all nodes (master & slaves)
sudo apt-get install ssh
sudo apt-get install openssh-server (on Ubuntu the SSH server package is openssh-server; there is no sshd package)
(this step is required to connect to other machines)
Generate a passwordless RSA key pair and append the public key to the authorized keys:
ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Then test with ssh localhost; pressing yes at the prompt will add localhost to the known hosts.
Once SSH is successfully configured on all nodes, confirm passwordless SSH connectivity from the master to the slave nodes and vice versa, for example as below.
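Assuming the hostnames master and slave from Step-1 and the user girish used later in this tutorial, run this from the master:
$ ssh girish@slave
and this from the slave:
$ ssh girish@master
Neither should prompt for a password. If one does, copy your public key to the other machine first, e.g. with ssh-copy-id girish@slave.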
Step-3: Install Java on all nodes (master & slaves)
sudo apt-get install openjdk-6-jdk
Alternatively, to install Sun Java 6 from a PPA:
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:ferramroberto/java
sudo apt-get update
sudo apt-get install sun-java6-jdk
sudo update-java-alternatives -s java-6-sun
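Whichever JDK you pick, confirm the installation afterwards with a quick version check:
$ java -version
It should report a 1.6.x Java version.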
Step-4: Installing & Configuring Hadoop on all nodes (master & slaves)
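First download the Hadoop 1.2.1 tarball to your home folder; one option is the Apache archive (the URL below is an assumption, any mirror carrying hadoop-1.2.1 works):
$ cd ~
$ wget https://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz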
Hadoop will now be downloaded to the home folder. Extract the tarball:
tar -xzvf hadoop-1.2.1.tar.gz
sudo gedit ~/.bashrc (before this you may need to change the file permissions with: sudo chmod 777 <filename>)
Add the following lines at the end, adjusting HADOOP_HOME to your own install path:
export HADOOP_HOME=/home/girish/hadoop-1.2.1 (this is just an example)
export PATH=$PATH:$HADOOP_HOME/bin
chown -R girish hadoop-1.2.1
(prefer typing these commands rather than pasting them into the terminal)
chmod -R 755 hadoop-1.2.1
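Afterwards, reload ~/.bashrc and confirm the hadoop command resolves (a quick sanity check; the version line assumes the 1.2.1 tarball used here):
$ source ~/.bashrc
$ hadoop version
The first line of the output should read Hadoop 1.2.1.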
Step-5: Modify Hadoop configuration files (master & slaves)
Update conf/mapred-site.xml as below:
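A minimal sketch of conf/mapred-site.xml for this setup, pointing the JobTracker at the master node (port 9001 is a conventional choice for Hadoop 1.x, not something fixed; adjust as you prefer):
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>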
Update conf/hdfs-site.xml as below:
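A minimal sketch of conf/hdfs-site.xml, setting the replication factor to match our two-node cluster:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
Note that conf/core-site.xml should likewise point fs.default.name at the master (e.g. hdfs://master:9000) so every daemon can find the NameNode.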
Update the masters file under the conf directory.
Replace localhost with your master machine's name (in our case the master machine's name is master itself).
Similarly, update the slaves file under the conf directory. Here, add the names of the nodes that you want to act as slave nodes. You can add your master node's name as well; this will run a DataNode on the master too, as shown in the example below.
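For this two-node setup the files would look like this (hostnames as configured in Step-1):
conf/masters:
master
conf/slaves:
master
slave
Listing master in conf/slaves is what makes a DataNode (and TaskTracker) run on the master node as well.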
Now copy the configured Hadoop directory from the master node to the other nodes.
To copy from Ubuntu1 to Ubuntu2, use the command below:
scp -r hadoop-1.2.1 girish@slave:/home/girish
Step-6. Format Hadoop NameNode
hadoop namenode -format
If hadoop is not yet on your PATH, use the full path to the script:
$ ~/hadoop-1.2.1/bin/hadoop namenode -format
Step-7. Start Hadoop daemons
Now start the Hadoop daemons using the command below.
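In Hadoop 1.x all daemons are started by the start-all.sh script shipped in the bin directory:
$ ~/hadoop-1.2.1/bin/start-all.sh
This launches the NameNode, SecondaryNameNode and JobTracker on the master, and a DataNode and TaskTracker on every node listed in conf/slaves.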
Step-8. Verify the daemons are running
$ jps (if jps is not in path, try /usr/java/latest/bin/jps)
The output on the master node will look similar to this (the process IDs are illustrative and will differ on your machine):
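4825 NameNode
5046 DataNode
5286 SecondaryNameNode
5379 JobTracker
5601 TaskTracker
5701 Jps
(On a node running only the slave daemons you would see just DataNode, TaskTracker and Jps.)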
This shows we have all the daemons running.
Step-9. Verify UIs of NameNode & JobTracker
Verify that the NameNode and JobTracker web UIs are up by visiting the following URLs:
namenode UI: http://machine_host_name:50070
job tracker UI: http://machine_host_name:50030
Substitute ‘machine_host_name’ with either the public IP or the hostname of your node, e.g. http://ec2……com:50070, which in our case will be as below.
namenode UI: http://master:50070
job tracker UI: http://master:50030
Now you have successfully installed and configured Hadoop in Pseudo Distributed Mode.