November 18, 2013
How to Configure Apache Hadoop in Pseudo Distributed Mode
My first tutorial briefly introduced Apache Hadoop, described the different modes in which you can run Hadoop, outlined the prerequisites for setting up Hadoop correctly, and explained in detail how to set up Apache Hadoop in Standalone Mode. In this second tutorial I will illustrate the steps required to set up Apache Hadoop in Pseudo Distributed Mode.
Installing & Configuring Hadoop in Pseudo Distributed Mode
Step-1: Configuring master & slave nodes
We will be using two machines, one as the master and the other as the slave. I have used Ubuntu 11.10 in this demonstration and named the virtual machines Ubuntu1 and Ubuntu2.
- Change the hostname of these machines using the command:
$ sudo gedit /etc/hostname
- Give Ubuntu1 the hostname master and Ubuntu2 the hostname slave. You can verify the hostname by executing the command: $ hostname
- If the updated name fails to appear, restart the hostname service using the command:
$ sudo service hostname restart
- Now edit the host entries on both machines, using the command:
$ sudo gedit /etc/hosts
(or, if you prefer vi: $ sudo vi /etc/hosts)
- Add the master and slave machines' IPs and names to these files:
192.168.118.149 master
192.168.118.151 slave
Step-2: Configuring SSH on all nodes (master & slaves)
- Install SSH on all nodes (this is required to connect to the other machines) using the command:
sudo apt-get install ssh
(on Ubuntu this package installs both the OpenSSH client and the server)
- Generate an SSH key:
$ ssh-keygen -t rsa -P ""
(press Enter when asked for the file name; this will generate a passwordless key)
- Now copy the public key (id_rsa.pub) of the current machine to authorized_keys. Executing the following command appends the generated public key to the .ssh/authorized_keys file.
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
- Verify the SSH configuration using the command:
$ ssh localhost
Answering yes will add localhost to the known hosts.
Once SSH is successfully configured on all nodes, confirm passwordless SSH connectivity from the master to the slave nodes and vice versa (for example, run $ ssh slave on the master and $ ssh master on the slave).
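The key generation and authorization steps above can be condensed into one idempotent snippet, to be run on every node. This is a sketch assuming OpenSSH is installed and the default key path is used; copying each machine's public key to the other machines still has to be done afterwards.

```shell
# Create ~/.ssh if missing, generate a passwordless RSA key once,
# and authorize it for local passwordless login.
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
[ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -t rsa -P "" -f "$HOME/.ssh/id_rsa" -q
cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
```

After running this on both machines, append each machine's id_rsa.pub to the other machine's authorized_keys (for example with ssh-copy-id girish@slave) so that cross-machine logins are passwordless too.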
Step-3: Install Java on all nodes (master & slaves)
- First check whether Java is already installed by running the command:
$ java -version
- If Java is not installed or the version is not appropriate, install it using the command:
sudo apt-get install openjdk-6-jdk
- Install the python-software-properties utility and add the Java PPA by running these commands:
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:ferramroberto/java
sudo apt-get update
- Install Sun Java:
sudo apt-get install sun-java6-jdk
sudo update-java-alternatives -s java-6-sun
Step-4: Installing & Configuring Hadoop on all nodes (master & slaves)
- Hadoop installation is required on all nodes. You can download a stable version of Hadoop from the Apache Hadoop releases page (Releases > Download > Download a release now).
- Copy the link location for hadoop-1.2.1.tar.gz, e.g.: http://mirror.reverse.net/pub/apache/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
Hadoop will now be downloaded to the home folder.
- Go to your home folder and extract the downloaded Hadoop tar file using the command.
tar -xzvf hadoop-1.2.1.tar.gz
- Hadoop 1.2.1 can locate its home directory on its own, but for older versions (and for convenience on the command line) you should set HADOOP_HOME and extend PATH in your shell profile:
$ gedit ~/.bashrc
(no sudo is needed for a file in your own home directory)
Add lines like the following, adjusting the path to your install location (this is just an example):
export HADOOP_HOME=/home/girish/hadoop-1.2.1
export PATH=$PATH:$HADOOP_HOME/bin
- Set JAVA_HOME in the conf/hadoop-env.sh file. Usually Java is installed under /usr/lib/jvm. Pick the Sun Java directory (right-click THIRDPARTYLICENSEREADME.txt and check its Location value); it should look like /usr/lib/jvm/java-6-sun.
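Concretely, the line to set in conf/hadoop-env.sh would look like this; the path assumes the Sun Java 6 install from Step-3, so adjust it to whatever /usr/lib/jvm shows on your machine:

```shell
# conf/hadoop-env.sh -- point Hadoop at the JDK (path is an example)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
```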
- Change the Hadoop home directory ownership and permissions (prefer typing the commands rather than pasting them into the terminal):
chown -R girish hadoop-1.2.1
chmod -R 755 hadoop-1.2.1
Step-5: Modify Hadoop configuration files (master & slaves)
- Create a directory hdfs with subdirectories data, name and temp.
- Create a directory tempdir under the home directory of the user you want to use for Hadoop.
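The directory layout from the two steps above can be created in one go. This sketch assumes the Hadoop user's home directory (e.g. /home/girish, as used elsewhere in this tutorial):

```shell
# Create the hdfs directory tree (name, data, temp) and the tempdir
# that hadoop.tmp.dir will point at.
mkdir -p "$HOME/hdfs/name" "$HOME/hdfs/data" "$HOME/hdfs/temp"
mkdir -p "$HOME/tempdir"
```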
- Update conf/core-site.xml as below:
- Change the <name>hadoop.tmp.dir</name> value to /home/girish/tempdir, like below:
<value>/home/girish/tempdir</value>
- Change the <name>fs.default.name</name> value from hdfs://localhost:54310 to point at the master, like below:
<value>hdfs://master:9000</value>
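Putting those two properties together, the resulting conf/core-site.xml would look like this (the tempdir path and user name are the ones used throughout this tutorial; substitute your own):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/girish/tempdir</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```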
Update conf/mapred-site.xml as below:
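The original listing is not preserved here, so the following is a typical Hadoop 1.x mapred-site.xml sketch for this setup; pointing mapred.job.tracker at the master is required, while port 9001 is a conventional choice, not something mandated by the tutorial:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>
```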
Update conf/hdfs-site.xml as below:
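Again the original listing is missing, so this is a hedged Hadoop 1.x sketch: the name and data directories reuse the hdfs/name and hdfs/data directories created earlier in Step-5, and a replication factor of 2 is an assumption that matches our two-node setup:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/girish/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/girish/hdfs/data</value>
  </property>
</configuration>
```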
Update the masters file under the conf directory: change localhost to your master machine's name (in our case the master machine is named master itself).
Similarly update the slaves file under the conf directory. Here add the names of the nodes you want to act as slave nodes. You can add the master node's name as well as the slave's; this will run a DataNode on the master too.
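For the two-machine layout in this tutorial, conf/masters would contain the single line master (it determines where the SecondaryNameNode runs), and conf/slaves, listing every node that should run a DataNode and TaskTracker, would read:

```
master
slave
```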
Now copy Hadoop (already configured) to other nodes from master node
To copy from Ubuntu1 to Ubuntu2 use the below command:
scp -r hadoop-1.2.1 girish@slave:/home/girish
Step-6: Format the Hadoop NameNode
- Format the master machine's NameNode by executing the following command from the Hadoop home directory:
$ ~/hadoop-1.2.1/bin/hadoop namenode -format
Step-7: Start Hadoop daemons
Now run Hadoop using the command below:
$ ~/hadoop-1.2.1/bin/start-all.sh
Step-8: Verify the daemons are running
$ jps (if jps is not in path, try /usr/java/latest/bin/jps)
The output will look similar to this (process IDs will vary): with both master and slave listed in the slaves file, jps on the master should show NameNode, SecondaryNameNode, JobTracker, DataNode, TaskTracker and Jps, while the slave shows DataNode, TaskTracker and Jps.
This shows we have all the daemons running.
Step-9: Verify the NameNode & JobTracker UIs
Verify the NameNode and JobTracker web UIs by opening the following URLs:
namenode UI: http://machine_host_name:50070
job tracker UI: http://machine_host_name:50030
Substitute machine_host_name with either the public IP or the hostname of your node (e.g. http://ec2......com:50070), which in our case will be as below.
namenode UI: http://master:50070
job tracker UI: http://master:50030
Now you have successfully installed and configured Hadoop in Pseudo Distributed mode.