A Quick Set-Up Guide for Single Node Hadoop Clusters

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs. Hadoop uses MapReduce computing.

Creating Single Node Clusters


GNU/Linux is supported as a development and production platform.

Java™ must be installed. The recommended Java versions are described at HadoopJavaVersions.

ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.

Downloading Hadoop

To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors.


Begin by unzipping the Solr. Below is an example with a shell in UNIX:

hadoop-2.5.2.tar.gz/:$ unzip -q hadoop-2.5.2.tar.gz
/:$ cd hadoop-2.5.2/

Hadoop Distributed File System (HDFS)

HDFS is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, including enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.

The Hadoop Distributed File System: Architecture and Design



The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open hadoop-env.sh (The full path is /hadoopfolder/etc/hadoop), where Hadoop folder belongs to the folder that you have extracted, and check whether the JAVA_HOME environment variable is set or not.

1. Conf/*-site.xml

In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” only contains our single local machine.

The file is located at hadoopfolder/etc/hadoop where the Hadoop folder belongs to the folder that you have extracted. We can set the hadoop.tmp.dir parameter so that it denotes the name of the directory in which the Hadoop will store the data. The other parameter that we need to set is fs.defaultFS, which denotes the name of the default file system. More Parameters can also be set.







<property> <



<!-- <description>The Name of tegh default file system</description>-->



2. Conf/* mapred-site.xml

The next file to be configured is mapred-site.xml



<value>localhost:54311</value><!--localhost denotes the hostname-- > <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.



We can also set other Parameters.

3. Conf/* hdfs-site.xml

The next file to be configured is hdfs-site.xml




<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.



Formatting the HDFS File System via the NameNode

The first step to starting up your Hadoop installation is formatting the Hadoop file system, which is implemented on top of the local file system of your “cluster.” You need to do this the first time you set up a Hadoop cluster.

To format the file system (which simply initializes the directory specified by the dfs.name.dir variable), run the command.

snagpal@snagpal$ /hadoop/bin/hadoop namenode -format

In this command, Hadoop is the folder in which we have extracted the Hadoop.

Starting Your Single-Node Cluster

This will start up a Namenode, Datanode, Jobtracker and a Tasktracker on your machine.

To check the SSH status on your machine, use the following command:

snagpal@snagpal$ sudo /etc/init.d/sshd status

To start the single-node cluster, use the following command:

snagpal@snagpal$ /hadoop/sbin/start-all.sh

You can see the results on your browser by using the following URL: http://<hostname>/50070


Running a MapReduce Job

We will now run your first Hadoop MapReduce job. We will use the Word Count Example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.

1. Make a Directory: We need to create a directory on HDFS so that we can copy the data from local to server for further processing.

/bin/hdfs dfs -mkdir /

In the above command, dirname denotes the name of the directory. For example: bin/hdfs dfs-mkdir/sumit.

The directory can be seen on a browser as such:


2. Copy data to HDFS: Before we run the MapReduce job, we need to copy the data to HDFS. We will copy the data from local files to the HDFS directed that we created earlier.

/bin/hadoop fs -put

<inputfile>: tThis contains the path of the file that needs to be copied to the HDFS directory.

<destination>: This contains the path of the output HDFS directory where we need to put the data.

For example: /Hadoop fs -put/home/snagpal/Desktop/test.txt sumit.


3. Run the MapReduce job to count words: Now we can run the MapReduce job that will count the words.

/bin/hadoop jar /hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar wordcount

<input>: This denotes the name of the directory where we have copied the input files on the HDFS.

<output>: This denotes the output directory where the results will be copied.

For example: /hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar wordcount/sumit/output

hadoop_5 hadoop_6

The output directory will be created and the result will only be in the output folder.


The following command can be used to stop the Solr server:

snagpal@snagpal$ /hadoop/sbin/stop-all.sh

Sumit Nagpal

Sumit Nagpal

Sumit Nagpal has spent the last year and a half working as a Senior Software Engineer with 3Pillar Global. He has completed a SCJP1.6 certification and has a total of 4 years of programming experience in Java technologies like Servlets, JSPs, J2EE, EJB, Struts, Spring, Hibernate, Web Services, and JAXB, among others. He also has experience in the media advertisement domain working with large-scale clients to create different products.

2 Responses to “A Quick Set-Up Guide for Single Node Hadoop Clusters”
  1. Rohit on

    Nice SUMIT ,
    thanks for such a important article.

  2. Amar Gehlot on

    Hi Sumit,

    I follow your tutorials for Hadoop single & multi nodes setup. I done secessful this setup with help of your documents & trips.

    Very good & written in simple language to understand.

    Thanks l lots for your good articles.

    Thanks again.

Leave a Reply

Related Posts

The 3 Keys to Building Products That Drive Retention –... I had the privilege of being invited to speak at the Wearable Technology Show in Santa Clara this week, where I gave a bit of a reprisal of a talk I d...
High Availability and Automatic Failover in Hadoop Hadoop in Brief Hadoop is one of the most popular sets of big data processing technologies/frameworks in use today. From Adobe and eBay to Facebook a...
3Pillar CEO David DeWolf Quoted in Enterprise Mobility Excha... David DeWolf, Founder and CEO of 3Pillar Global, was recently quoted in a report by Enterprise Mobility Exchange on the necessity of understanding and...
How the Right Tech Stack Fuels Innovation – The Innova... On this episode of The Innovation Engine podcast, we take a look at how choosing the right tech stack can fuel innovation in your company. We'll talk ...
The Road to AWS re:Invent 2018 – Weekly Predictions, P... For the last two weeks, I’ve been making predictions of what might be announced at AWS’ upcoming re:Invent conference. In week 1, I made some guesses ...