Run a MapReduce job in Pseudo-Distributed Mode
In Building a MapReduce Maven Project with Eclipse post, we learned how to run a Hadoop MapReduce job in Standalone mode. In this post, we will figure out on how to run a MapReduce job in Pseudo-Distributed Mode. Prior to delving into the details, let’s understand the various modes in which a Hadoop job can be run.
MODES OF HADOOP
We can run Hadoop jobs in three different modes: Standalone, Pseudo-Distributed and Fully distributed. Each mode has a well defined purpose. Let’s take a look at the characteristics and the usage for each one of them.
Standalone or LocalJobRunner:
- In standalone mode, Hadoop runs in a single Java Virtual Machine (JVM) and uses the local file system instead of the Hadoop Distributed File System (HDFS).
- There are no daemons running in this mode.
- The jobs will be run with one mapper and one reducer.
- Standalone mode is primarily used to test the code with a small input during development since it is easy to debug in this mode.
- Standalone mode is faster than pseudo distributed mode.
Pseudo-Distributed or Single Node Cluster:
- Pseudo distributed mode simulates the behavior of a cluster by running Hadoop daemons in different JVM instances on a single machine.
- This mode uses HDFS instead of the local file system.
- There can be multiple mappers and multiple reducers to run the jobs.
- In pseudo distributed mode a single node will be used as Master Node, Name Node, Data Node, Job Tracker & Task Tracker. Hence, the replication factor is just one.
- Configuration is required in the following Hadoop configuration files: mapred-site.xml, core-site.xml, hdfs-site.xml.
- Pseudo distributed mode is primarily used by developers to test their code in a simulated cluster environment.
Fully Distributed or Multi Node Cluster:
- Fully distributed mode offers the true power of Hadoop – distributed computing capability, scalability, reliability and fault tolerance
- In fully distributed mode, the code runs on a real Hadoop cluster (production cluster) with several nodes.
- All daemons are run in separate nodes.
- Data will be distributed across several nodes.
Now that we are aware of the various modes in which a Hadoop job can be run, let’s roll up our sleeves and work our way to run a Hadoop MapReduce job in a pseudo distributed mode.
RUNNING A MAPREDUCE JOB IN A PSEUDO DISTRIBUTED MODE
Step 1: We need to have Oracle Virtual Box and Cloudera QuickStart VM installed in our computer. If it has not been installed let’s get started with How to Install Hadoop on Windows with Cloudera VM.
Step 2: We have to launch the virtual box and turn on the Cloudera QuickStart Virtual Machine.
Before we can build our project, we need a copy of the wordcount MapReduce project in our QuickStart VM. Git command line interface is a tool that we could employ for this purpose. Cloudera QuickStart VM has a pre-installed version of git command line interface (CLI). Note: Git is the most widely used open source version control system among the developer community.
Step 3: Let’s open a terminal window and issue the below git clone command to get a copy of the source code and input data for our MapReduce WordCount program from the following GitHub URL: https://github.com/bytequest/bigdataoncloud
Step 4: Let’s verify if the wordcount project has been copied to our file system successfully. We can run unix ls command to make sure if the wordcount project directory has been created.
Let’s a take a look at the contents of wordcount directory.
Step 5: Now that we have the source code ready for our wordcount project, we can go ahead and a) import it as an existing Maven project in Eclipse and b) run Maven Clean and c) Maven Install. Note: If you need help with importing and compiling an existing Maven project please see this post: Building a MapReduce program with Maven and Eclipse
The Maven Install step will produce the jar file “wordcount-0.0.1.jar” that we need to run our wordcount program. It will be located in the target directory under wordcount directory. Let’s run the following command in a terminal window to make sure if the jar file exists.
Step 6: Now, we have the jar file ready to run our wordcount MapReduce program. To run it in a pseudo distributed mode, we need to run it in HDFS instead of issuing Maven Build command in Eclipse.
Before running wordcount program in HDFS, we have to copy the input text file in HDFS. Let’s first create an input directory “dataset” in HDFS by running the hadoop fs mkdir command.
Now we can copy our input file to HDFS dataset directory by running the hadoop fs put command.
Step 7: We can use the Hadoop UI tool “Hue” in Cloudera QuickStart VM to verify if the input text file has been copied to HDFS . First, let’s open a browser window and select “Hue” tab.
After Hue loads, we have to click on the “File Browser” icon.
We will be able to see the input file “WorksOfShakespeare.txt” residing under “dataset” directory.
Step 8: Now that we have copied the input file to HDFS, we can go ahead and run the wordcount MapReduce program in HDFS using the Hadoop jar command.
The “Hadoop jar” command runs a program contained in a JAR file. Here’s it’s syntax:
hadoop jar <jar file name> [<arguments>]
We will issue the below command in a terminal window.
In the command that we issued, we passed the name of the wordcount jar file (that we created using Eclipse) followed by three arguments:
first argument ==> Main java class of wordcount project: net.bytequest.bigdataoncloud.wordcount.WordCountDriver
second argument ==> input text file: dataset/WorksOfShakespeare.txt
third argument ==> output directory: “output”
The below two screenshots are the console output from executing the Hadoop jar command.
Step 9: The MapReduce job ran successfully. Now, let’s verify if the “output” directory is created in the HDFS by running the following command.
An output file is expected for each reducer in a MapReduce program. Since there was only one reducer for this job, we should only see one part-* File.
Step 10: Let’s browse the contents (first few lines) of the output file “part-r-00000” by piping the output of hadoop fs cat command to head command.
We should see the same output as when we ran the MapReduce job in standalone mode as mentioned in the post: Building a MapReduce program with Maven and Eclipse
Step 11: To find out the total number of lines in part-r-00000 output file, we can pipe the output of hadoop fs cat command to wc command.
As we can see, there are 67779 lines in the part-r-00000 file.
Step 12: We could also copy the output from HDFS to the local file system. To do that let’s first create an “output” directory under wordcount directory in the local file system and then issue hadoop fs get command to copy part-r-00000 file from HDFS to our local file system. Let’s run the following commands in a terminal window.
Step 13: To make sure we have copied part-r-00000 file in our local system let’s run the following command.
JOB HISTORY AND LOGS
Step 14: Now that we successfully ran our MapReduce job in pseudo distributed mode, let’s spend some time to view the list of completed jobs, their job history, logs of map tasks, logs of reduce tasks etc. by pointing our browser to http://localhost:8088. Let’s follow the links encircled in orange in each screenshot.
Hurray! We have successfully ran a MapReduce job in pseudo-distributed mode and were able to view job summary and job logs.
In this post, we learned about the various modes of Hadoop, running a MapReduce job in pseudo- distributed mode and access the job summary and job logs. If you have any questions or comments regarding this blogpost or would like to share your experience on running a Hadoop job in pseudo-distributed mode, please feel free to post it in the comment section below.