Building a MapReduce Project with Maven
Let’s take our first step toward getting our hands dirty with Hadoop programming: building a MapReduce project using Maven. Before we start building the project, let’s answer the following question.
WHY USE MAVEN FOR BUILDING MAPREDUCE AND OTHER HADOOP ECOSYSTEM PROJECTS?
Apache Hadoop is a Java-based framework, and many tools in the Hadoop ecosystem, such as MapReduce, HBase, Sqoop, Flume, Oozie, Cassandra and Mahout, are written in Java as well. Others, such as Spark, Kafka and Storm, are written largely in Scala or Clojure, but they still run on the JVM and publish their libraries as Maven artifacts.
Most Java projects depend heavily on third-party libraries that provide additional functionality to a Java application. “Third-party libraries” are code from outside the project itself, and they come in the form of JAR (Java ARchive) files. To compile a Java project that depends on third-party libraries, the JAR files of these libraries must be added to the classpath. Moreover, a third-party JAR file might depend on another third-party JAR file, which might depend on yet another, and so forth. This is known as a “transitive dependency”, or in other words a “dependency of a dependency”.
Downloading the correct versions of all the required external JAR files from remote repositories, working out and downloading the correct versions of their dependencies, and setting the classpath for all of them by hand would be a daunting task. Fortunately, Apache Maven makes the build process a breeze by obtaining and managing dependencies for us. Maven detects the libraries that the dependencies of our Java project require and automatically includes them in the classpath.
Maven is a powerful build automation and project management tool, primarily for Java projects, and it’s heavily used by many open source Java projects hosted by the Apache Software Foundation, SourceForge, Google Code and elsewhere. So it’s worth investing our time in understanding how to build and manage Java projects using Maven.
Before installing Maven and the other necessary tools, let’s familiarize ourselves with the key terms encountered in Maven.
Project Object Model (POM): The Project Object Model, or POM, contains information about the project and the configuration details Maven uses to build it. It is an XML file named pom.xml, located in the root directory of each project.
Local Repository: This is a directory on your local system where all the project JAR files, library/dependency JAR files, and any other project-specific artifacts are stored. Basically, when we build a Maven project, all the dependency files get stored in our Maven local repository.
By default, Maven creates the local repository under the “.m2” directory in the user’s home directory (the artifacts themselves are stored in its “repository” subdirectory).
Windows machine: C:\Users\<UserName>\.m2
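If the default location is not suitable, the local repository can be moved by setting the localRepository element in Maven’s user settings file (settings.xml inside the .m2 directory). The following is only a minimal sketch: the path is a made-up example, and nothing in this tutorial requires changing the default.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical user settings file, saved as settings.xml in the .m2 directory -->
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
  <!-- Assumed path for illustration only; omit this element to keep the default location -->
  <localRepository>D:/maven/repository</localRepository>
</settings>
```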
Central Repository: When we build a Java project using Maven, it reads the project’s pom.xml file to work out which dependencies are needed. Maven first checks whether each dependency is available in our local repository. If it is not found there, Maven fetches it from the Maven central repository located at https://repo1.maven.org/maven2/ and keeps a copy in the local repository.
Artifact: An artifact is a file that is either produced or used by a Java project built with Maven. A Maven build produces one or more artifacts, such as JARs, WARs or EARs, that get deployed to a Maven repository. Each artifact is uniquely identified by a group ID, an artifact ID, and a version string.
Maven Coordinates: In Maven, group identifier, artifact identifier, and version coordinates uniquely identify a project, a dependency or a plugin in a POM file. These three identifiers are together known as Maven Coordinates.
groupId is a unique identifier for an organization or a project. It resembles your web site’s domain name written in reverse, and can contain subgroups as needed: for instance, net.bytequest.bigdataoncloud and net.bytequest.bigdataoncloud.wordcount are both valid groupIds.
artifactId is the name of the project.
version uniquely identifies the version of the project.
The POM defined below is the bare minimum required for a java project built using Maven. It specifies the Maven coordinates for a project.
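The listing itself is not reproduced in this copy of the post, so here is a minimal sketch of what such a POM could look like. The values are assumptions pieced together from the groupId example above and the wordcount-0.0.1.jar file name that appears later in this post. Note that, besides the three coordinates, Maven also expects the modelVersion element.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <!-- Version of the POM format itself (4.0.0 for Maven 2 and 3), not of the project -->
  <modelVersion>4.0.0</modelVersion>

  <!-- Maven coordinates; the values are assumed for illustration -->
  <groupId>net.bytequest.bigdataoncloud</groupId>
  <artifactId>wordcount</artifactId>
  <version>0.0.1</version>
</project>
```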
A POM file will be invalid without the Maven coordinates: groupId, artifactId, and version.
Packaging: It’s the artifact type of the project. Some of the valid packaging values are jar, war, ear and pom.
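As an illustration, here is a sketch of a POM that declares its packaging explicitly. The coordinates are hypothetical placeholders; the packaging element is the only part that matters for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <!-- Hypothetical coordinates, not taken from the wordcount project -->
  <groupId>net.bytequest.bigdataoncloud</groupId>
  <artifactId>wordcount-web</artifactId>
  <version>0.0.1</version>

  <!-- Artifact type produced by the build; jar is assumed when this element is absent -->
  <packaging>war</packaging>
</project>
```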
Based on the packaging of the above POM file, the artifact generated by the Maven build will be a WAR file. When no packaging is declared, Maven assumes the default packaging: jar.
Dependencies: A typical Java project depends on libraries to build and/or run. These libraries are known as “dependencies” in Maven. The type of a dependency corresponds to the dependent artifact’s packaging type. The default dependency type is jar.
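A dependency is declared in the POM by its Maven coordinates. The fragment below is a sketch of how a Hadoop project might declare the hadoop-client library; it belongs inside the project element of a pom.xml. The version number is an assumption and should be matched to the Hadoop release you are targeting, and this is not necessarily the dependency list used by the wordcount project.

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <!-- Assumed version for illustration; pick the one matching your Hadoop release -->
    <version>2.7.3</version>
    <!-- jar is the default dependency type, so this element is normally left out -->
    <type>jar</type>
  </dependency>
</dependencies>
```

Maven resolves the libraries that hadoop-client itself depends on (its transitive dependencies) automatically, so they do not need to be listed one by one.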
Plugins: Plugins are additional components that add functionality to Maven. For example, to run (execute) a Java project, Maven can use the exec-maven-plugin. Plugins extend the functionality of Maven, whereas dependencies extend the functionality of a Java project.
Snapshots: A work-in-progress version of a project is known as a snapshot. Such a version can be named simply “SNAPSHOT”, which indicates the latest development version, or something like “1.1-SNAPSHOT”, which indicates the development work that will be released as version 1.1.
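Plugins are also declared in the POM, under the build element. The fragment below is a sketch of an exec-maven-plugin declaration that goes inside the project element of a pom.xml. The plugin version is an assumption, and the main class name is hypothetical; replace it with the fully qualified name of your own driver class.

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>exec-maven-plugin</artifactId>
      <!-- Assumed plugin version for illustration -->
      <version>1.6.0</version>
      <configuration>
        <!-- Hypothetical class name; exec:java runs the main method of this class -->
        <mainClass>net.bytequest.bigdataoncloud.wordcount.WordCount</mainClass>
      </configuration>
    </plugin>
  </plugins>
</build>
```

With a declaration like this in place, the exec:java goal runs the configured class in the same JVM as Maven, with the project’s dependencies on the classpath.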
Maven can be installed as a command line application or as a plugin for an IDE like Eclipse. We can use either the standalone version or the Eclipse plugin.
For a developer, the main benefit of using Maven as an Eclipse plugin is that Maven builds can be launched easily from within Eclipse. The classpath of our Eclipse project is set automatically based on Maven’s pom.xml, so there is no need to tweak the classpath manually.
Build engineers typically run official builds from the Maven command line.
We will invest our time in installing the Eclipse plugin for Maven, which is called “m2e”. This plugin comes bundled with an embedded copy of Maven 3.
Let’s watch the following video tutorial to install Eclipse with Maven.
BUILDING THE PROJECT
Now that we have installed Eclipse with built-in Maven support, we can go ahead and build our first Hadoop MapReduce project.
Step 1: Open the GitHub URL that contains the source code and input data for our MapReduce WordCount program: https://github.com/bytequest/bigdataoncloud
Step 2: Click the “Clone or download” button.
Step 3: Click the “Download Zip” link.
Step 4: This downloads a file called bigdataoncloud-master.zip. Save the file to your computer.
Step 5: Extract the zip file to C:/projects or any other folder where you would like to keep your project source code.
Step 6: Launch your Eclipse IDE. Under the File menu, select Import -> Maven -> Existing Maven Projects and click “Next”.
Step 7: Browse to the directory where you extracted the wordcount folder. The check box for the pom file will be pre-checked. Click the “Finish” button.
Step 8: We have successfully imported the wordcount project into Eclipse. If we expand the project, we should be able to see the project folders, pom.xml, Maven Dependencies etc.
If we expand the Maven Dependencies we should be able to see all the dependency jar files for our project that Maven has downloaded for us seamlessly.
If we expand the src/main/java folder, we should be able to see the project’s three Java source files.
Here is the pom.xml file for the wordcount project:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<!-- hadoop mapreduce artifacts -->
The input data set used for our wordcount program is the Project Gutenberg EBook of The Complete Works of William Shakespeare, by William Shakespeare.
Step 9: We are ready to compile and run wordcount. Right-click on pom.xml and choose Run As -> Maven clean.
Maven Clean deletes the files and directories in the target folder generated by Maven from the previous build.
Step 10: Next, we compile and package our source code by running Maven install.
From the Eclipse console we can see that the three Java source files have been compiled successfully and that wordcount-0.0.1.jar has been installed in our local repository.
Step 11: Now it’s time to run our wordcount program. We simply need to run Maven build.
This opens the “Edit Configuration” launch window. In the “Goals” text box, enter exec:java, click the “Apply” button, and then click the “Run” button.
Here is the Eclipse console output from the Maven build of our wordcount program.
Step 12: We have to refresh the project root to see the newly generated “output” folder. It contains a text file named part-r-00000 that holds the word counts for 67780 words (including tokens made up of special characters) from the “Complete Works of Shakespeare”.
Hurray! We have successfully imported, compiled and run our Hadoop MapReduce wordcount program. Basically, we have run Hadoop in a single JVM instance. This is known as running Hadoop in standalone (non-distributed) mode, where Hadoop runs as a single Java process.
In this post, we learned about the benefits of using Maven, covered the Maven basics, and built our first Hadoop MapReduce project with Maven. In the following post, “Run a MapReduce Job in Pseudo-Distributed mode”, we will learn about running a Hadoop MapReduce job in pseudo-distributed mode. If you have any questions or comments about this blog post, or would like to share your experience of building a Hadoop project with Maven, please feel free to post in the comment section below.