Computer Science MapReduce Support
This page describes running MapReduce jobs on computer science systems.
All of our systems have Hadoop installations set up for "standalone", or "local", mode. That means that all jobs run on the system where you start them. Hadoop was originally implemented when computers had one or two processors, so handling large jobs required splitting the work across many computers. Today, servers have 80 to 256 cores, so you can get a lot done by using multiple cores on the same system. In comparison with a Hadoop cluster, jobs here:
- Always run in local mode. There's no cluster and no YARN. You can still run multiple map and reduce tasks in parallel. They just all run on the current host. If you use a larger system, e.g. ilab1, ilab2 or ilab3, there are enough cores to get good parallelism.
- All map and reduce tasks run in a single JVM. Unlike the cluster, this JVM is not of a fixed size. It will expand to fit the jobs running in it.
- By default, only one map task and one reduce task run at a time. The defaults are designed for testing and debugging your code. For an actual run, you'll want to specify something like "-D mapreduce.local.map.tasks.maximum=10". See below.
- Always use local files. There's no HDFS.
The current copy of Hadoop is version 3.4.0. It uses Java 11. Currently our default Java is 17, but Hadoop does not appear to support Java 17.
Python
Most of this document is about running MapReduce jobs written in Java. If you're interested in using Python, see Writing An Hadoop MapReduce Program in Python. Note that they show "bin/hadoop" for the hadoop command. On our systems it's just "hadoop". Also, you don't need to (and can't) copy your files to HDFS. They will work fine in your home directory. Here's the command to run the sample program on our systems:
hadoop jar /common/system/hadoop-3.4.0/share/hadoop/tools/lib/hadoop-streaming-3.4.0.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input gutenberg/* -output gutenberg-output
Note that their sample is written in Python 2. You will need to fix it for Python 3.6 and above by replacing the print statement with
print(f'{word}\t1')
for mapper.py and
print(f'{word}\t{count}')
for reducer.py. You'll also want to change the first line from "#!/usr/bin/env python" to "#!/usr/bin/python3".
Java
For Java, you need to assemble all of the class files needed for your program into a jar file. Then run that jar file with "hadoop". For example, to run one of the examples you can do
hadoop jar /common/system/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar pi 3 10000
Everything after the jar file name is an argument passed to the program. In this case the jar file has a number of example programs. "pi" tells it to calculate the value of pi. 3 and 10000 specify the number of maps to use and the number of samples in each map. Please don't use more than 3 maps for MapReduce jobs unless you're sure no class is currently working on an assignment.
How to build a job
The following reference gives a sample program to try, WordCount.java: the MapReduce Tutorial. However, we need to modify its instructions for building and running slightly.
You don't need to set the environment variables mentioned. We do that for you.
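For reference, here's roughly what WordCount.java contains. It follows the WordCount example from the Apache MapReduce tutorial referenced above, with a few comments added:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in each input line
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] is the input file or directory, args[1] the output directory
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}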
If the program is in the file WordCount.java, do
hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
The first command compiles the program. com.sun.tools.javac.Main is the Java compiler. The output is left in a set of class files in the current directory, WordCount*.class.
The second command combines all the class files into a single jar file. The command to run jobs needs a jar file.
Note by the way that the program refers to a number of standard packages, e.g. "import org.apache.hadoop.conf.Configuration;". You don't need to worry about them; the system already has those packages. The reason you use the hadoop command to compile your program, rather than calling "javac" yourself, is that the hadoop command adds the system classes to the class path so you don't have to worry about them. To see what the system has built in, look for jar files in /common/system/hadoop/share. In addition to Hadoop-specific packages it includes a lot of standard open-source packages, e.g. from Apache Commons.
Once you have built the jar file, you can run it with "hadoop jar". This example program expects you to have a file "input" containing the text. It will count the words and put the counts into a file in the directory "output". The program creates the directory "output", so it must not already exist. If you run the program more than once, you'll have to remove "output" before running it again.
hadoop jar wc.jar WordCount input output
- "hadoop jar" is the command to run a program in a jar file
- wc.jar is the jar file with the program in it
- WordCount is the name of the main class.
- input is the input file. It can be any name you want. It can also be a directory with input files in it.
- output is the directory where the program will put the output. It can be any directory name, but it must not currently exist.
Using multiple map tasks
As noted above, the default configuration is 1 map task. That's fine for debugging, but to do any real work you want more than one. To set the number of tasks, add
import org.apache.hadoop.mapred.LocalJobRunner;
and then, in the section where the job.set... calls are made, add something like
LocalJobRunner.setLocalMaxRunningMaps(job, 10);
You can also set the number of reduce tasks, though that isn't as common. Use "setLocalMaxRunningReduces". Note that the system may not use that many map tasks if you don't have enough data to keep them busy.
In a cluster, you would also want to set the amount of memory allocated for each task. However in local mode, the JVM will expand as necessary, so that's not needed.
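Putting that together, a complete driver might look like the sketch below. The class name WordCountLocalDriver is made up for illustration; it reuses TokenizerMapper and IntSumReducer from the WordCount listing above, and the limits of 10 maps and 4 reduces are just example values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.LocalJobRunner;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountLocalDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountLocalDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Run up to 10 map tasks at a time in the local JVM
    // (the same effect as -D mapreduce.local.map.tasks.maximum=10).
    LocalJobRunner.setLocalMaxRunningMaps(job, 10);
    // Optionally allow several reduce tasks to run at a time as well.
    LocalJobRunner.setLocalMaxRunningReduces(job, 4);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Compile and package it together with WordCount, e.g. "hadoop com.sun.tools.javac.Main WordCount.java WordCountLocalDriver.java" followed by "jar cf wc.jar *.class", then run "hadoop jar wc.jar WordCountLocalDriver input output".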
Using command-line parameters
You might prefer not to hardcode the number of maps, but to allow it to be specified on the command line. There's a standard way to specify parameters, using Hadoop's generic -D option (this is handled by Hadoop's option parsing, not the JVM's -D flag). E.g.
hadoop jar wc.jar WordCount -D mapreduce.local.map.tasks.maximum=2 input out
However, the program has to implement Tool to make this work. Here is a modified version of the WordCount program that does that: WordCount.java
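The linked file isn't reproduced here, but the general shape of a Tool-based driver is sketched below. The class name WordCountTool is made up; the mapper and reducer are again reused from the WordCount listing above. ToolRunner strips the generic options (-D, -conf, etc.) out of the argument list before calling run(), so args[0] and args[1] are still the input and output paths.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountTool extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() already reflects any -D or -conf options given on the command line
    Job job = Job.getInstance(getConf(), "word count");
    job.setJarByClass(WordCountTool.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic options and then calls run() with what's left
    System.exit(ToolRunner.run(new Configuration(), new WordCountTool(), args));
  }
}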
Using a configuration file for a MapReduce job
If you want to specify options for your MapReduce job, but you don't like typing all the options each time, you can put the options in a file. For example (using parameters appropriate for a cluster, not for our installation), instead of
hadoop jar /common/system/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar pi -Dmapreduce.job.reduces=9 -Dmapreduce.map.memory.mb=4096 16 100000
you could use
hadoop jar /common/system/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar pi -conf mapred.xml 16 100000
This probably isn't worth it for one option, but might be if you use many. Here's mapred.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.job.reduces</name>
    <value>9</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
  </property>
</configuration>
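Incidentally, a program can also load such a file itself rather than relying on -conf. Here's a minimal sketch (the class name ShowConf is made up), assuming mapred.xml is in the current working directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ShowConf {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Merge the properties from mapred.xml into this configuration.
    conf.addResource(new Path("mapred.xml"));
    // The values from the file are now visible, e.g.:
    System.out.println(conf.get("mapreduce.job.reduces"));    // 9
    System.out.println(conf.get("mapreduce.map.memory.mb"));  // 4096
  }
}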