Learning Hadoop
I have been trying to get my head round Hadoop, specifically Map-Reduce. I set everything up a while ago and had already started forgetting important things so this is an attempt to get down all the things I learned today before I forget them, in no particular order.
Writing a map reduce program
When I installed Hadoop barely a month ago it was at version 0.20.1; as of writing this, the stable release is 0.20.2.
Unfortunately the official Hadoop tutorial at Apache is already out of date and shows the use of deprecated calls. Thankfully Yi Wang has updated it for 0.20.1.
Overview of a map reduce program
In both Yi's tutorial and the Apache one, the entire map reduce program is contained within a single outer class. Within this there are two inner classes, one for handling the map stage and one for handling the reduce stage. There is also a main method in the container class, which seems to hold all the job configuration options. These seem to be:
- job.setJarByClass(UrlCollector.class);
- job.setMapperClass(Map.class);
- job.setReducerClass(Reduce.class);
- job.setOutputKeyClass(Text.class);
- job.setOutputValueClass(IntWritable.class);
The Mapper
In these examples the mapper class is a public static inner class which extends the Hadoop core Mapper class, org.apache.hadoop.mapreduce.Mapper. This class is generic, so you pass it some type information at the class declaration:
public static class Map extends Mapper <Object, Text, Text, IntWritable>{
According to the javadoc the type parameters are: the input keys type, the input values type, the output keys type and the output values type.
The mapper class (Map) doesn’t have a constructor but simply overrides the Mapper method map
@Override public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
The Map.map() method takes a key and a value typed to match the first two type parameters of Mapper<ik,iv,ok,ov>. It also takes an argument of type Context. The Context type is an inner class of Mapper and it seems to be where you pass back output key-values for passing on to the reduce stage. So in this example we are contracted to emit a Text key and an IntWritable value, so the end section of the map method would roughly look like this:
Text t = new Text();
t.set("some-text-based-key");
IntWritable i = new IntWritable(1);
context.write(t, i);
So to summarise: the mapper class extends Mapper and implements its own map method. The map method takes a key and a value as well as a context. You process your keys and values in whatever way your mapping phase requires and output the intermediate keys and values via the context object using the write(k,v) method.
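The shape of that map step can be sketched without Hadoop on the classpath. This is only an illustration under assumptions: the class name MapSketch is made up, plain String and Integer stand in for Text and IntWritable, and a BiConsumer called emit plays the role of context.write(k, v). It is not the real Mapper API, just the same flow for a word count.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.function.BiConsumer;

// Hypothetical, dependency-free stand-in for a word count mapper.
final class MapSketch {

    // Mirrors map(key, value, context): 'emit' stands in for context.write(k, v).
    static void map(Object key, String value, BiConsumer<String, Integer> emit) {
        StringTokenizer tokens = new StringTokenizer(value);
        while (tokens.hasMoreTokens()) {
            // One intermediate (word, 1) pair per token; the input key is ignored.
            emit.accept(tokens.nextToken(), 1);
        }
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        map(0L, "hello world hello", (k, v) -> intermediate.add(new SimpleEntry<>(k, v)));
        System.out.println(intermediate); // prints [hello=1, world=1, hello=1]
    }
}
```

Note that the mapper does no counting of its own: it emits a 1 for every occurrence and leaves the summing to the reduce stage.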
The Reducer
Much the same as the mapper, the reducer is a public static inner class and it extends Reducer<keyin, valuein, keyout, valueout>. As you can see Reducer has its types parameterized too. The reducer class (our one) overrides Reducer.reduce(Object, Iterable, Context). Before attempting to explain any further I want to quickly backtrack to what will happen when this is run.
First up, of course, the mapper is called, does all its funky mapping and spits out a bunch of keys and values. Once all of the mappers have finished and their output is sorted on the key, the reduce phase gets to start. The framework moves the mappers' output over to the reducers. This entails collecting up all the keys that are the same across the whole cluster and merge sorting them into a file for the reducer to ingest.

In this example we have 3 possible key values a, b and c. After each mapper has run all the values keyed on a get copied to the green intermediary file, they are also sorted during this process so the green intermediary file ends up with all the a values in order. The same happens for the b values, which end up in the red file and c values which end up in the blue/green file at the bottom. In this case each reducer then just deals with one intermediary file reducing down all the a values into a final output file. To contextualise it, the word count example would fill the green file with a stack of keys and values that would all be hello 1. The reducer stage would pick up that file, run through it summing all the values for that key. Thus the output file would contain the key, hello, and the sum of the intermediary values i.e. the number of occurrences of hello.
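The collecting-up step can be mimicked in a few lines of plain Java. This is a rough in-memory sketch, not how Hadoop actually does it (the real framework works with files spread across the cluster). The class name ShuffleSketch and the helper pair are made up for illustration, and a TreeMap stands in for the sort on keys.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the step between map and reduce.
final class ShuffleSketch {

    // Gathers the pairs from every mapper and groups the values by key, keys in sorted order.
    static SortedMap<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapperOutputs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapperOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Small helper to build an intermediate (key, value) pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapper1 = Arrays.asList(pair("b", 1), pair("a", 1), pair("a", 1));
        List<Map.Entry<String, Integer>> mapper2 = Arrays.asList(pair("c", 1), pair("a", 1), pair("b", 1));
        // Each key ends up with the values from both mappers.
        System.out.println(shuffle(Arrays.asList(mapper1, mapper2))); // prints {a=[1, 1, 1], b=[1, 1], c=[1]}
    }
}
```

Each entry of the resulting map corresponds to one of the intermediary files described above: one key, plus every value any mapper emitted for it.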
Given this brief detour we can now say a little more about the arguments for the reduce method. The first argument is the key, its type being the input key type we specified when extending Reducer<ik, iv, ok, ov>. The next argument is an Iterable over values of the type we specified for the input values when extending Reducer. If we look back at the detour this should make a little more sense, as what we are getting is a bunch of values that have a shared key. Thus the arguments for this method give us a key and the bunch of values associated with it. The final argument, as with the mapper, is a context object, and this is what we write our reduced key-value pairs back to, to produce the final output.
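For word count, the reduce step boils down to summing the values for one key. Here is a minimal sketch in plain Java, again under assumptions: ReduceSketch is a made-up name, String and Integer stand in for Text and IntWritable, and the BiConsumer emit plays the part of context.write(k, v). The real method would be reduce(Text key, Iterable<IntWritable> values, Context context).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical, dependency-free stand-in for a word count reducer.
final class ReduceSketch {

    // Mirrors reduce(key, values, context): called once per key with all of that key's values.
    static void reduce(String key, Iterable<Integer> values, BiConsumer<String, Integer> emit) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        // One final (key, total) pair per key; 'emit' stands in for context.write(k, v).
        emit.accept(key, sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> output = new LinkedHashMap<>();
        reduce("hello", Arrays.asList(1, 1, 1), output::put);
        System.out.println(output); // prints {hello=3}
    }
}
```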
Compiling
So I knocked up this code in NetBeans and hit clean and build, ran through all the steps I’m about to describe and it simply didn’t work. I kept getting:
Exception in thread…
Caused by: java.util.zip.ZipException: error in opening zip file
…
So the only lesson I could draw is that I need to compile it manually to get it to work. Before compiling the .java file I added some variables to my .bashrc file to make some of it a little easier:
# hadoop path
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_VERSION=0.20.1
export PATH=$PATH:/usr/local/hadoop/bin
Now to compile the .java file for hadoop
- Copy the .java file over to my hadoop box (not running it locally yet)
- In the same folder as .java file create a folder called classes (mkdir classes)
- Then call the compiler
javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar:commons-cli-1.2.jar -d classes UrlCollector.java && jar -cvf urlcollector.jar -C classes/ .
So let's dissect this call bit by bit.
- javac – that's the Java compiler.
- -classpath – here we give it a path to the libraries, using some of the variables we configured in the .bashrc file. We are using the variables to write out the path to the Hadoop core jar file, which on my machine is /usr/local/hadoop/hadoop-0.20.1-core.jar.
- :someOtherJar – we separate other libraries with a colon, in this case the Apache Commons CLI library, which seems to be a dependency, not sure about that yet though.
- -d classes – tells the compiler to put all the generated class files in the classes folder (remember step two?)
- UrlCollector.java – that was the name of the Java file I needed to compile, the one containing the map and reduce functions as well as the main method.
- jar -cvf urlcollector.jar -C classes/ . – the part after the && builds the jar: c creates a new archive, v gives verbose output and f names the archive file (urlcollector.jar). -C classes/ tells jar to change into the classes folder, and the final period (.) means add everything found there. All importantly, DO NOT forget the period at the end. Without that seemingly insignificant little dot it won't work.
Great so now we have a jar file that should work and we have some input files – in the case of the word count tutorial this is a couple of files with some words in them!
Running the job
So we have all our bits and pieces and we can now get ready to run this job. The first thing to note is that we must copy our input files into the Hadoop file system but we leave our jar on the host file system. Now that we have added the Hadoop bin folder to the path we can call the hadoop command more easily. Let's copy up the source files:
hadoop dfs -copyFromLocal /local/data /hadoop/fs/location/
Now we are ready for the magic, let's run the jar:
hadoop jar urlcollector.jar com.maxgarfinkel.mapReduce.UrlCollector /user/hadoop/urlCol/in /user/hadoop/urlCol/out
Let's break this down.
- hadoop jar – we call the jar subcommand, which is basically the "run this jar" command.
- urlcollector.jar – that's the name of the jar I want to run (include the path if it's not in your working directory).
- com.maxgarfinkel.mapReduce.UrlCollector – the fully qualified class name: the package my UrlCollector class lives in, followed by the class name.
- /user/hadoop/urlCol/in - that is the hdfs path to the folder containing the input data we copied up in the previous step.
- /user/hadoop/urlCol/out – this is the hdfs path to the folder where I want the output to be put. I haven’t created this folder, I will let hadoop take care of that.
So this is it written out in a more generic manner
hadoop jar local/path/to/jar full.package.and.class.name hdfs/source/data/location hdfs/output/location
So once our job has run we can pick up our output using the hadoop dfs -copyToLocal command.
About this entry
You’re currently reading “Learning Hadoop,” an entry on random()
- Published:
- 26.02.10 / 8pm
- Category:
- Hadoop