Testing your Hadoop jobs with MRUnit

Last Tuesday I gave a short presentation at the new Boulder Hadoopers Group about testing Hadoop jobs with MRUnit. You will need to know what Hadoop is and how to read Groovy code to follow it fully. I am including the important notes from the slides as well.

If your browser doesn’t support Flash, check out the slides on SlideShare.

Why use MRUnit?

Testing a Hadoop job requires a lot of effort that has nothing to do with the job itself. You must configure it to run locally, create a sample input file, run the job on your sample input, and then compare the result to an expected output file. This not only takes time, but also makes your tests run very slowly due to all the file I/O.

MRUnit is:

a unit test library designed to facilitate easy integration between your MapReduce development process and standard development and testing tools such as JUnit

With MRUnit, there are no test files to create, no configuration parameters to change, and generally less test code. You can cut the clutter and focus on the meat of your tests.

The Good

Hadoop tests are much simpler to write using MRUnit. Here’s an example of an entire test class:

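import org.apache.hadoop.io.Text
// Assumes Example uses the old (mapred) API; for the new API,
// import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver instead
import org.apache.hadoop.mrunit.MapReduceDriver
import org.junit.Before
import org.junit.Test

class ExampleTest {
  private Example.MyMapper mapper
  private Example.MyReducer reducer
  private MapReduceDriver driver

  // Build a fresh mapper, reducer, and driver before every test
  @Before void setUp() {
    mapper = new Example.MyMapper()
    reducer = new Example.MyReducer()
    driver = new MapReduceDriver(mapper, reducer)
  }

  // Feed one record through map and reduce, then check the single output record
  @Test void testMapReduce() {
    driver.withInput(new Text('key'), new Text('val'))
        .withOutput(new Text('foo'), new Text('bar'))
        .runTest()
  }
}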

You can test map and reduce separately, of course. You can also easily verify counters:

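driver.withInput(...)
driver.run()

// Counters are available once the job has run
def counters = driver.getCounters()

// getValue() returns a long, so compare against a long literal
assertEquals(1L, counters.findCounter('foo', 'bar').getValue())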
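
Testing map and reduce separately, as mentioned above, might look something like the sketch below. It assumes the new (mapreduce) API drivers, and it uses the stock Hadoop Mapper and Reducer base classes, which pass records through unchanged, purely as stand-ins for your own classes. The class and method names are made up for illustration.

```groovy
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Mapper
import org.apache.hadoop.mapreduce.Reducer
import org.apache.hadoop.mrunit.mapreduce.MapDriver
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver
import org.junit.Test

// Hypothetical test class; the names are illustrative only
class SeparateDriversTest {
  @Test void testMapperAlone() {
    // The stock Mapper emits each record unchanged;
    // swap in your own mapper (e.g. Example.MyMapper) here
    new MapDriver<Text, Text, Text, Text>(new Mapper<Text, Text, Text, Text>())
        .withInput(new Text('key'), new Text('val'))
        .withOutput(new Text('key'), new Text('val'))
        .runTest()
  }

  @Test void testReducerAlone() {
    // A reducer receives one key plus the list of values grouped under it;
    // the stock Reducer emits each (key, value) pair unchanged
    new ReduceDriver<Text, Text, Text, Text>(new Reducer<Text, Text, Text, Text>())
        .withInput(new Text('key'), [new Text('a'), new Text('b')])
        .withOutput(new Text('key'), new Text('a'))
        .withOutput(new Text('key'), new Text('b'))
        .runTest()
  }
}
```

If your job uses the old (mapred) API, the equivalent drivers should be the ones in org.apache.hadoop.mrunit rather than org.apache.hadoop.mrunit.mapreduce.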

There’s a mess of other cool stuff like MockReporter and MockInputSplit, but I mostly haven’t found a use for them or time to make a simple example.

The Bad

Before I tell you to go grab the latest distribution, I want you to know some of the problems we’ve encountered in the real world.

First and foremost, MRUnit is not useful for streaming jobs. If you only write streaming map-reduce jobs, you’ll have to test them the old-fashioned way.
Calling driver.runTest() doesn’t tell you what the failure was (it just throws an AssertionError). Instead, call def output = driver.run() and write your own assertions against the output.
The documentation sucks. There’s only one example, and the rest you basically have to figure out from the API.
setup() is called for the new Hadoop API (mapreduce packages) but not the old API (mapred packages). You have to call it yourself if you need it.
Finally, tests reuse the same JVM. So if you’re accidentally maintaining state in your job, you will be bitten!
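
To make the run()-and-assert workaround concrete, here is a rough sketch. It assumes the new (mapreduce) API driver and, so that it stands on its own, uses the stock pass-through Mapper and Reducer in place of real job classes. The test class name is made up.

```groovy
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Mapper
import org.apache.hadoop.mapreduce.Reducer
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver
import org.junit.Test
import static org.junit.Assert.assertEquals

// Hypothetical test class; the name is illustrative only
class ExplicitAssertTest {
  @Test void testOutputRecordByRecord() {
    // Stock pass-through Mapper and Reducer stand in for your own classes
    def driver = new MapReduceDriver<Text, Text, Text, Text, Text, Text>(
        new Mapper<Text, Text, Text, Text>(),
        new Reducer<Text, Text, Text, Text>())
    driver.withInput(new Text('key'), new Text('val'))

    // run() returns the emitted records as a list of Pair objects
    def output = driver.run()

    // Each assertion reports expected vs. actual when it fails
    assertEquals(1, output.size())
    assertEquals(new Text('key'), output[0].getFirst())
    assertEquals(new Text('val'), output[0].getSecond())
  }
}
```

It is more typing than runTest(), but a failing assertEquals names the record and the value that did not match.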

Conclusion

MRUnit makes writing tests for Hadoop easier. It has drawbacks, but they are far outweighed by the benefits.

Grab the latest MRUnit JAR

By the way, here’s how you test a streaming job:

./myMapper.py < test.input | sort | ./myReducer.py > actual.out
diff expected.out actual.out

Posted on 20 May 2010.