Last Tuesday I gave a short presentation at the new Boulder Hadoopers Group about testing Hadoop jobs with MRUnit. You will have to know what Hadoop is and how to read Groovy code to fully understand it. I am including the important notes on the slides as well.
If your browser doesn’t support Flash, check out the slides on SlideShare.
Why use MRUnit?
Testing a Hadoop job requires a lot of effort not related to the job. You must configure it to run locally, create a sample input file, run the job on your sample input, and then compare the result to an expected output file. This not only takes time, but also makes your tests run very slowly due to all the file I/O.
MRUnit is a unit test library designed to facilitate easy integration between your MapReduce development process and standard development and testing tools such as JUnit.
With MRUnit, there are no test files to create, no configuration parameters to change, and generally less test code. You can cut the clutter and focus on the meat of your tests.
Hadoop tests are much simpler to write using MRUnit. Here’s an example of an entire test class:
You can test map and reduce separately, of course. You can also easily verify counters:
There’s a mess of other cool stuff like MockReporter and MockInputSplit, but I mostly haven’t found a use for them or time to make a simple example.
Before I tell you to go grab the latest distribution, I want you to know about some of the problems we’ve encountered in the “real world”.
- First and foremost, MRUnit is not useful for streaming jobs. If you only write streaming map-reduce jobs, you’ll have to do it the old-fashioned way.
- Calling driver.runTest() doesn’t tell you what the failure was (it just throws an AssertionError). Instead, call def output = driver.run() and assert on the output yourself.
- The documentation sucks. There’s only one example, and the rest you basically have to figure out from the API.
- setup() is called for the new Hadoop API (mapreduce packages) but not the old API (mapred packages). You have to call it yourself if you need it.
- Finally, tests reuse the same JVM. So if you’re accidentally maintaining state in your job, you will be bitten!
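To make the last point concrete, here is a minimal sketch of how shared-JVM state can bite. It uses plain Java with no Hadoop or MRUnit classes, and every name in it (LeakyLineNumberer, CleanLineNumberer, SharedJvmStateDemo) is hypothetical, a stand-in for a mapper that accidentally keeps a static field.

```java
import java.util.ArrayList;
import java.util.List;

// Demo entry point: runs the same "test" twice inside one JVM.
class SharedJvmStateDemo {
    // Simulates one test: builds a fresh leaky mapper and feeds it the lines.
    static List<String> runLeaky(String... lines) {
        LeakyLineNumberer mapper = new LeakyLineNumberer();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(mapper.map(line));
        }
        return out;
    }

    // Same simulated test, but against the mapper without static state.
    static List<String> runClean(String... lines) {
        CleanLineNumberer mapper = new CleanLineNumberer();
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(mapper.map(line));
        }
        return out;
    }

    public static void main(String[] args) {
        // The second leaky run continues numbering where the first one stopped.
        System.out.println("leaky, 1st test: " + runLeaky("a", "b"));
        System.out.println("leaky, 2nd test: " + runLeaky("a", "b"));
        // The clean runs produce identical output every time.
        System.out.println("clean, 1st test: " + runClean("a", "b"));
        System.out.println("clean, 2nd test: " + runClean("a", "b"));
    }
}

// Hypothetical mapper-like class (not a Hadoop Mapper) with accidental static state.
class LeakyLineNumberer {
    private static int linesSeen = 0; // shared by every instance in the JVM

    String map(String line) {
        linesSeen++;
        return linesSeen + ":" + line;
    }
}

// Same logic with an instance field, so each new mapper starts from zero.
class CleanLineNumberer {
    private int linesSeen = 0;

    String map(String line) {
        linesSeen++;
        return linesSeen + ":" + line;
    }
}
```

A test like this can pass when run alone and fail when run as part of a suite, which is usually the first hint that the job is holding on to state it shouldn't.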
MRUnit makes writing tests for Hadoop easier. It has drawbacks, but they are far outweighed by the benefits.
By the way, here’s how you test a streaming job:
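The old-fashioned way boils down to pushing sample input through the mapper, sorting, pushing the result through the reducer, and comparing with what you expected. Below is a rough sketch of that idea in plain Java, assuming your mapper and reducer are executables that read stdin and write stdout. The class and method names are made up for illustration, and cat and uniq are only placeholders for real scripts (so it assumes a Unix-like machine).

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

class StreamingJobCheck {
    // Runs one external command, feeding `input` on stdin and returning stdout.
    // Fine for small samples; large inputs would need stdin and stdout handled
    // on separate threads so the pipe buffers cannot fill up.
    static String pipe(String input, String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            try (OutputStream stdin = process.getOutputStream()) {
                stdin.write(input.getBytes(StandardCharsets.UTF_8));
            }
            String output = new String(
                    process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            if (process.waitFor() != 0) {
                throw new IllegalStateException("command failed: " + String.join(" ", command));
            }
            return output;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running " + command[0], e);
        }
    }

    // Stand-in for the shuffle phase: sorts the mapper's output lines.
    static String sortLines(String text) {
        if (text.isEmpty()) {
            return "";
        }
        return text.lines().sorted().collect(Collectors.joining("\n", "", "\n"));
    }

    // mapper | sort | reducer, entirely on the local machine.
    static String runJob(String input, String[] mapper, String[] reducer) {
        return pipe(sortLines(pipe(input, mapper)), reducer);
    }

    public static void main(String[] args) {
        // Placeholders: swap in the commands for your own mapper and reducer.
        String[] mapper = {"cat"};
        String[] reducer = {"uniq"};
        String actual = runJob("b\na\nb\n", mapper, reducer);
        String expected = "a\nb\n";
        if (!actual.equals(expected)) {
            throw new AssertionError("expected [" + expected + "] but got [" + actual + "]");
        }
        System.out.println("streaming job output matches");
    }
}
```

Sorting in Java rather than shelling out to sort keeps the ordering independent of the machine's locale, though it may not match Hadoop's own key ordering in every case.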