I'm trying to figure out how to test-drive software that launches external processes which take file paths as input and, after lengthy processing, write their output to stdout or to a file. Are there common patterns for writing tests in this kind of situation? It is hard to create fast-executing tests that verify correct usage of the external tools without launching the actual tools in the tests and inspecting their results.
Assuming the external programs are well-tested, you should just test that your program is passing the correct data to them.
You could memoize (http://en.wikipedia.org/wiki/Memoization) the external processes. Write a wrapper in Ruby that computes the MD5 sum of the input file and checks it against a database of known checksums. If it matches one, copy over the right output; otherwise, invoke the tool normally.
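For illustration, a minimal sketch of such a wrapper, assuming the real tool takes a single file argument and writes to stdout (the tool name and cache location are made up):

    #!/usr/bin/env ruby
    # memoized_tool.rb -- hypothetical stand-in for a slow external tool.
    require 'digest'
    require 'fileutils'

    CACHE_DIR = File.expand_path('~/.tool_cache')
    REAL_TOOL = 'some_slow_tool'   # assumed name of the real executable

    input_path = ARGV.fetch(0)
    key = Digest::MD5.file(input_path).hexdigest
    cached = File.join(CACHE_DIR, key)

    if File.exist?(cached)
      # Known input: replay the previously captured output instead of re-running.
      print File.read(cached)
    else
      # Unknown input: run the real tool, capture its stdout, and remember it.
      output = IO.popen([REAL_TOOL, input_path], &:read)
      abort("#{REAL_TOOL} failed") unless $?.success?
      FileUtils.mkdir_p(CACHE_DIR)
      File.write(cached, output)
      print output
    end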
Test right up to your boundaries. In your case, the boundary is the command line that you construct to invoke the external program (which you can capture by monkey patching). If you're reading that program's stdout (or processing its results by reading files), that's another boundary: the test there is whether your program can process that "input".
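As a sketch of the first boundary, the call to Kernel#system can be monkey patched on a single instance so the constructed command line can be asserted; the Converter class and the convert_tool arguments are invented for the example:

    require 'minitest/autorun'

    class Converter
      def convert(path)
        system('convert_tool', '--fast', path)   # the boundary we want to observe
      end
    end

    class ConverterTest < Minitest::Test
      def test_builds_the_expected_command_line
        captured = nil
        converter = Converter.new
        # Monkey patch system on just this instance to record the call.
        converter.define_singleton_method(:system) do |*args|
          captured = args
          true
        end

        converter.convert('in.txt')

        assert_equal ['convert_tool', '--fast', 'in.txt'], captured
      end
    end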
I think you are running into a common issue with unit testing: correctness is really determined by whether the integration works, so how does a unit test help you?
The basic answer is that the unit test verifies that the parameters you intend to pass to the command-line tool are in fact passed that way, and that the results you anticipate getting back are in fact processed the way you intend.
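For example, if the code under test takes an injected "runner" for the shell-out, a single unit test can pin down both halves. The WordCounter class here is invented for the sketch, with wc -w standing in for the real tool:

    require 'minitest/autorun'

    class WordCounter
      def initialize(runner)
        @runner = runner
      end

      def count(path)
        # Pass the intended arguments; parse the anticipated "  42 file" output.
        @runner.call('wc', '-w', path).split.first.to_i
      end
    end

    class WordCounterTest < Minitest::Test
      def test_passes_the_path_and_parses_the_count
        runner = lambda do |*args|
          assert_equal ['wc', '-w', 'essay.txt'], args   # the intended parameters
          "  42 essay.txt\n"                             # canned tool output
        end

        assert_equal 42, WordCounter.new(runner).count('essay.txt')
      end
    end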
Then there is a second level of tests, preferably automated where practical, at the functional level: the real utilities are called, so you can see that what you intend to pass and what you anticipate getting back match what actually happens.
There would also be nothing wrong with a set of tests that "test" the external tools themselves (perhaps run on a different schedule, or only when you upgrade those tools) to establish your assumptions: pass in raw input and assert that you get back the expected raw output. That way, if you upgrade a tool, you can catch any behavior changes that may affect you.
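Such an assumption-pinning test might look like the sketch below. It calls the real tool, so it is gated behind an environment variable; the wc example and the fixture file are made up:

    require 'minitest/autorun'

    class ExternalToolContractTest < Minitest::Test
      def test_wc_reports_the_word_count_first
        skip 'slow contract test' unless ENV['RUN_CONTRACT_TESTS']

        output = IO.popen(['wc', '-w', 'fixtures/three_words.txt'], &:read)

        assert $?.success?
        assert_match(/\A\s*3\b/, output)   # the output format our parser relies on
      end
    end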
You have to decide whether that last set of tests is worthwhile. It very much depends on the tools involved.
The 90%-case answer would be to mock the external command-line tools and verify that the right input is being passed to them at the dividing interface between the two. This helps keep the test suite fast. Also, you shouldn't have to bring in the command-line tools, since they are not 'your code under test'; doing so introduces the possibility that the unit test could fail either due to changes in your code or due to some change in the behavior of the command-line utility.
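As a sketch of mocking at that dividing interface with Minitest::Mock; the ReportBuilder class and its tool_runner collaborator are assumptions made for the example:

    require 'minitest/autorun'

    class ReportBuilder
      def initialize(tool_runner)
        @tool_runner = tool_runner
      end

      def build(path)
        @tool_runner.run('report_tool', '--input', path)
      end
    end

    class ReportBuilderTest < Minitest::Test
      def test_invokes_the_tool_with_the_right_input
        runner = Minitest::Mock.new
        runner.expect(:run, 'report text', ['report_tool', '--input', 'data.csv'])

        ReportBuilder.new(runner).build('data.csv')

        runner.verify   # fails if the expected call never happened
      end
    end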
But it seems like you're having trouble defining the 'right input', in which case an optimization like memoization (as Dave suggests) might give you the best of both worlds.