views:

43

answers:

2

A long time ago, I had to test a program generating a postscript file image. One quick way to figure out if the program was producing the correct, expected output was to do an md5 of the result to compare against the md5 of a "known good" output I checked beforehand.

Unfortunately, Postscript contains the current time within the file. This time is, of course, different depending on when the test runs, therefore changing the md5 of the result even if the expected output is obtained. As a fix, I just stripped off the date with sed.

This is a nice and simple scenario. We are not always so lucky. For example, now I am programming a writer program, which creates a big fat RDF file containing a bunch of anonymous nodes and uuids. It is basically impossible to check the functionality of the whole program with a simple md5, and the only way would be to read the file with a reader, and then validate the output through this reader. As you probably realize, this opens a new can of worms: first, you have to write a reader (which can be time consuming), second, you are assuming the reader is functionally correct and at the same time in sync with the writer. If both the reader and the writer are in sync, but on incorrect assumptions, the reader will say "no problem", but the file format is actually wrong.

This is a general issue when you have to perform functional testing of a file format, and the file format is not completely reproducible through the input you provide. How do you deal with this case?

+1  A: 

In the past I have used a third party application to validate such output (preferably converting it into some other format which can be mechanically verified). The use of a third party ensures that my assumptions are at least shared by others, if not strictly correct. At the very least this approach can be used to verify syntax. Semantic correctness will probably require the creation of a consumer for the test data which will likely always be prone to the "incorrect assumptions" pitfall you mention.

Perry
+1  A: 

Is the randomness always in the same places? I.e. is most of the file fixed but there are some parts that always change? If so, you might be able to take several outputs and use a programmatic diff to determine the nondeterministic parts. Once those are known, you could use the information to derive a mask and then do a comparison (md5 or just a straight compare). Think about pre-processing the file to remove (or overwrite with deterministic data) the parts that are non-deterministic.

If the whole file is non-deterministic then you'll have to come up with a different solution. I did testing of MPEG-2 decoders which are non-deterministic. In that case we were able to do a PSNR and fail if it was above some threshold. That may or may not work depending on your data but something similar might be possible.

Steve Rowe
It depends. In my specific case of the RDF file, I cannot really know how the object will be serialized to XML. I can mask the variable parts and compare the rest, but the order could be different. However, your answer is interesting. The MPEG-2 testing is clearly one case.
Stefano Borini