Hello,

I am struggling a bit with how to unit test file parsing... Let's say I have a file with 25 columns that could be anywhere from 20 to 1000 records long... How do I write a unit test against that? The function takes the file contents as a string parameter and returns a DataTable with the parsed data...

The best I can come up with is parsing a 4-record file and only checking the top-left and bottom-right 'corners'... e.g. the first few fields of the 2 top records and the last few fields of the 2 bottom records... I can't imagine tediously hand-typing assert statements for every single field in the file. And doing just one record with every field checked seems just as weak, since it doesn't cover multi-record files or unexpected data.
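To make that concrete, the kind of test I have now looks roughly like this (the parser, file name, field names, and values are just placeholders):

    using System.Data;
    using System.IO;
    using NUnit.Framework;

    [TestFixture]
    public class FileParserTests
    {
        [Test]
        public void Parse_FourRecordFile_ChecksTheCorners()
        {
            // 4-record sample file checked in next to the tests
            string fileContents = File.ReadAllText(@"TestData\four_records.txt");
            DataTable table = FileParser.Parse(fileContents);   // stand-in for my parsing function

            Assert.AreEqual(4, table.Rows.Count);

            // first few fields of the 2 top records
            Assert.AreEqual("ACME", table.Rows[0]["CustomerName"]);
            Assert.AreEqual("ACME Subsidiary", table.Rows[1]["CustomerName"]);

            // last few fields of the 2 bottom records
            Assert.AreEqual("USD", table.Rows[2]["Currency"]);
            Assert.AreEqual("USD", table.Rows[3]["Currency"]);
        }
    }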

That seemed 'good enough' at the time... However, now I'm working on a new project that is essentially parsing various PDF files coming in from 10 different sources, and each source has 4-6 different formats for its files, so about 40-60 parsing routines. We may eventually fully automate 25 additional sources down the road. We take the PDF and convert it to Excel using a 3rd-party tool, then we sit and analyze the patterns in the output and write the code that calls the tool's API, takes the Excel file, and parses it - stripping out the garbage, rearranging data that's in different places, cleaning it up, etc.

Realistically, how can I unit test something like this?

+3  A: 

I am not sure I fully understand the problem, but here is one idea. Collect a bunch of sample files that represent diverse formats and edge cases. Run the conversion to produce your DataTables and manually inspect them the first time to ensure they are correct. Then serialize the DataTables to XML and store them in your unit test suite along with your test case PDF files.

Your automated unit tests could perform the conversion from PDF to DataTable and compare the results against the respective "approved" serialized DataTable representation.
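A rough sketch of what such a test could look like in C# with NUnit; ConvertPdfToDataTable and the file names are just placeholders for your own routine and sample data:

    using System.Data;
    using System.IO;
    using NUnit.Framework;

    [TestFixture]
    public class PdfConversionTests
    {
        [Test]
        public void SourceA_Format1_MatchesApprovedDataTable()
        {
            // Run the real conversion against a sample document stored with the tests.
            DataTable actual = PdfConverter.ConvertPdfToDataTable(@"TestData\SourceA_Format1.pdf");

            // Serialize the result the same way the approved version was serialized.
            // (DataTable.TableName must be set for WriteXml to work.)
            var actualXml = new StringWriter();
            actual.WriteXml(actualXml, XmlWriteMode.WriteSchema);

            // The approved XML was generated once, inspected by hand, and committed.
            string approvedXml = File.ReadAllText(@"TestData\SourceA_Format1.approved.xml");

            Assert.AreEqual(approvedXml, actualXml.ToString());
        }
    }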

You could build up a library of test documents over time using this method. Failures in your unit tests would indicate that changes to the parsing routines have broken a particular edge case.


For the second example, if the Excel spreadsheets are not too complicated, you could try to create a cell-by-cell comparison routine; perhaps you could wrap it into a custom Assert.AreExcelWorksheetsEqual(). You are right though, a checksum might work just as well.
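For illustration only, such a routine might look something like this from .NET via Excel interop (it assumes both used ranges contain more than one cell, since Value2 returns a scalar for a single cell):

    using Microsoft.Office.Interop.Excel;
    using NUnit.Framework;

    public static class ExcelAssert
    {
        // Compares the used ranges of two worksheets cell by cell.
        public static void AreExcelWorksheetsEqual(Worksheet expected, Worksheet actual)
        {
            // COM hands back 1-based object[,] arrays for multi-cell ranges.
            object[,] exp = (object[,])expected.UsedRange.Value2;
            object[,] act = (object[,])actual.UsedRange.Value2;

            Assert.AreEqual(exp.GetLength(0), act.GetLength(0), "Row count differs");
            Assert.AreEqual(exp.GetLength(1), act.GetLength(1), "Column count differs");

            for (int row = 1; row <= exp.GetLength(0); row++)
                for (int col = 1; col <= exp.GetLength(1); col++)
                    Assert.AreEqual(exp[row, col], act[row, col],
                        string.Format("Mismatch at row {0}, column {1}", row, col));
        }
    }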

zac
This is a great idea - I hadn't thought of serializing/deserializing to XML. Then I don't need one Assert() call for every cell in the entire file... just a single assert (or, if necessary, a loop) to make sure it matches.
dferraro
There's one 'catch' though. In my first example I was talking about a .NET application. However, this new project with the 40 or so 'scrubbing scripts' is written in VBA... The input is an Excel spreadsheet and the output is an Excel spreadsheet... how could I serialize this? Maybe do a checksum on the entire file?
dferraro
+2  A: 

When you have to build unit tests around a sample of data, use a second sample of expected output data. 10K lines of text or a megabyte of binary - it does not matter.

You can just prepare an input sample and the expected output data, no matter what size. Store them in files/scripts next to your source code. Include in the test the steps of fetching the data sample, processing it, and comparing the output bit for bit with the expected result using some generic comparison tool or SQL statement.
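For example, the comparison step can be as blunt as a byte-for-byte check of the produced file against the stored expected one (a sketch; the paths and test framework are placeholders):

    using System.IO;
    using NUnit.Framework;

    [TestFixture]
    public class ParserOutputTests
    {
        [Test]
        public void Output_MatchesStoredExpectedSample()
        {
            // The expected file was produced once, verified by hand, and stored next to the test.
            byte[] expected = File.ReadAllBytes(@"TestData\expected_output.xml");
            byte[] actual = File.ReadAllBytes(@"TestOutput\actual_output.xml");

            CollectionAssert.AreEqual(expected, actual, "Output differs from the approved sample");
        }
    }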

RocketSurgeon