What I have to do

I'm trying to manipulate rather large amounts of data stored in Excel files (one of the workbooks has as many as 150 spreadsheets). The result of these manipulations may yield approximately 800,000 rows in a database table.

The problem

Data stored in the spreadsheets has an unpredictable format. The company that generated these spreadsheets had no fixed or documented format for exporting these files, and sometimes erroneous data appears. For example, most of the years are represented like "2009", but there are cases where a year is represented as "20". Another example: the data is not really normalized in these files, so I use separators to split the values of certain cells. Sometimes these separators change.

There are things like these that I couldn't predict, and I discovered them only after running an already evolved version of my program over a fairly large part of the available data.

The question

How can one test the correctness of a program in such a situation? Or rather, how can I achieve a reasonably stable version of the product without running it over all of the available data?

Shall I take a defensive approach and throw exceptions whenever some kind of unexpected issue arises? Then the main loop of the program could catch and log them and continue with the remaining data. This would yield some processed data, but it means that on a subsequent run of the program I'd have to check for what's already inside the database from previous runs (which I don't really like).
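Roughly what I have in mind for that main loop (a minimal sketch; parse_row and insert_row are placeholder names, not real code from my project):

```python
import logging

def process_workbook(rows, parse_row, insert_row):
    """Parse each row; log and skip the ones that fail instead of aborting."""
    failures = []
    for index, row in enumerate(rows):
        try:
            record = parse_row(row)   # may raise on unexpected data
            insert_row(record)        # write the normalized record to the database
        except Exception:
            logging.exception("row %d could not be processed: %r", index, row)
            failures.append((index, row))
    return failures                   # kept around for a later, manual pass
```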

What's your opinion? How would you tackle this problem?

+2  A: 

If there is no specification for the format of the data, then anything is acceptable.

If not, then there is either an explicit or an implicit specification of the data, and I would try to nail it down right now. If you can't get a definition of the data precise enough to write your program so that it can be expected to run without error, then you are taking a very large risk of causing serious damage, depending on how this data is being used.

You should write your program so that it either throws an exception or logs an error whenever it runs across data that does not meet the specification. Then run the program on PART of the available data until it runs without exceptions. This can be viewed as a training set for the development of your program. Then hold back some of the remaining data to use as a TEST set. This will give you an estimate of how many exceptions/errors your program will generate in production.
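A rough sketch of the idea in Python (validate_row stands in for whatever spec checks you end up writing):

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Hold back part of the data so the error rate can be estimated honestly."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def error_rate(rows, validate_row):
    """Fraction of rows that violate the (explicit or implicit) specification."""
    failures = sum(1 for row in rows if not validate_row(row))
    return failures / len(rows) if rows else 0.0

# Develop the parser against `train`; only look at `test` when you think you're done:
# train, test = split_train_test(all_rows)
# print("estimated production error rate:", error_rate(test, validate_row))
```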

Overfitting is a common machine learning concept, but it is useful in other tasks such as this one: program development. It surprises me how developers can write a bunch of unit tests, code their application to perform well on them, and then expect similar or bug-free performance in production.

If you're not willing to take all these steps (i.e. run your code on essentially all of the data, since the test set also draws on that data), then I would say the task is too large to do.

Larry Watanabe
+2  A: 

As an aside, rather than creating a definition of a very strange and peculiar format to account for all the "errors" in the current data, you might want to create a new, normalized (in the sense that these oddities are simplified away) specification for the data, and then write a "faulty document patcher" that can be run on faulty documents to fix them.
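For instance, the patcher could be a small pipeline of column-specific fixes run before the real import ever sees the data; the specific fixes and column names below are only illustrative:

```python
import re

def fix_two_digit_year(value):
    """Expand a bare two-digit year such as "20" into "2020" (the century is an assumption)."""
    return "20" + value if re.fullmatch(r"\d{2}", value) else value

def unify_separators(value, known=(";", "|"), canonical=","):
    """Rewrite the ad-hoc separators seen in the files to a single canonical one."""
    for sep in known:
        value = value.replace(sep, canonical)
    return value

# Which fix applies to which column is itself part of the new, normalized spec.
COLUMN_PATCHES = {
    "year": [fix_two_digit_year],
    "values": [unify_separators],
}

def patch_record(record):
    """Return a copy of the record with every known fix applied to its columns."""
    patched = dict(record)
    for column, fixes in COLUMN_PATCHES.items():
        if column in patched:
            for fix in fixes:
                patched[column] = fix(patched[column])
    return patched
```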

If the application generating the data is still in production, then you might need to go to the developers of this application to get buy-in on the new spec. Once you have that, you can start logging bugs against their application, so hopefully the faulty document patcher can eventually be retired.

More likely, I'm guessing, the original developers are long gone and no one understands the code anymore, if it is even still running at all.

Larry Watanabe
Sounds to me like the output database itself is the new "normalized" specification for the data?
MarkJ
MarkJ is right. What I've done so far is a program that consists of two parts: helpers that normalize this data, and the "actual" program that reads and inserts the data. It's just that for now they're not that separated structurally. They're pretty intertwined, and a little separation wouldn't hurt at all, but I'm afraid of a performance loss.
Ionuț G. Stan
Don't be afraid of a performance hit. Even if you just made two copies of the program, eliminated all the "helper" code from one and all the "actual" code from the other, it would probably run just about equally fast: it runs twice, but each run does only half the work. My guess is that the total run time is going to be somewhere between a minute and an hour. If you spend even one more hour chasing a bug caused by the interaction of the "helper" and "actual" code, you have lost any performance gain, since this is probably a run-once program.
Larry Watanabe
+1  A: 

One question: will you run your program more than once? From your question it sounds as though you may only want to run it once, and then work with the data in the database.

In which case you can be very defensive: throw exceptions whenever unexpected data appears. Run the program repeatedly on ever-larger sets of the data. Initially, solve any exceptions by altering the code; it's a good rule of thumb that the exceptions you find first are going to be the common ones. You might want to empty the output database between runs.

Later on, you will be finding rare exceptions that might only occur a couple of times in the input. Just solve these by hand and insert the corresponding rows in the database yourself. Or write another small program that reads your exception information and inserts the new rows, rather than running your whole big program again.
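That small replay program can be tiny if the big program writes its exceptions in a machine-readable form. A sketch, with the file layout and the parse/insert functions assumed rather than taken from the question:

```python
import csv

def replay_exceptions(exception_csv, parse_row, insert_row):
    """Re-run only the rows that failed last time, after they've been fixed by hand."""
    still_failing = []
    with open(exception_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            try:
                insert_row(parse_row(row))
            except Exception as exc:
                still_failing.append((row, str(exc)))
    return still_failing
```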

MarkJ
It's supposed to be run once, but based on previous experience I believe it will serve us on other occasions too. Otherwise, I did just what you described, except for the exception-parsing program, which might be a good idea.
Ionuț G. Stan
+2  A: 

How can one test the correctness of a program in such a situation? Or rather, how can I achieve a reasonably stable version of the product without running it over all of the available data?

For every single data type I would set reasonable constraints on the values it is allowed to take. If a cell violates these constraints, throw an exception containing the piece of data it failed on and its data type. When a piece of data violates its constraints, you can modify the source to include the additional constraints required for that piece of data, plus a conversion method to make it uniform.

To give an example using the year you mentioned: initially a year would have the constraint that it can only be four digits. When the program came across the "20" it would throw an exception. Then you could go and allow two-digit years, and add a method to convert two-digit years into four-digit ones to allow further processing.
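A sketch of that pattern for the year field (the exception class and the widening rule for two-digit years are my own naming and assumptions):

```python
import re

class ConstraintViolation(Exception):
    """Carries the offending value and the data type it failed against."""
    def __init__(self, value, data_type):
        super().__init__(f"{data_type}: unexpected value {value!r}")
        self.value = value
        self.data_type = data_type

def parse_year(cell):
    # Original constraint: exactly four digits.
    if re.fullmatch(r"\d{4}", cell):
        return int(cell)
    # Constraint added after the "20" case surfaced: accept two digits and
    # convert them, assuming they are meant to be years in the 2000s.
    if re.fullmatch(r"\d{2}", cell):
        return 2000 + int(cell)
    raise ConstraintViolation(cell, "year")
```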

CiscoIPPhone
A: 

How about processing every piece of data (so you don't have to check for dupes)? Rows that pass go into the database; the exceptions go into an exception file. The user can open the exception file and make corrections/modifications to the data, then run your program on the exception file.

This will isolate unhandled data for the user to correct and prevent you from processing the same data twice (or more).
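In code, that split might look something like this (the CSV layout and the parse/insert functions are illustrative, not from the question):

```python
import csv

def process_all(rows, parse_row, insert_row, exception_path="exceptions.csv"):
    """Rows that parse go into the database; the rest go into a file the user can fix."""
    with open(exception_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["row", "error"])
        for row in rows:
            try:
                insert_row(parse_row(row))
            except Exception as exc:
                writer.writerow([row, str(exc)])
```

Because every row ends up either in the database or in the exception file, a later run over the exception file alone can't create duplicates.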

Dick Kusleika
+1  A: 

Typically for this sort of thing I do as @MarkJ suggested, and I encode the whole thing in unit tests.

So I compose a small datafile that at first contains only a few rows of normal data. That's unit test number 1.

Then I take a quick visual scan of some of the data to spot any obvious exceptions. Unit tests 2 through n.

Finally, I write parser code until it passes all unit tests, and throws and logs exceptions for all un-managed data.

I then use these oddball bits of data to make new unit tests, and improve the parser until it can pass those too.

Sometimes, though, accommodating a really strange bit of data adds more parser complexity than it's worth, and I'll just log the exception, drop the data, and move on. This is a matter of professional judgment.
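To give a flavour of those tests (unittest-style; the sample rows and the parse_row under development are placeholders):

```python
import unittest

# Placeholder: in the real project this is the parser being developed against the tests.
def parse_row(row):
    raise NotImplementedError

class TestParser(unittest.TestCase):
    def test_normal_row(self):
        # Unit test 1: a few well-behaved rows copied from the real files.
        self.assertEqual(parse_row({"year": "2009", "values": "a,b,c"})["year"], 2009)

    def test_two_digit_year(self):
        # One of the oddball cases spotted during the visual scan.
        self.assertEqual(parse_row({"year": "20", "values": "a;b;c"})["year"], 2020)

if __name__ == "__main__":
    unittest.main()
```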

Sean Cavanagh