I generally try to use unit tests for any code that has easily defined correct behavior given some reasonably small, well-defined set of inputs. This works quite well for catching bugs, and I do it all the time in my personal library of generic functions.

However, a lot of the code I write is data mining code that basically looks for significant patterns in large datasets. Correct behavior in this case is often not well defined and depends on a lot of different inputs in ways that are not easy for a human to predict (i.e. the math can't reasonably be done by hand, which is why I'm using a computer to solve the problem in the first place). These inputs can be very complex, to the point where coming up with a reasonable test case is near impossible. Identifying the edge cases that are worth testing is extremely difficult. Sometimes the algorithm isn't even deterministic.

Usually, I do the best I can by using asserts for sanity checks and creating a small toy test case with a known pattern, then informally checking whether the answer at least "looks reasonable", without it necessarily being objectively correct.
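For reference, this is roughly what that looks like (a minimal sketch in Python; find_clusters is a trivial stand-in for my real code, and the assertions are deliberately loose sanity checks rather than exact answers):

```python
import random
import unittest

def find_clusters(points, k):
    """Trivial stand-in for my real routine (ignores k): split on the mean x-coordinate."""
    cutoff = sum(p[0] for p in points) / len(points)
    return [0 if p[0] < cutoff else 1 for p in points]

class ToyPatternTest(unittest.TestCase):
    def test_planted_blobs_are_separated(self):
        random.seed(42)  # pin the RNG so nondeterministic runs are repeatable
        # Two obvious blobs far apart: any reasonable clustering should split them.
        blob_a = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]
        blob_b = [(random.gauss(100, 1), random.gauss(100, 1)) for _ in range(50)]
        labels = find_clusters(blob_a + blob_b, k=2)
        # Sanity checks, not exact answers: one label per point,
        # and no cluster mixes points from the two blobs.
        self.assertEqual(len(labels), 100)
        self.assertEqual(set(labels[:50]) & set(labels[50:]), set())

if __name__ == "__main__":
    unittest.main()
```

Is there any better way to test these kinds of cases?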

A: 

Ultimately, you have to decide what your program should be doing, and then test for that.

Jay Bazuzi
-1 He asked how to test, and you told him to test. Perhaps you could elaborate.
+3  A: 

I think you just need to write unit tests based on small sets of data that will make sure that your code is doing exactly what you want it to do. Whether this gives you a reasonable data-mining algorithm is a separate issue, and I don't think it is possible to solve it with unit tests. There are two "levels" of correctness of your code:

  1. Your code correctly implements the given data mining algorithm (this is the part you should unit-test; see the sketch below).
  2. The data mining algorithm you implement is "correct", i.e. it actually solves the business problem. This is quite an open question: it probably depends both on the parameters of your algorithm and on the actual data (different algorithms work for different types of data).
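For the first level, here is a minimal sketch in Python, using an invented support-counting step of the kind you would find in a frequent-itemset miner; on data this small the expected counts can be verified by hand:

```python
from collections import Counter
from itertools import combinations

def count_pair_support(transactions):
    """Count how many transactions contain each (sorted) pair of items."""
    support = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            support[pair] += 1
    return support

def test_pair_support_on_hand_checked_data():
    transactions = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"eggs"}]
    support = count_pair_support(transactions)
    assert support[("bread", "milk")] == 2  # present in the first two baskets
    assert support[("bread", "eggs")] == 1
    assert support[("eggs", "milk")] == 1
```

Whether this counting step, wired into the full algorithm, finds patterns worth acting on is the second level, and no unit test will settle that.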
Grzenio
+1  A: 

Well, there are a few answers. First of all, as you mentioned, take a small case study, and do the math by hand. Since you wrote the algorithm, you know what it's supposed to do, so you can do it in a limited case.

The other is to break down every component of your program into testable parts. If A calls B, B calls C, and C calls D, and you know that A, B, C, and D each give the right answer in isolation, then by testing the interactions A->B, B->C, and C->D you can be reasonably sure that the whole chain A->D gives the correct response.
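To illustrate the decomposition, here is a minimal sketch in Python (the stage names and logic are invented): each stage gets its own test, and the pipeline function is thin enough that testing the stages effectively covers the chain.

```python
def normalize(records):
    """Stage A: scale values linearly onto the range [0, 1]."""
    lo, hi = min(records), max(records)
    return [(r - lo) / (hi - lo) for r in records]

def threshold(values, cutoff=0.5):
    """Stage B: keep only values strictly above the cutoff."""
    return [v for v in values if v > cutoff]

def pipeline(records):
    """A -> B: trivially thin wiring, so the stage tests carry the weight."""
    return threshold(normalize(records))

def test_normalize():
    assert normalize([10, 20, 30]) == [0.0, 0.5, 1.0]

def test_threshold():
    assert threshold([0.1, 0.6, 0.9]) == [0.6, 0.9]

def test_pipeline_end_to_end():
    assert pipeline([10, 20, 30]) == [1.0]
```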

Also, if there are other programs out there that do what you are looking to do, try to acquire their datasets, or find an open-source project whose test data you can run your application against, and see if it gives similar results.

Another way to test data mining code is to take a test set, introduce a pattern of the type you're looking for, and test again to see whether the code separates the new pattern from the old ones.
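For instance, a sketch in Python, with detect_anomalies as an invented stand-in: run the detector on a clean baseline, plant one obvious outlier, and assert that exactly the planted point is flagged.

```python
def detect_anomalies(values, z=3.0):
    """Stand-in detector: flag indices more than z standard deviations from the mean."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return {i for i, v in enumerate(values) if std and abs(v - mean) > z * std}

def test_injected_spike_is_detected():
    baseline = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8] * 10
    assert detect_anomalies(baseline) == set()    # clean data: nothing flagged
    spiked = baseline + [50.0]                    # plant one obvious outlier
    assert detect_anomalies(spiked) == {len(baseline)}  # only the plant is found
```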

And, the tried and true, walk through your own code by hand and see if the code is doing what you meant it to do.

+3  A: 

When facing cases like this I tend to build one or more stub data sets that reflect the proper underlying complexities of the real-life data. I often do this together with the customer, to make sure I capture the essence of the complexities.

Then I can codify these into one or more datasets that can be used as the basis for very specific unit tests (sometimes they're more like integration tests with stub data, but I don't think that's an important distinction). So while your algorithm may have "fuzzy" results for a "generic" dataset, these algorithms almost always have a single correct answer for a specific dataset.
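As an illustration (the orders and the top_cooccurring_item function here are hypothetical, not from a real engagement):

```python
from collections import Counter

def top_cooccurring_item(orders, item):
    """Stand-in miner: the item most often bought together with `item`."""
    counts = Counter(
        other
        for order in orders
        if item in order
        for other in order
        if other != item
    )
    return counts.most_common(1)[0][0]

def test_stub_dataset_has_a_single_correct_answer():
    # A small stub dataset shaped like the customer's real orders.
    orders = [
        {"router", "cable", "switch"},
        {"router", "cable"},
        {"laptop", "mouse"},
    ]
    # "What do people buy with a router?" is fuzzy in general,
    # but on this specific dataset it has exactly one correct answer.
    assert top_cooccurring_item(orders, "router") == "cable"
```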

krosenvold
A: 

Really, the challenge here is this: because your application is meant to do a fuzzy, non-deterministic kind of task in a smart way, the very goal you hope to achieve is that the application becomes better than human beings at finding these patterns. That's great, powerful, and cool ... but if you pull it off, then it becomes very hard for any human beings to say, "In this case, the answer should be X."

In fact, ideally the computer would say, "Not really. I see why you think that, but consider these 4.2 terabytes of information over here. Have you read them yet? Based on those, I would argue that the answer should be Z."

And if you really succeeded in your original goal, the end user might sometimes say, "Zowie, you're right. That is a better answer. You found a pattern that is going to make us money! (or save us money, or whatever)."

If such a thing could never happen, then why are you asking the computer to detect these kinds of patterns in the first place?

So, the best thing I can think of is to let real life help you build up a list of test scenarios. If there ever was a pattern discovered in the past that did turn out to be valuable, then make a "unit test" that sees if your system discovers it when given similar data. I say "unit test" in quotes because it may be more like an integration test, but you may still choose to use NUnit or VS.Net or RSpec or whatever unit test tools you're using.
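A sketch of such a "unit test" in Python (the pattern name, records, and mine_patterns stand-in are all invented): freeze a snapshot of the data in which the valuable pattern was first discovered, and assert that the system still finds it.

```python
import unittest

def mine_patterns(records):
    """Stand-in for the real miner; returns the set of patterns it detects."""
    findings = set()
    churned = [r for r in records if r["churned"]]
    if churned and all(r["support_tickets"] >= 3 for r in churned):
        findings.add("high_ticket_count_predicts_churn")
    return findings

class DiscoveredPatternRegressionTest(unittest.TestCase):
    def test_known_valuable_pattern_is_still_found(self):
        # Frozen snapshot of the data in which the pattern was first
        # discovered (in reality, loaded from a fixture file or test database).
        records = [
            {"support_tickets": 5, "churned": True},
            {"support_tickets": 4, "churned": True},
            {"support_tickets": 0, "churned": False},
        ]
        self.assertIn("high_ticket_count_predicts_churn", mine_patterns(records))
```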

For some of these tests, you might somehow try to "mock" the 4.2 terabytes of data (you won't really mock the data, but at some higher level you'd mock some of the conclusions reached from that data). For others, maybe you have a "test database" with some data in it, from which you expect a set of patterns to be detected.

Also, if you can do it, it would be great if the system could "describe its reasoning" behind the patterns it detects. This would let the business user deliberate over the question of whether the application was right or not.

Charlie Flowers
A: 

This is tricky. It sounds similar to writing tests around our text search engine. If you keep chipping away at it, you'll figure something out:

  • Start with a small, simplified but reasonably representative data sample, and test basic behavior against it
  • Rather than asserting that the output exactly matches some answer, it's sometimes better to figure out what is important about it. For example, for our search engine, I didn't care so much about the exact order in which the documents were listed, as long as the three key ones were on the first page of results (see the sketch after this list).
  • As you make a small, incremental change, figure out what its essence is and write a test for that. Even though the overall calculation takes many inputs, individual changes to the codebase should be isolatable. For example, we found certain documents weren't being surfaced because of hyphens in some of the key words, so we created tests verifying that they behaved the way we expected.
  • Look at tools like FitNesse, which let you throw a large number of datasets at a piece of code and assert things about the results. This may be easier to understand than more traditional unit tests.
  • I've gone back to the product owner, saying "I can't understand how this will work. How will we know if it's right?" Maybe s/he can articulate the essence of the vaguely defined problem. This has worked really well for me many times, and I've talked people out of features because they couldn't be explained.
  • Be creative!
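To make the second bullet concrete, here is a sketch in Python (the search function and documents are made up): the test pins down the property that matters, that the three key documents reach the first page, without freezing the exact ranking.

```python
def search(query, documents):
    """Stand-in engine: rank documents by how many query words they contain."""
    words = set(query.lower().split())
    scored = [(sum(w in doc.lower() for w in words), doc) for doc in documents]
    return [doc for score, doc in sorted(scored, key=lambda s: -s[0]) if score > 0]

def test_key_documents_reach_the_first_page():
    documents = [
        "tax form 1040 instructions",
        "tax withholding guide",
        "how to file your tax return",
        "office party photos",
    ]
    results = search("tax return", documents)
    # The exact order is allowed to vary; what matters is that the
    # three key documents all make the (three-slot) first page.
    assert set(results[:3]) == {
        "tax form 1040 instructions",
        "tax withholding guide",
        "how to file your tax return",
    }
```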
ndp