Imagine that you have an internally controlled list of vendors. Now imagine that you want to match unstructured strings against that list. Most will be easy to match, but some may be all but impossible. The algorithm will assign a confidence to each match, but a human needs to confirm all matches produced.

How could this algorithm be unit tested? The only idea I have had so far is to take a sample of pairs matched by humans and make sure the algorithm is able to successfully match those, omitting strings that I couldn't reasonably expect our algorithm to handle. Is there a better way?
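
For concreteness, a test along those lines might look something like the sketch below; the vendor list, the sample pairs, and the difflib-based stand-in matcher are all placeholders for our real algorithm and data.

    import difflib
    import unittest

    VENDOR_LIST = ["Acme Corporation", "IBM Global Services", "Initech"]

    # Human-verified pairs, with the strings we can't reasonably expect
    # the algorithm to handle left out.
    HUMAN_VERIFIED = [
        ("Acme Corp.", "Acme Corporation"),
        ("IBM Global Svcs", "IBM Global Services"),
    ]

    def match_vendor(raw, vendors):
        """Stand-in matcher for illustration only: the vendor with the
        highest difflib similarity ratio wins, and that ratio doubles
        as the confidence score."""
        confidence, vendor = max(
            (difflib.SequenceMatcher(None, raw.lower(), v.lower()).ratio(), v)
            for v in vendors
        )
        return vendor, confidence

    class KnownMatchTests(unittest.TestCase):
        def test_human_verified_pairs_are_matched(self):
            for raw, expected in HUMAN_VERIFIED:
                vendor, _confidence = match_vendor(raw, VENDOR_LIST)
                self.assertEqual(expected, vendor)

    if __name__ == "__main__":
        unittest.main()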

+1  A: 

You can also test that the confidence for strings your algorithm won't handle well is sufficiently low. That way you can also see whether there is a threshold above which you can trust your algorithm.
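
A sketch of what such a test might look like, assuming a match_vendor function that returns a (vendor, confidence) pair and a trust threshold of your own choosing:

    import unittest

    # Hypothetical module wrapping your matching algorithm and vendor list.
    from vendor_matching import match_vendor, VENDOR_LIST

    # Strings the algorithm is known to handle poorly.
    HARD_STRINGS = ["misc vendor 123", "see attached invoice", "???"]
    TRUST_THRESHOLD = 0.6  # illustrative value; tune to your data

    class ConfidenceThresholdTests(unittest.TestCase):
        def test_hard_strings_stay_below_threshold(self):
            for raw in HARD_STRINGS:
                _vendor, confidence = match_vendor(raw, VENDOR_LIST)
                self.assertLess(confidence, TRUST_THRESHOLD,
                                "unexpectedly confident about %r" % raw)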

Janusz
That's an excellent idea. Confidence numbers should be covered in the tests (I hadn't thought of that), and this sounds like a good way to do it.
Stuart Branham
A: 

I don't think there's a better way than what you describe; effectively, you're just using a set of predefined data to test that the algorithm does what you expect. For any complicated algorithm with highly nonlinear inputs and outputs, that's about the best you can do: choose a good test set, and ensure that you run properly against that set of known values. If other values come up which need to be tested in the future, you can add them to the set of tested values.

McWafflestix
A: 

That sounds fair. If it's possible (given time constraints), get as large a sample of human matches as you can; that would give you a picture of how well your algorithm is doing. You could design specific unit tests that pass if the algorithm is within X% of correctness.
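
For example (the 90% figure, the sample data, and the match_vendor interface below are only assumptions):

    import unittest

    # Hypothetical module wrapping your matching algorithm and vendor list.
    from vendor_matching import match_vendor, VENDOR_LIST

    # In practice, load the full human-matched sample from wherever it lives.
    HUMAN_SAMPLE = [
        ("Acme Corp.", "Acme Corporation"),
        ("IBM Global Svcs", "IBM Global Services"),
    ]
    REQUIRED_ACCURACY = 0.90

    class AccuracyTests(unittest.TestCase):
        def test_accuracy_against_human_sample(self):
            correct = sum(1 for raw, expected in HUMAN_SAMPLE
                          if match_vendor(raw, VENDOR_LIST)[0] == expected)
            self.assertGreaterEqual(correct / len(HUMAN_SAMPLE),
                                    REQUIRED_ACCURACY)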

Best of luck.

Jon
+1  A: 

An interesting exercise would be to store the human answers that correct your algorithm and see whether you can improve the algorithm so it no longer gets them wrong.

If you can, add the new matches to the unit tests.
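
One simple way to collect those corrections, assuming nothing about your storage beyond a plain CSV file:

    import csv

    def record_correction(raw_string, algorithm_vendor, human_vendor,
                          path="corrections.csv"):
        """Append a human override so it can become a unit-test case later."""
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow([raw_string, algorithm_vendor, human_vendor])

    def load_corrections(path="corrections.csv"):
        """Yield (raw_string, expected_vendor) pairs for the test suite."""
        with open(path, newline="") as f:
            for raw_string, _rejected, human_vendor in csv.reader(f):
                yield raw_string, human_vendor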

Hans Malherbe
That would be a fun exercise, actually. Unfortunately I don't think time constraints will allow, but it's something to "put on the list" nonetheless.
Stuart Branham
A: 

I think there are two issues here: whether your code behaves according to the algorithm, and how successful the algorithm itself is (i.e. it does not accept answers which a human later rejects, and does not reject answers a human would accept).

Issue 1 is regular testing. For issue 2, I would go with previous result sets (i.e. compare the algorithm's results to human ones).

Avi
A: 

What you describe is the best way, because which match is best is subjective; only a human can come up with the appropriate test cases.

rado
+3  A: 

I'd try some 'canonical' pairs, both "should match" and "shouldn't match" pairs, and test only whether the confidence is above (or below) a given threshold.

Maybe you can also do some ordering checks, such as "no pair should have greater confidence than the exact-match pair", or "a pair that matches all the consonants should score >= one that matches only the vowels".
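
Something like this, perhaps; the thresholds, the sample strings, and the match_vendor interface are placeholders:

    import unittest

    # Hypothetical module wrapping your matching algorithm and vendor list.
    from vendor_matching import match_vendor, VENDOR_LIST

    SHOULD_MATCH = [("Acme Corp.", "Acme Corporation")]
    SHOULD_NOT_MATCH = ["completely unrelated text"]

    class CanonicalPairTests(unittest.TestCase):
        def test_should_match_pairs_clear_the_threshold(self):
            for raw, expected in SHOULD_MATCH:
                vendor, confidence = match_vendor(raw, VENDOR_LIST)
                self.assertEqual(expected, vendor)
                self.assertGreaterEqual(confidence, 0.8)

        def test_should_not_match_strings_stay_low(self):
            for raw in SHOULD_NOT_MATCH:
                _vendor, confidence = match_vendor(raw, VENDOR_LIST)
                self.assertLess(confidence, 0.3)

        def test_nothing_beats_an_exact_match(self):
            # Ordering check: the exact-match pair should have the
            # greatest confidence of all.
            _, exact = match_vendor("Acme Corporation", VENDOR_LIST)
            _, fuzzy = match_vendor("Acme Corp.", VENDOR_LIST)
            self.assertGreaterEqual(exact, fuzzy)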

Javier
Testing failures would be a good idea as well, at least to make sure the algorithm's confidence isn't arbitrary. As you pointed out, discarding results that fall within a middle range (the 25th-75th percentile, maybe) would also be worthwhile.
Stuart Branham
A: 

It sounds as though you are describing an algorithm which is deterministic, but one which is sufficiently difficult that your best initial guess at the correct result is going to be whatever your current implementation delivers to you (i.e. a deterministic implementation to satisfy fuzzy requirements).

For those sorts of circumstances, I will use a "Guru Checks Changes" pattern. Generate a collection of inputs, record the outputs, and in subsequent runs of the unit tests, verify that the outcome is consistent with the previous results. Not so great for ensuring that the target algorithm is implemented correctly, but it is effective for ensuring that the most recent refactoring hasn't changed the behavior in the test space.

A variation of this, which may be more palatable for your circumstance, is to start from the same initial data collection, but rather than trying to preserve precisely the same result every time, you predefine some confidence buckets and flag any implementation change that moves a test result from one bucket to another.
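
A rough sketch of that bucketed variation, where the bucket edges, the baseline file, and the match_vendor interface are all assumptions:

    import json
    import unittest

    # Hypothetical module wrapping your matching algorithm and vendor list.
    from vendor_matching import match_vendor, VENDOR_LIST

    BUCKET_EDGES = [0.0, 0.25, 0.5, 0.75, 1.01]

    def bucket_of(confidence):
        """Map a confidence score to the index of its predefined bucket."""
        for i in range(len(BUCKET_EDGES) - 1):
            if BUCKET_EDGES[i] <= confidence < BUCKET_EDGES[i + 1]:
                return i
        raise ValueError("confidence out of range: %r" % confidence)

    class BucketRegressionTests(unittest.TestCase):
        def test_confidence_buckets_are_stable(self):
            # baseline.json maps each recorded input to the bucket observed
            # on the last run a human signed off on.
            with open("baseline.json") as f:
                baseline = json.load(f)
            for raw, expected_bucket in baseline.items():
                _vendor, confidence = match_vendor(raw, VENDOR_LIST)
                self.assertEqual(expected_bucket, bucket_of(confidence),
                                 "confidence bucket changed for %r" % raw)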

Samples that have clearly correct answers (exact matches, null matches, high-value corner cases) should be kept in a separate test.

VoiceOfUnreason
What if I enhance the algorithm somehow to produce better results? What if I do that *and* refactor? It's not a bad idea, but in my case I'm not sure I would be able to trust that test.
Stuart Branham