It's well known that Bayesian classifiers are an effective way to filter spam. They can be fairly concise (ours is only a few hundred LoC), but all the core code needs to be written up-front before you get any results at all.

However, the TDD approach mandates that you write only the minimum amount of code needed to pass a test. So, given the following method signature:

bool IsSpam(string text)

And the following string of text, which is clearly spam:

"Cheap generic viagra"

The minimum amount of code I could write is:

bool IsSpam(string text)
{
    return text == "Cheap generic viagra";
}

Now maybe I add another test message, e.g.

"Online viagra pharmacy"

I could change the code to:

bool IsSpam(string text)
{
    return text.Contains("viagra");
}

...and so on, and so on, until at some point the code becomes a mess of string checks, regular expressions, etc., because we've evolved it instead of thinking about the problem and writing the implementation differently from the start.

So how is TDD supposed to work in this type of situation, where evolving the code from the simplest thing that could pass the tests is not the right approach? (Particularly when it is known in advance that the best implementations cannot be trivially evolved.)

+2  A: 

Here's my take: Test Driven Development means writing tests before coding. This does not mean that each unit of code for which you write a test needs to be trivial.

Furthermore, you still need to plan your software to do its tasks in a sensible and effective way. Simply adding more and more string checks doesn't seem to be the best design for this problem.

So, in short: you write the code starting from the smallest piece of functionality possible (and test it), but you don't design your algorithm (in pseudocode or however you like to do it) that way.

It would be interesting to see whether you and others agree.

Ankur
+2  A: 

Begin by writing tests for lower level parts of the spam filter algorithm.

First you need to have in your mind a rough design of how the algorithm should work. Then you isolate a core part of the algorithm and write tests for it. In the case of a spam filter, that might be calculating some simple probability using Bayes' theorem (I don't know much about Bayesian classifiers, so I could be wrong). You build it bottom-up, step by step, until finally you have all the parts of the algorithm implemented, and putting them together is simple.
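
For example, a first low-level test might look something like this sketch (Classifier, Train, and TokenProbability are invented names, and an NUnit-style test framework is assumed):

using NUnit.Framework;

[TestFixture]
public class TokenProbabilityTests
{
    [Test]
    public void TokenSeenMostlyInSpam_ScoresAboveOneHalf()
    {
        // Hypothetical API: seed the classifier with token counts directly,
        // so the test exercises only the probability calculation.
        var classifier = new Classifier();
        classifier.Train("viagra", spamCount: 9, hamCount: 1);

        double p = classifier.TokenProbability("viagra");

        Assert.That(p, Is.GreaterThan(0.5));
        Assert.That(p, Is.LessThanOrEqualTo(1.0));
    }
}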

It requires lots of practice to know which tests to write in which order, so that you can do TDD in small enough steps. If you need to write much more than 10 lines of code to pass one new test, you are probably doing something wrong. Start from something smaller, or mock some of the dependencies. It's safer to err on the small side, so that the steps are too small and your progress is slow, than to attempt steps that are too big and fail badly.

The "Cheap generic viagra" example that you have might be better suited for an acceptance test. It will probably even run very slowly, because you first need to initialize the spam filter with example data, so it won't be useful as a TDD test. TDD tests need to be FIRST (F = Fast, as in many hundreds or thousands tests per second).

Esko Luontola
+1 Lower-level parts get tested first.
S.Lott
A: 

For me, what you call the minimum amount of code to pass a test is the whole IsSpam() function. This is consistent with its size (you say it's only a few hundred LoC).

Alternatively, the incremental approach does not claim that you should code first and think afterwards. You can design a solution, code it, and then refine the design with special cases or a better algorithm.

Anyway, refactoring does not consist simply of adding new stuff on top of old. For me it is a more destructive approach, where you throw away the old code for a simple feature and replace it with new code for a refined, more elaborate feature.

mouviciel
A: 

You have your unit tests, right?

That means that you can now refactor the code or even rewrite it and use the unit tests to see if you broke something.

First make it work, then make it clean -- It's time for the second step :)

Lennaert
A: 

(1) You cannot say that a string "is spam" or "is not spam" in the same way you would say whether a number is prime. It is not black and white.

(2) It is incorrect, and certainly not the aim of TDD, to write string-processing functions using just the very examples used in the tests. Examples should represent a whole kind of values. TDD does not protect against stupid implementations, so you shouldn't pretend that you have no clue at all, and you shouldn't write return text == "Cheap generic viagra".
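
For example, a test can stand for a kind of values rather than one literal string. A sketch, assuming NUnit-style parameterized tests and a hypothetical SpamFilter class exposing the IsSpam function from the question:

using NUnit.Framework;

[TestFixture]
public class SpamClassificationTests
{
    // Each case is one representative of the same kind of value:
    // an obviously spammy phrase. The list is illustrative, not exhaustive.
    [TestCase("Cheap generic viagra")]
    [TestCase("Online viagra pharmacy")]
    [TestCase("VIAGRA!!! best price, order now")]
    public void ObviouslySpammyPhrases_AreClassifiedAsSpam(string text)
    {
        Assert.That(SpamFilter.IsSpam(text), Is.True);
    }
}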

Daniel Daranas
Agree with (1). Re (2): I believe that it *is* the aim of TDD to start writing functionality that way; anything to get the tests green. After a while, you'll notice patterns and you can refactor the code. With the tests in place, you should know instantly whether you've broken anything. If you don't notice any patterns, then either there is no pattern and thus nothing to extract, or (more likely) your test suite is incomplete. Add tests and look again.
Lennaert
The example I used was taken from many of the examples I've seen of how people believe TDD should be done, which do exactly this -- write the minimum amount of code to make the test pass.
Greg Beech
@Greg: For me, even the starting point must make _some_ sense. I wouldn't start implementing bool IsPrime(int number) const with "return number == 17" and claim that "it works for a tested subset of input data". Maybe many people defend this strategy, but then I'll consider adding my opinion as an answer in the controversial opinions question.
Daniel Daranas
@Daniel: I wouldn't have a problem doing it that way: when you add the case '19', you can generalise the answer (perhaps to (number % 2 == 1)). After that, add the cases 2 and/or 9. Doing it that way, you'll eventually end up with the right algorithm.
Lennaert
@Lennaert: I don't see how this would be superior to learning some math and implementing the right algorithm in the first place.
Daniel Daranas
@Daniel - You've essentially arrived back at my question... why would people evolve a solution when it's known that this isn't the best way to solve the problem? So it seems you agree with me that "write the simplest thing that will pass the test" isn't a sensible philosophy, even though that's what "pure" TDD appears to demand.
Greg Beech
@Greg, yes, I disagree with a literal interpretation of "write the simplest thing that will pass the test". I'd put forward an alternative, superior rule, such as: (1.a) "All the code you write should follow reasonable logical assumptions towards achieving the goal specified for the function at hand", (1.b) "This excludes any particularization to very specific values of the input", and (2) "Write the simplest code that satisfies (1) and will pass the test". And even this would only be a general guideline.
Daniel Daranas
@Greg (continued) To be fair, though, the original rule should apply to a situation in which a reasonably chosen, even if minimal, set of tests has been written. That is not the case with your original example.
Daniel Daranas
A: 

It seems to me that, with a Bayesian spam filter, you should be using existing methods. In particular, you would be using Bayes' theorem, and probably some other probability theory.

In that case, it seems the best approach is to decide on your algorithm based on those methods, which should either be tried and tested or possibly experimental. Then your unit tests should be designed to test whether IsSpam correctly implements the algorithm you decided on, along with a basic test that the result is between 0 and 1.
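
For example, a unit test could pin one step of the algorithm to a hand-computed value. The sketch below assumes an invented BayesMath.Combine function implementing the two-probability Bayesian combination p1*p2 / (p1*p2 + (1-p1)*(1-p2)):

using NUnit.Framework;

[TestFixture]
public class BayesMathTests
{
    [Test]
    public void CombinedProbability_MatchesHandComputedValue()
    {
        // For p1 = 0.9 and p2 = 0.8:
        // 0.9 * 0.8 / (0.9 * 0.8 + 0.1 * 0.2) = 0.72 / 0.74 ≈ 0.972973
        double p = BayesMath.Combine(0.9, 0.8);

        Assert.That(p, Is.EqualTo(0.972973).Within(1e-6));
    }

    [Test]
    public void CombinedProbability_StaysBetweenZeroAndOne()
    {
        Assert.That(BayesMath.Combine(0.01, 0.99), Is.InRange(0.0, 1.0));
    }
}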

The point is that your unit tests aren't designed to test whether your algorithm is sensible. You should either know that already, or perhaps your program is designed as an experiment to see whether it is.

That's not to say the performance of the IsSpam function isn't important, but it doesn't have to be part of the unit testing. The data could come from feedback from alpha testing, new theoretical results, or your own experiments. In that case, a new algorithm may be needed, and new unit tests with it.

See also this question about testing random number generators.

Silverfish
A: 

The problem here is not with test-driven development but with your tests. If you start out developing code against a single test, then all your test is doing is specifying a string-checking function.

The main idea of TDD is to think about your tests before writing code. You can't exhaustively test a spam filter, but you could come up with a reasonable approximation using tens or hundreds of thousands of test documents. With that many tests in place, the naive Bayes algorithm is a simpler solution than a hundred-thousand-line switch statement.

In reality, you may not be able to pass 100% of your unit tests, so you just have to try to pass as many as possible. You also have to make sure your tests are sufficiently realistic. If you think about it this way, test-driven development and machine learning have a lot in common.
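
A sketch of what such a corpus-level test might look like (the labeled examples, the SpamFilter class, and the 95% threshold are all made up for illustration):

using System.Linq;
using NUnit.Framework;

[TestFixture]
public class CorpusTests
{
    // In a real suite this would be loaded from thousands of labeled documents.
    private static readonly (string Text, bool IsSpam)[] Corpus =
    {
        ("Cheap generic viagra", true),
        ("Online viagra pharmacy", true),
        ("Minutes from Tuesday's meeting", false),
        ("Your invoice for March is attached", false),
    };

    [Test]
    public void Classifier_MeetsMinimumAccuracyOnLabeledCorpus()
    {
        int correct = Corpus.Count(doc => SpamFilter.IsSpam(doc.Text) == doc.IsSpam);
        double accuracy = (double)correct / Corpus.Length;

        Assert.That(accuracy, Is.GreaterThanOrEqualTo(0.95));
    }
}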

StompChicken
A: 

The problem you are describing is theoretical: that by adding cruft in response to tests you will end up with a big, messy ball of mud. The thing you are missing is very important.

The cycle is: Red --> Green --> Refactor

You don't just bounce between red and green. As soon as you have tests passing (green) you refactor the production code and the tests. Then you write the next failing test (red).

If you are refactoring, then you are eliminating duplication, messiness, and slop as the code grows. You will quickly get to the point of extracting methods, building scoring and rating, and probably bringing in external tools. You will do each of these as soon as it is the simplest thing that will work.
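
As a sketch of one such refactoring step (the names and the 0.5 threshold are invented for illustration), the literal string checks get pulled behind a scoring seam that a real classifier can later grow behind:

bool IsSpam(string text)
{
    // The public behavior stays the same, so the existing tests stay green.
    return SpamScore(text) > 0.5;
}

// Starts as a thin wrapper over the existing string checks; the Bayesian
// implementation can later replace this body without touching IsSpam.
double SpamScore(string text)
{
    return text.Contains("viagra") ? 1.0 : 0.0;
}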

Don't just bounce between red and green, or all your code will be muck. That refactoring step is not optional or discretionary. It is essential.

tottinge
A: 

I don't think checking whether a particular string is spam is really a unit test; it's more of a customer test. There's an important difference, as it's not really a red/green type of thing. In actuality you should probably have a couple hundred test documents. Initially some will be classified as spam, and as you improve the product the classifications will more closely match what you want.

So you should make a custom app to load a bunch of test documents, classify them, and then evaluate the scoring overall. When you're done with that customer test, the score will be very bad, since you haven't implemented an algorithm yet. But you now have a means to measure progress going forward, and this is pretty valuable given the amount of learning/changes/experimentation you can expect.

As you implement your algorithm (and, at first, even the customer test itself), you can still do TDD with real unit tests. The first test for the Bayesian filter component won't measure whether a particular string evaluates as spam, but whether a string is passed through the Bayesian filter component appropriately. Your next tests will then focus on how the Bayesian filter is implemented (structuring nodes correctly, applying training data, etc.).
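
A sketch of such a component-level test (BayesianFilter, Train, SpamCount, and HamCount are invented names):

using NUnit.Framework;

[TestFixture]
public class BayesianFilterComponentTests
{
    [Test]
    public void Train_RecordsTokenCountsUnderTheRightLabel()
    {
        var filter = new BayesianFilter();

        filter.Train("cheap viagra pharmacy", isSpam: true);

        // Checks that training data is applied to the component,
        // not whether any particular string "is spam".
        Assert.That(filter.SpamCount("viagra"), Is.EqualTo(1));
        Assert.That(filter.HamCount("viagra"), Is.EqualTo(0));
    }
}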

You do need a vision of where the product is going, and your tests and implementation should be directed towards that vision. You also can't just add customer tests blindly; you need to add them with the overall product vision in mind. Any software development goal will have good tests and bad tests you can write.

Frank Schwieterman