I would like to implement a naive Bayes classifier for spam filtering from scratch as a learning exercise. Which of the following languages would be the best one to try this in?

  1. Java
  2. Ruby
  3. C++
  4. C
  5. something else

Please give reasons (it would help greatly!)

+1  A: 

I would do it in C#, but that's only because it's the language I'm most familiar with at the moment, and because I know it has strong string handling. It can also be done in C++ with std::string, or in Ruby, Java, etc.

If I were building a naive Bayes classifier, I'd start with a simple example, like the one in Russell & Norvig's book (the one I learned from, back in the second edition) or the one in Mitchell's book (I used his because he taught the class). Make your learner generate rules in a general fashion; that is, given input data, produce output rules, and keep the input data generic (it could be a block of text for spam detection, or a weather report used to predict whether someone is going to play tennis).

If you're trying to learn Bayes classifiers, a simple example like this is better to start with than a full-blown spam filter. Language parsing is hard in and of itself, and then determining whether or not there's garbage language is also difficult. Better to have a simple, small dataset, one where you can derive how your learner should learn and make sure that your program matches what you want it to do. Then, you can grow your dataset, or modify your program to incorporate things like language parsing.
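
To make the "small dataset first" idea concrete, here is a minimal sketch in Java of a naive Bayes learner over categorical features. It is not taken from either book: the class name `TinyNaiveBayes`, the six weather rows in `demoModel`, and the choice of add-one smoothing over the values seen in training are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy naive Bayes over categorical features (hypothetical example, not from a textbook). */
final class TinyNaiveBayes {
    private final Map<String, Integer> labelCounts = new HashMap<>();
    // Key "label|featureIndex|value" -> how often that value appeared with that label.
    private final Map<String, Integer> featureCounts = new HashMap<>();
    // Distinct values seen for each feature index, used by the smoothing term.
    private final Map<Integer, Set<String>> valuesPerFeature = new HashMap<>();
    private int total = 0;

    /** Records one labelled example. */
    void train(List<String> features, String label) {
        labelCounts.merge(label, 1, Integer::sum);
        total++;
        for (int i = 0; i < features.size(); i++) {
            String value = features.get(i);
            featureCounts.merge(key(label, i, value), 1, Integer::sum);
            valuesPerFeature.computeIfAbsent(i, k -> new HashSet<>()).add(value);
        }
    }

    /** Returns the label with the highest (log) posterior score, or null if untrained. */
    String classify(List<String> features) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Integer> entry : labelCounts.entrySet()) {
            String label = entry.getKey();
            int n = entry.getValue();
            double score = Math.log((double) n / total); // log prior
            for (int i = 0; i < features.size(); i++) {
                int count = featureCounts.getOrDefault(key(label, i, features.get(i)), 0);
                int distinct = valuesPerFeature.getOrDefault(i, Set.of()).size();
                // Add-one smoothing so an unseen value does not zero out the whole score.
                score += Math.log((count + 1.0) / (n + Math.max(distinct, 1)));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    private static String key(String label, int index, String value) {
        return label + "|" + index + "|" + value;
    }

    /** Trains on six made-up (outlook, temperature) -> play rows. */
    static TinyNaiveBayes demoModel() {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train(List.of("sunny", "hot"), "no");
        nb.train(List.of("sunny", "mild"), "no");
        nb.train(List.of("rainy", "cool"), "no");
        nb.train(List.of("overcast", "hot"), "yes");
        nb.train(List.of("overcast", "mild"), "yes");
        nb.train(List.of("rainy", "mild"), "yes");
        return nb;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = demoModel();
        System.out.println(nb.classify(List.of("overcast", "mild")));
        System.out.println(nb.classify(List.of("sunny", "hot")));
    }
}
```

With a dataset this small you can work the probabilities out by hand (for the first query the smoothed scores are 0.125 for "yes" against roughly 0.028 for "no") and check that the program agrees before moving on to real text.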

mmr
Thanks for all the resources/tips, the Norvig book in particular looks rather good! I am going to be using a small dataset to begin with (probably a subset of the publicly available UCI spambase set). So, as you mentioned, I'm not going to try to do everything, just the classification part based on pre-made data. I suppose my main issue, though, is that I'm not sure whether any language is particularly good or bad for implementing this. I'm most comfortable in Java; is there any reason to lean towards another language, or should I just go with what I'm most comfortable with too?
kita
I'd stay with what you're most comfortable with. For me, at least, these concepts were very non-intuitive, so removing as many barriers to understanding as possible helped. Because I didn't have to learn the ins and outs of a new language, I could focus on the actual Bayes rather than on figuring out the specifics of char*s vs strings.
mmr
Excellent, thanks! Java it is for now then :)
kita
+1  A: 

Moving from Bayesian classifiers to programming languages, I'll leave out "something else" as being too broad, and having no patently superior candidates. Of the four you list, I'd avoid C and C++ because who wants to deal with memory management, especially when you're learning? Normally I'd be tempted toward Java because of the static type system, and if you're a beginner I think that's still the safest bet. But Ruby is also a sensible choice because you can prototype new ideas and new examples very rapidly.

I have worked on and maintain a version of a rather powerful Bayesian classifier for reading email. It is written in a mixture of Lua and C. It's highly performant, but one of the things I really regret about the design is that there is very little abstraction built into the code. I definitely recommend building abstractions into the code for things like

  • Feature extraction

  • Frequency counting

  • The representation of probability

Java makes it really easy to enforce these kinds of abstraction barriers, although Ruby can do it too.
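
As a rough sketch of what those three barriers could look like in Java: every name below (`FeatureExtractor`, `FrequencyCounter`, `Probability`, and the two small implementations) is hypothetical and invented for this example; none of it comes from the Lua/C classifier mentioned above. Storing the probability as a natural logarithm is just one possible representation, chosen here so that callers never see how the number is kept.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Small demo wiring the three hypothetical abstractions together. */
final class AbstractionDemo {
    public static void main(String[] args) {
        FeatureExtractor<String> extractor = new WhitespaceExtractor();
        FrequencyCounter counter = new MapCounter();
        for (String feature : extractor.extract("Buy cheap pills cheap")) {
            counter.record(feature, "spam");
        }
        System.out.println(counter.count("cheap", "spam"));
        System.out.println(Probability.of(0.5).times(Probability.of(0.25)).value());
    }
}

/** Feature extraction: turns a raw input into a list of features. */
interface FeatureExtractor<I> {
    List<String> extract(I input);
}

/** Frequency counting: how often a feature was seen with a label. */
interface FrequencyCounter {
    void record(String feature, String label);
    int count(String feature, String label);
}

/** Representation of probability: callers cannot tell it is stored as a log. */
final class Probability {
    private final double logValue;

    private Probability(double logValue) {
        this.logValue = logValue;
    }

    static Probability of(double p) {
        return new Probability(Math.log(p));
    }

    /** Multiplying probabilities is adding their logarithms. */
    Probability times(Probability other) {
        return new Probability(logValue + other.logValue);
    }

    double value() {
        return Math.exp(logValue);
    }
}

/** Simplest possible extractor: lower-cased, whitespace-separated tokens. */
final class WhitespaceExtractor implements FeatureExtractor<String> {
    @Override
    public List<String> extract(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}

/** In-memory counter keyed by label and feature. */
final class MapCounter implements FrequencyCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void record(String feature, String label) {
        counts.merge(label + "\t" + feature, 1, Integer::sum);
    }

    @Override
    public int count(String feature, String label) {
        return counts.getOrDefault(label + "\t" + feature, 0);
    }
}
```

The point of the split is that each piece can be swapped on its own: a smarter tokenizer, a disk-backed counter, or a different number representation, without touching the other two.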

One of the things my colleague Fidelis Assis found is that standard floating-point numbers are not good for representing very small probabilities. We do a fair amount with logarithms of probabilities (where probabilities multiply, the logarithms sum).
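
A small Java illustration of the underflow problem and the logarithm workaround. The per-word probabilities (0.01 and 0.005) and the message length of 500 words are made-up numbers, and this is only a sketch of the general idea, not the code from the classifier described above.

```java
import java.util.Arrays;

/** Compares a direct product of probabilities with a sum of their logarithms. */
final class LogProbDemo {
    /** Multiplies probabilities directly; underflows to 0.0 for long inputs. */
    static double product(double[] probs) {
        double p = 1.0;
        for (double x : probs) {
            p *= x;
        }
        return p;
    }

    /** Sums natural logs instead, which stays comfortably within double range. */
    static double logSum(double[] probs) {
        double sum = 0.0;
        for (double x : probs) {
            sum += Math.log(x);
        }
        return sum;
    }

    /** Helper: an array of n copies of p (hypothetical per-word likelihoods). */
    static double[] repeated(double p, int n) {
        double[] out = new double[n];
        Arrays.fill(out, p);
        return out;
    }

    public static void main(String[] args) {
        double[] givenSpam = repeated(0.01, 500);
        double[] givenHam = repeated(0.005, 500);

        // Both products underflow, so the two classes cannot be compared.
        System.out.println(product(givenSpam) + " vs " + product(givenHam));

        // The log scores remain distinct, and the larger one wins.
        System.out.println(logSum(givenSpam) + " vs " + logSum(givenHam));
    }
}
```

Because the logarithm is monotonic, comparing the summed logs picks the same class that comparing the true products would, so there is no need to convert back to a raw probability just to classify.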

Norman Ramsey