Let's say you have access to an email account with the history of received emails from the last few years (~10k emails), classified into two groups:

  • genuine email
  • spam

How would you approach the task of creating a neural network solution that could be used for spam detection - basically, classifying any email as either spam or not spam?

Let's assume that the email fetching is already in place and we need to focus on the classification part only.

The main points which I would hope to get answered would be:

  1. Which parameters to choose as the input for the NN, and why?
  2. What structure of the NN would most likely work best for such a task?

Also, any resource recommendations or existing implementations (preferably in C#) are more than welcome.

Thank you

EDIT

  • I am set on using neural networks, as the main aspect of the project is to test how the NN approach would work for spam detection
  • It is also a "toy problem", simply to explore the subject of neural networks and spam
  • I should also mention that this is an exercise my nephew came up with, and I was just asked for some advice. He is not a programmer by profession, but he has pretty good programming skills. He simply wants to use it as a way to keep up his programming skills and to explore NNs. His mind is very much set on "spam detection" in this context as well.
+5  A: 

Are you set on doing it with a Neural Network? It sounds like you're set up pretty well to use Bayesian classification, which is outlined well in a couple of essays by Paul Graham (most famously "A Plan for Spam").

The classified history you have access to would make very strong corpora to feed to a Bayesian algorithm; you'd probably end up with quite an effective result.
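
For a sense of what that looks like in code, here is a minimal C# sketch of Bayesian word scoring (C# being the question's preferred language). The class and method names are illustrative, not an existing library; emails are assumed to be pre-tokenized into words, and the combining step is a plain naive-Bayes log-odds sum with the class prior omitted, rather than Graham's exact formula:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch of Bayesian word scoring; illustrative, not production code.
class BayesianFilter
{
    private readonly Dictionary<string, int> spamCounts = new Dictionary<string, int>();
    private readonly Dictionary<string, int> hamCounts = new Dictionary<string, int>();
    private int spamEmails, hamEmails;

    // Count, per class, how many emails each word appears in.
    public void Train(IEnumerable<string> words, bool isSpam)
    {
        var counts = isSpam ? spamCounts : hamCounts;
        if (isSpam) spamEmails++; else hamEmails++;
        foreach (var w in words.Distinct())
            counts[w] = counts.TryGetValue(w, out var c) ? c + 1 : 1;
    }

    // P(spam | word), add-one smoothed so unseen words stay near neutral (0.5).
    private double WordSpamProbability(string word)
    {
        spamCounts.TryGetValue(word, out var s);
        hamCounts.TryGetValue(word, out var h);
        double pSpam = (s + 1.0) / (spamEmails + 2.0);
        double pHam = (h + 1.0) / (hamEmails + 2.0);
        return pSpam / (pSpam + pHam);
    }

    // Combine per-word scores under a naive independence assumption.
    public double ScoreEmail(IEnumerable<string> words)
    {
        double logOdds = 0.0;
        foreach (var w in words.Distinct())
        {
            double p = WordSpamProbability(w);
            logOdds += Math.Log(p) - Math.Log(1.0 - p);
        }
        return 1.0 / (1.0 + Math.Exp(-logOdds)); // map back to a probability
    }
}
```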

Chad Birch
Thanks Chad. Yes, I am set on doing it with a NN; that is a requirement, and it is really to test whether the NN approach would work in this context.
kristof
+2  A: 
  1. You'll basically have an entire problem of feature extraction, of similar scope to designing and training the neural net itself. Where I would start, if I were you, is in slicing and dicing the input text in a large number of ways, each one being a potential feature input along the lines of "this neuron signals 1.0 if 'price' and 'viagra' occur within 3 words of each other", and culling those according to best absolute correlation with spam identification (see the first sketch after this list).
  2. I'd start by taking my best 50 to 200 input feature neurons and hooking them up to a single output neuron (values trained for 1.0 = spam, -1.0 = not spam), i.e. a single-layer perceptron (see the second sketch after this list). I might try a multi-layer backpropagation net if that worked poorly, but wouldn't be holding my breath for great results.
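
To make point 1 concrete, here is a minimal C# sketch of one such co-occurrence feature, plus the correlation measure you could use to cull candidates. Everything here is illustrative (the class name, the choice of Pearson correlation); treat it as one way to do it, not the way:

```csharp
using System;
using System.Linq;

static class Features
{
    // Signals 1.0 if words a and b occur within `window` words of each
    // other anywhere in the email, else 0.0 (the example from point 1).
    public static double CoOccurrence(string[] words, string a, string b, int window = 3)
    {
        for (int i = 0; i < words.Length; i++)
        {
            if (!words[i].Equals(a, StringComparison.OrdinalIgnoreCase)) continue;
            int lo = Math.Max(0, i - window), hi = Math.Min(words.Length - 1, i + window);
            for (int j = lo; j <= hi; j++)
                if (words[j].Equals(b, StringComparison.OrdinalIgnoreCase)) return 1.0;
        }
        return 0.0;
    }

    // Pearson correlation between one candidate feature (one value per email)
    // and the spam labels (e.g. 1.0 / -1.0), for culling by |correlation|.
    public static double Correlation(double[] feature, double[] labels)
    {
        double mf = feature.Average(), ml = labels.Average();
        double cov = 0, vf = 0, vl = 0;
        for (int i = 0; i < feature.Length; i++)
        {
            cov += (feature[i] - mf) * (labels[i] - ml);
            vf += (feature[i] - mf) * (feature[i] - mf);
            vl += (labels[i] - ml) * (labels[i] - ml);
        }
        return cov / Math.Sqrt(vf * vl); // NaN if a feature is constant
    }
}
```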
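
And for point 2, a sketch of a single-layer perceptron with the classic error-driven update rule and the 1.0 / -1.0 labels described above. This is the textbook algorithm, kept deliberately bare; the learning rate and epoch count are arbitrary placeholders:

```csharp
using System;

// Single-layer perceptron: weighted sum of feature inputs, thresholded at 0.
class Perceptron
{
    private readonly double[] weights;
    private double bias;

    public Perceptron(int inputs) => weights = new double[inputs];

    // Returns +1.0 (spam) or -1.0 (not spam).
    public double Predict(double[] x)
    {
        double sum = bias;
        for (int i = 0; i < weights.Length; i++) sum += weights[i] * x[i];
        return sum >= 0 ? 1.0 : -1.0;
    }

    // Classic perceptron rule: adjust weights only on misclassified examples.
    public void Train(double[][] xs, double[] ys, int epochs = 100, double lr = 0.1)
    {
        for (int e = 0; e < epochs; e++)
            for (int n = 0; n < xs.Length; n++)
                if (Predict(xs[n]) != ys[n])
                {
                    for (int i = 0; i < weights.Length; i++)
                        weights[i] += lr * ys[n] * xs[n][i];
                    bias += lr * ys[n];
                }
    }
}
```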

Generally, my experience has led me to believe that neural networks will show mediocre performance at best on this task, and I'd definitely recommend something Bayesian, as Chad Birch suggests, if this is anything other than a toy problem for exploring neural nets.

chaos
Cheers Chaos, good point. I, too, would consider the feature extraction a problem of similar complexity to the NN itself. And yes, it really is a toy problem for exploring neural nets.
kristof
+1  A: 

kristof, the answers you've gotten so far are reasonable, but I'll respond to your update that:

I am set on using neural networks, as the main aspect of the project is to test how the NN approach would work for spam detection.

Well, then you have a problem: an empirical test like this can't prove unsuitability. A poor result only shows that your particular features and architecture failed, not that neural networks in general are unsuited to the task.

You're probably best off learning a bit about what NNs actually do and don't do, to see why they are not a particularly good idea for this sort of classification problem. A helpful way to think about them is as universal function approximators. But for some idea of how this all fits together in the area of classification (which is what the spam filtering problem is), browsing an intro text like Pattern Classification (Duda, Hart & Stork) might be helpful.

Failing that, if you are dead set on seeing it run, just use any general NN library for the network itself. Most of your problem is going to be how to represent the input data anyway. The 'best' structure is non-obvious, and it probably doesn't matter that much. The inputs are going to have to be a number of (normalized) measurements (features) taken on the corpus itself. Some are obvious (counts of 'spam' words, etc.; see the sketch below), some much less so. This is the part you can really play around with, but you should expect to do poorly compared to Bayesian filters (which have their own problems here) due to the nature of the problem.
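
As a tiny illustration of the "counts of 'spam' words" kind of input, here is one possible encoding in C#. The trigger-word list is entirely made up for the example; in practice you would derive it from the corpus itself:

```csharp
using System;
using System.Linq;

static class InputEncoding
{
    // Hypothetical trigger words; a real list would come from corpus analysis.
    static readonly string[] Triggers = { "free", "viagra", "winner", "offer", "click" };

    // One normalized input per trigger word: occurrences / total words.
    public static double[] Encode(string email)
    {
        var words = email.ToLowerInvariant()
                         .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return Triggers
            .Select(t => words.Length == 0 ? 0.0
                       : (double)words.Count(w => w == t) / words.Length)
            .ToArray();
    }
}
```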

simon
Thanks Simon, the first thing that came to my mind when I heard about the idea was actually: what sort of parameters could be used for the input? You are also right that this sort of test cannot really prove unsuitability. I should probably add that this is actually just an exercise to play with both NNs and the spam detection problem, for someone who is pretty new to the field of AI - I will update my question to shed some more light on this :)
kristof
+4  A: 

If you insist on NNs... I would calculate some features for every email

Character-based, word-based, and vocabulary features (about 97 in total, as I count them; a partial implementation follows the list):

  1. Total number of characters (C)
  2. Ratio of alphabetic characters to C
  3. Ratio of digit characters to C
  4. Ratio of whitespace characters to C
  5. Frequency of each letter and digit divided by C (36 characters: A-Z, 0-9)
  6. Frequency of each special character (10 characters: * _ + = % $ @ - \ /)
  7. Total number of words (M)
  8. Ratio of short words (two letters or fewer) to M
  9. Ratio of characters in words to C
  10. Average word length
  11. Average sentence length in characters
  12. Average sentence length in words
  13. Word-length frequency distribution: ratio of words of length n to M, for n from 1 to 15
  14. Type-token ratio: number of unique words / M
  15. Hapax legomena: frequency of once-occurring words
  16. Hapax dislegomena: frequency of twice-occurring words
  17. Yule's K measure
  18. Simpson's D measure
  19. Sichel's S measure
  20. Brunet's W measure
  21. Honore's R measure
  22. Frequency of each punctuation character (18 characters: . , ; ? ! : ( ) - " « » < > [ ] { })
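
Here is a C# sketch of how a handful of these measures could be computed; the remaining ones follow the same pattern. The word regex and the scaling of average word length are rough placeholder choices, not part of the original feature definitions:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class StyleFeatures
{
    // Computes features 1-4, 7-8, 10, and 14-16 from the list above.
    public static double[] Compute(string text)
    {
        var words = Regex.Matches(text.ToLowerInvariant(), @"[a-z']+")
                         .Cast<Match>().Select(mt => mt.Value).ToArray();
        if (text.Length == 0 || words.Length == 0) return new double[8]; // degenerate email

        double c = text.Length;                                   // 1: total characters (C)
        double alphaRatio = text.Count(char.IsLetter) / c;        // 2
        double digitRatio = text.Count(char.IsDigit) / c;         // 3
        double spaceRatio = text.Count(char.IsWhiteSpace) / c;    // 4

        double m = words.Length;                                  // 7: total words (M)
        double shortRatio = words.Count(w => w.Length <= 2) / m;  // 8
        double avgWordLen = words.Average(w => w.Length) / 15.0;  // 10, crudely scaled

        var freq = words.GroupBy(w => w).ToDictionary(g => g.Key, g => g.Count());
        double typeToken = freq.Count / m;                        // 14
        double hapax = freq.Count(kv => kv.Value == 1) / m;       // 15
        double dislegomena = freq.Count(kv => kv.Value == 2) / m; // 16

        return new[] { alphaRatio, digitRatio, spaceRatio,
                       shortRatio, avgWordLen, typeToken, hapax, dislegomena };
    }
}
```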

You could also add some features based on the formatting: the colors, fonts, and sizes used, and so on.

Most of these measures can be found online, in papers, or even on Wikipedia (they're all simple calculations, mostly based on the other features).

So with about 100 features, you need 100 inputs, some number of nodes in a hidden layer, and one output node.
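
A sketch of the forward pass for that structure, in C#. Only inference is shown; training (e.g. backpropagation) would come from a library or a later step, and the random initialization is just there to make the sketch runnable:

```csharp
using System;

// Forward pass only: inputs -> one hidden layer -> one sigmoid output.
class FeedForwardNet
{
    private readonly double[,] hiddenWeights; // [hidden, input]
    private readonly double[] hiddenBias;
    private readonly double[] outputWeights;
    private readonly double outputBias;

    public FeedForwardNet(int inputs, int hidden)
    {
        var rng = new Random(42);
        hiddenWeights = new double[hidden, inputs];
        hiddenBias = new double[hidden];
        outputWeights = new double[hidden];
        outputBias = rng.NextDouble() - 0.5;
        for (int h = 0; h < hidden; h++)
        {
            hiddenBias[h] = rng.NextDouble() - 0.5;
            outputWeights[h] = rng.NextDouble() - 0.5;
            for (int i = 0; i < inputs; i++)
                hiddenWeights[h, i] = rng.NextDouble() - 0.5;
        }
    }

    private static double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

    // Returns a spam probability; > 0.5 would be classified as spam.
    public double Predict(double[] x)
    {
        double output = outputBias;
        for (int h = 0; h < hiddenBias.Length; h++)
        {
            double sum = hiddenBias[h];
            for (int i = 0; i < x.Length; i++)
                sum += hiddenWeights[h, i] * x[i];
            output += outputWeights[h] * Sigmoid(sum);
        }
        return Sigmoid(output);
    }
}

// Usage: var net = new FeedForwardNet(97, 20); double p = net.Predict(features);
```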

The inputs would need to be normalized according to your current pre-classified corpus.
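
One simple way to do that is min-max normalization per feature, with the minimum and maximum taken over the pre-classified corpus; a small sketch (z-score scaling would work just as well):

```csharp
using System;
using System.Linq;

static class Normalize
{
    // Rescales each feature column to [0, 1] in place, using the
    // min/max observed across the whole corpus of feature vectors.
    public static void MinMaxInPlace(double[][] rows)
    {
        int cols = rows[0].Length;
        for (int j = 0; j < cols; j++)
        {
            double min = rows.Min(r => r[j]), max = rows.Max(r => r[j]);
            double range = max - min;
            foreach (var r in rows)
                r[j] = range == 0 ? 0.0 : (r[j] - min) / range;
        }
    }
}
```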

I'd split the corpus into two groups and use one as a training set and the other as a test set, never mixing them. Maybe a 50/50 train/test split, with similar spam/non-spam ratios in each group.
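
A sketch of such a stratified 50/50 split in C#; the generic shape and the fixed seed are just illustrative choices:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SplitData
{
    // Shuffles, then splits each class (spam / non-spam) in half, so both
    // the train and test sets keep a similar spam/non-spam ratio.
    public static (T[] train, T[] test) Stratified<T>(
        T[] emails, Func<T, bool> isSpam, int seed = 42)
    {
        var rng = new Random(seed);
        var shuffled = emails.OrderBy(_ => rng.Next()).ToArray();
        var train = new List<T>();
        var test = new List<T>();
        foreach (var group in shuffled.GroupBy(isSpam))
        {
            var items = group.ToArray();
            train.AddRange(items.Take(items.Length / 2));
            test.AddRange(items.Skip(items.Length / 2));
        }
        return (train.ToArray(), test.ToArray());
    }
}
```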

Osama ALASSIRY