views:

191

answers:

1

I am trying to implement a naive Bayesian approach to find the topic of a given document or stream of words. Is there a naive Bayesian approach that I might be able to look up for this?

Also, I am trying to improve my dictionary as I go along. Initially, I have a set of hard-coded words that map to topics. As I encounter words other than the ones already mapped, and depending on how often they occur, I want to add them to the mappings, thereby learning new words that map to topics, and also to update the probabilities of existing words.

How should I go about doing this? Is my approach the right one?

Which programming language would be best suited for the implementation ?

+10  A: 

Existing Implementations of Naive Bayes

You would probably be better off just using one of the existing packages that supports document classification using naive Bayes, e.g.:

Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.

Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.

Perl - Perl has the Algorithm::NaiveBayes module, complete with a sample usage snippet in the package synopsis.

C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.

Java - Java folks have Classifier4J. You can see a training and scoring code snippet here.
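Whichever package you pick, the core technique is the same. Here's a minimal sketch of a multinomial naive Bayes classifier in plain Python (standard library only, add-one smoothing); the function names and the toy labels are my own, not from any of the packages above:

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (list_of_words, label) pairs. Returns count tables."""
    label_counts = Counter()                 # how many docs per label
    word_counts = defaultdict(Counter)       # label -> word -> count
    vocab = set()
    for words, label in docs:
        label_counts[label] += 1
        for w in words:
            word_counts[label][w] += 1
            vocab.add(w)
    return label_counts, word_counts, vocab

def classify(words, label_counts, word_counts, vocab):
    """Pick the label maximizing log P(label) + sum of log P(word|label)."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)   # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            # add-one (Laplace) smoothing so unseen words don't zero out
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Usage on a toy corpus:

```python
docs = [("goal match striker".split(), "sports"),
        ("election vote senate".split(), "politics")]
lc, wc, v = train(docs)
classify("goal striker".split(), lc, wc, v)   # -> "sports"
```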

Bootstrapping Classification from Keywords

It sounds like you want to start with a set of keywords that are known to cue for certain topics and then use those keywords to bootstrap a classifier.

This is a reasonably clever idea. Take a look at the paper Text Classification by Bootstrapping with Keywords, EM and Shrinkage by McCallum and Nigam (1999). By following this approach, they were able to improve classification accuracy from the 45% they got by using hard-coded keywords alone to 66% using a bootstrapped Naive Bayes classifier. For their data, the latter is close to human levels of agreement, as people agreed with each other about document labels 72% of the time.
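A stripped-down sketch of that bootstrapping step (the seed keyword lists here are made-up placeholders, and I've left out the EM iteration and shrinkage from the paper): pseudo-label unlabeled documents by keyword hits, then train a naive Bayes model on the pseudo-labeled set.

```python
import math
from collections import Counter, defaultdict

# Hypothetical seed keywords -- these stand in for your hard-coded mappings.
SEED_KEYWORDS = {
    "sports":   {"match", "team", "striker"},
    "politics": {"vote", "election", "senate"},
}

def pseudo_label(words):
    """Assign the topic with the most seed-keyword hits; None if no hits."""
    hits = {t: sum(w in kws for w in words) for t, kws in SEED_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None

def bootstrap(unlabeled_docs):
    """Pseudo-label docs via keywords, then collect naive Bayes counts."""
    priors, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for words in unlabeled_docs:
        topic = pseudo_label(words)
        if topic is None:
            continue  # the full EM approach would exploit these docs too
        priors[topic] += 1
        for w in words:
            word_counts[topic][w] += 1
            vocab.add(w)
    return priors, word_counts, vocab

def classify(words, priors, word_counts, vocab):
    """Standard smoothed naive Bayes scoring over the bootstrapped counts."""
    total = sum(priors.values())
    def score(t):
        denom = sum(word_counts[t].values()) + len(vocab)
        return math.log(priors[t] / total) + sum(
            math.log((word_counts[t][w] + 1) / denom) for w in words)
    return max(priors, key=score)
```

The payoff is that the trained model now weighs every word seen in the pseudo-labeled documents, not just the seed keywords, so documents containing none of the original keywords can still be classified:

```python
docs = ["the team won the match".split(),
        "senate vote passed today".split(),
        "striker scored in the match".split()]
priors, wc, v = bootstrap(docs)
classify("striker scored".split(), priors, wc, v)
```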

dmcer