Hi all,

We have a client who is looking for a way to import and categorize a large amount of textual data. It's been suggested that the easiest way to do this would be to look at the description field and try to match the words held there to see if a category can be derived for that particular record.

It was thought the best way to do this would be to match the words against keywords held for each category and, if that was unsuccessful, to use some kind of synonym lookup instead. So for example, if a particular record had the word "automobile" in it, a synonym lookup could match that word to the word "car", which would be held against the category "vehicle".
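To make the idea concrete, here is a rough sketch of that two-pass approach. The category keywords and the tiny synonym table are made-up placeholders; in a real system the synonyms would come from some external dictionary or thesaurus, and the function name `categorize` is only for illustration.

```python
import re

# Hypothetical data: keywords held against each category.
CATEGORY_KEYWORDS = {
    "vehicle": {"car", "truck", "bus"},
    "furniture": {"chair", "table", "sofa"},
}

# Hypothetical stand-in for a real synonym source.
SYNONYMS = {
    "automobile": {"car"},
    "couch": {"sofa"},
}


def categorize(description):
    """Return the first category matched directly or via a synonym, else None."""
    words = re.findall(r"[a-z]+", description.lower())

    # Pass 1: direct keyword match.
    for word in words:
        for category, keywords in CATEGORY_KEYWORDS.items():
            if word in keywords:
                return category

    # Pass 2: fall back to the synonym table.
    for word in words:
        for synonym in SYNONYMS.get(word, ()):
            for category, keywords in CATEGORY_KEYWORDS.items():
                if synonym in keywords:
                    return category

    return None


print(categorize("Red automobile, low mileage"))
```

A record that matches neither pass comes back as `None`, so it could be queued for manual categorization.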

Does anyone know of a web service or other means of looking up a dictionary to find synonyms for a particular word? The project manager has suggested buying a Google Enterprise Search license for this but from what I can make out that doesn't offer what these guys are looking for.

Any suggestions of other ways of getting the client what they are looking for would be gratefully accepted.

Cheers.

+2  A: 

The first thing that comes to mind is WordNet. WordNet is a human-generated database of words and related words, including synonyms. The Wikipedia WordNet entry lists several interfaces to WordNet. I believe some of them are web services. You can also roll your own. Manning and Schütze's chapter 5 (free PDF) shows ways to do this.

Having said that, are you solving the right problem? How do you build the category list? Is it a hierarchy? A tag cloud? See Clay Shirky's "Ontology is Overrated" for a critique of hierarchical categories. I believe that synonyms are less important if you base your classification on sets of words (Naive Bayes, for example) rather than on single words.
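As a rough illustration of classifying on sets of words, here is a minimal multinomial Naive Bayes sketch with add-one smoothing, using only the standard library. The class name, the training sentences and the two categories are invented for the example; a real project would train on a labelled sample of the client's records and would probably use an existing library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training docs
        self.vocabulary = set()

    def train(self, text, category):
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, float("-inf")
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocabulary))
                )
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical training records.
model = NaiveBayes()
model.train("used car with new tyres and low mileage", "vehicle")
model.train("automobile engine and gearbox serviced", "vehicle")
model.train("oak dining table with four chairs", "furniture")
model.train("leather sofa and matching armchair", "furniture")

print(model.classify("automobile with low mileage"))
```

Because every word in the description contributes to the score, "automobile" and "car" both end up pointing at the same category once each has appeared in the training data, without an explicit synonym table.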

Yuval F
+1  A: 

You should look at using WordNet. You can visit their website http://wordnet.princeton.edu/ to get more information, and there are libraries available for integrating with it in lots of languages.

To see it in action, go to their online tool here: http://wordnetweb.princeton.edu/perl/webwn. If you look up a word, then click on "S" next to each definition, you'll get a list of words semantically related to that definition.

I also think you should check out software that will allow you to perform "document clustering." Here is an example: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview. That should help you bootstrap the category creation process.

I think this will help get you a long way toward what you want!

Chris Harris
A: 

Thanks guys! I'll look into WordNet.

Do you know of any other types of textual classification software products out there? I see there's some discussion of using Bayesian algorithms for this but I can't see any real world examples of it.

Dave
You can start with: http://stackoverflow.com/questions/1083/bayesian-filtering-for-spam or post a separate question (preferably with a desired language/OS) and I believe you will get plenty of suggestions.
Yuval F
A: 

For text classification you can take a look at Apache Mahout.

Shashikant Kore