views:

105

answers:

3

The problem: Given a set of hand-categorized strings (or a set of ordered vectors of strings), generate a categorize function to categorize more input. In my case, that data (or most of it) is not natural language.

The question: are there any tools out there that will do that? I'm thinking of something reasonably polished, a download-install-and-go kind of thing, as opposed to a library or a brittle academic program.


(Please don't get stuck on details, as the real details would restrict answers to less generally useful responses AND are under NDA.)

As an example of what I'm looking at: the input I want to filter is computer-generated status strings pulled from logs. Error messages, for example, would be filtered based on who needs to be informed or what action needs to be taken.

A: 

Have you tried spam or email filters? By using text files that have been marked with appropriate categories, you should be able to categorize further text input. That's what those programs do, anyway; instead of labeling your outputs as 'spam' and 'not spam', you could use other categories.
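To make the spam-filter analogy concrete, here's a minimal sketch of the same idea applied to log-line categories instead of spam/not-spam. It assumes scikit-learn is acceptable as a library (it isn't named in this thread), and all the log lines and category names below are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical hand-categorized training strings.
messages = [
    "ERR disk full on /dev/sda1",
    "WARN retry limit reached for job 42",
    "ERR disk read failure sector 9912",
    "INFO nightly backup completed",
]
labels = ["ops", "dev", "ops", "ignore"]

# Bag-of-words counts fed into multinomial naive Bayes, the same
# model family most spam filters are built on.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

# Categorize a string the model has never seen.
print(model.predict(["ERR disk write failure on /dev/sdb2"]))
```

The point is just that swapping the two spam-filter labels for N categories of your own is a one-line change to the training data, not a different kind of system.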

You could also try something involving AdaBoost for a more hands-on, roll-your-own approach. This library from Google looks promising, but probably doesn't meet your ready-to-deploy requirement.
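If you do go the AdaBoost route, a rolled-your-own version can be quite short. This sketch uses scikit-learn's AdaBoostClassifier (an assumption; the thread doesn't name a specific implementation) boosting decision stumps over word counts, with invented log lines and team names:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical labeled log lines: who should be notified for each.
lines = [
    "error: segmentation fault in worker",
    "error: null pointer in parser",
    "disk quota exceeded on /home",
    "filesystem mounted read-only",
]
labels = ["dev-team", "sysadmin", "sysadmin", "dev-team"]
labels = ["dev-team", "dev-team", "sysadmin", "sysadmin"]

# AdaBoost combines many weak learners (here, one-split decision
# trees over token counts) into a single stronger classifier.
clf = make_pipeline(CountVectorizer(), AdaBoostClassifier(n_estimators=50))
clf.fit(lines, labels)
```

With real data you'd want far more training examples per category, but the API surface stays this small.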

mmr
Spam tends to be a soft, one-dimensional domain. In my case, however, I'm looking at a hard (there are actual rules, I just don't want to have to reverse engineer them), unordered, multi-way (pick one of N) choice. I guess I should remove the word "decision".
BCS
@BCS -- I just remember that the best deployed text-parsing solutions using Bayes' rule are spam filters. These filters do have to overcome trends in spammer tactics, like quoting literature and the like, as opposed to just finding random incorrect letters. I don't know if it will solve your particular problem; it's just the only example of a robust, deployed text solution I know of. If you find something else, I'd definitely be very happy to know of it.
mmr
+1  A: 

Mallet has a bunch of classifiers which you can train and deploy entirely from the command line.
Weka is nice too because it has a huge number of classifiers and preprocessors for you to play with.

adi92
+2  A: 

Doing Things Manually

If the error messages are being generated automatically and the list of exceptions behind the messages is not terribly large, you might just want to have a table that directly maps each error message type to the people who need to be notified.

This should make it easy to keep track of exactly who (or which groups) will be getting which types of messages, and to update the routing of messages should you decide that some of them are being misdirected.

Typically, a small fraction of the types of errors make up a large fraction of error reports. For example, Microsoft noticed that 80% of crashes were caused by 20% of the bugs in their software. So, to get something useful, you wouldn't even need to start with a complete table covering every type of error message. Instead, you could start with just a list that maps the most common errors to the right person and routes everything else to a person for manual routing. Each time an error is routed manually, you could then add an entry to the routing table so that errors of that type are handled automatically in the future.
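A grow-as-you-go routing table like the one described above needs no machine learning at all. Here's a minimal sketch in Python; the error types and recipients are hypothetical:

```python
# Hypothetical routing table: error type -> who gets notified.
# Anything not in the table falls through to manual triage.
routing = {
    "DISK_FULL": "ops-team",
    "NULL_DEREF": "dev-team",
}

def route(error_type):
    """Return the recipient, or None meaning 'route by hand'."""
    return routing.get(error_type)

def record_manual_decision(error_type, recipient):
    """Grow the table so this error type is automatic next time."""
    routing[error_type] = recipient

# First time a TIMEOUT shows up, a human routes it...
assert route("TIMEOUT") is None
record_manual_decision("TIMEOUT", "network-team")
# ...and from then on it is handled automatically.
assert route("TIMEOUT") == "network-team"
```

Because of the 80/20 skew mentioned above, even a short starting table should catch most of the traffic, and the manual-fallback hook fills in the long tail over time.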

Document Classification

Unless the error messages are being editorialized by the people who submit them and you want to use this information when routing them, I wouldn't recommend treating this as a document classification task. However, if this is what you want to do, here's a list of reasonably good packages for document classification, organized by programming language:

Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.

Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.

C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.

Java - Java folks have Classifier4J, Weka, Lucene Mahout, and, as adi92 mentioned, Mallet.

Learning Rules with Weka - If rules are what you want, Weka might be of particular interest, since it includes a rule-set-based learner. You'll find a tutorial on using Weka for text categorization here.
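To give a feel for the Python option above, here's roughly what the NLTK book's document classification approach looks like when the "documents" are log lines. The training data and labels below are invented, and this assumes the nltk package is installed:

```python
import nltk

# Hypothetical hand-categorized log lines.
train = [
    ("ERR disk full on sda1", "ops"),
    ("ERR disk read failure", "ops"),
    ("WARN retry limit reached for job", "dev"),
    ("WARN queue overflow in scheduler", "dev"),
]

def features(line):
    # Bag-of-words feature dict, in the style of the NLTK book's
    # document classification chapter.
    return {tok: True for tok in line.lower().split()}

classifier = nltk.NaiveBayesClassifier.train(
    [(features(line), label) for line, label in train])

# Classify a previously unseen line.
print(classifier.classify(features("ERR disk write failure")))
```

The same pattern scales to many categories; only the training pairs change.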

dmcer
Re the bit on the partial table + default: I was thinking of a solution along that line, but one that takes it a step further. Rather than have a user define classification rules, they just classify specific examples, and then the machine learning derives the rules from those. This has the advantage that, by hooking in where misclassified data gets reassigned, I can grow the rules without anyone needing to do anything but say "Item A goes to Mr. B".
BCS
+1 for the simple 'no-moving-parts' lookup-table suggestion (which might work, might not -- no way to tell w/o more info).
doug