I'd like to sift through text (in particular, Twitter messages) to see whether it relates to a particular topic. Have you been down that road? If so, I'd love to hear what approach you'd use.

For my case, just searching for topic keywords gets me useful text about 7% of the time; the keywords have multiple meanings, some of which aren't on topic. For my use, automatic filtering doesn't need to be perfect; I'd be happy if the extracted messages related to the topic 80% of the time. I'm also willing to lose 10-30% of the on-topic messages.

Doing a first pass by hand, I see some characteristics that make messages pretty likely to be good, like certain English phrases. Other characteristics make rejection very likely, like URLs, multiple hash tags, and certain other phrases. Others are harder to evaluate.

I could manually make a bunch of regexes and associated weights, and tweak things by hand until I got output I liked. That could well work. But I can name several other possible approaches, and I'm wondering which ones Stack Overflow readers have had good luck with.
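
To give a sense of what I mean, here's a rough Python sketch of that hand-tuned approach; the patterns, weights, and threshold below are just illustrative placeholders, not anything from my data:

    import re

    # Placeholder rules: positive weights for phrases that suggest the topic,
    # negative weights for signals that usually mean a message is off topic.
    RULES = [
        (re.compile(r"\bsome on-topic phrase\b", re.I), 2.0),
        (re.compile(r"https?://"), -1.5),               # URLs tend to be off topic
        (re.compile(r"#\w+.*#\w+"), -1.0),              # two or more hash tags
    ]

    def score(message):
        """Sum the weights of every rule whose pattern matches the message."""
        return sum(weight for pattern, weight in RULES if pattern.search(message))

    def keep(message, threshold=0.5):
        return score(message) >= threshold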

Thanks!

+1  A: 

This is an entire field in itself! I recommend doing some research in the natural language processing literature.

There are ad-hoc ways to do it, but these methods tend to be error prone: many false positives and false negatives. They may be a good start, though.

  1. If you use a keyword, you can attempt to disambiguate its meaning (if it has multiple meanings) by looking at the words around the keyword in question. Doing this disambiguation requires a processed corpus (a bunch of documents) so you can determine which words most frequently appear together, and therefore may relate to the same sense (there's a rough sketch of this after the list).

  2. You could measure the distance between the text you are analyzing and a document that is known to be on topic. You would need to use the word counts from both text sources, and then compare the term/document vectors. Look up "document vector model" for a more thorough treatment (a bare-bones example also follows the list).
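
To make #1 a bit more concrete, here is a rough Python sketch of the neighboring-words idea. It assumes you already have a hand-labeled set of on-topic and off-topic messages; the function names and window size are just illustrative, and it skips normalization for brevity:

    from collections import Counter
    import re

    def context_words(text, keyword, window=3):
        """Words appearing within `window` positions of each occurrence of `keyword`."""
        tokens = re.findall(r"\w+", text.lower())
        context = []
        for i, tok in enumerate(tokens):
            if tok == keyword:
                context.extend(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
        return context

    def build_context_counts(labeled_messages, keyword):
        """labeled_messages: iterable of (text, is_on_topic) pairs from a hand-labeled corpus."""
        on_topic, off_topic = Counter(), Counter()
        for text, is_on_topic in labeled_messages:
            (on_topic if is_on_topic else off_topic).update(context_words(text, keyword))
        return on_topic, off_topic

    def looks_on_topic(text, keyword, on_topic, off_topic):
        """Crude check: does the keyword's context overlap more with on-topic contexts?"""
        ctx = context_words(text, keyword)
        return sum(on_topic[w] for w in ctx) > sum(off_topic[w] for w in ctx)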

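And a bare-bones sketch of #2: build term-count vectors and compare them with cosine similarity. A real system would add tf-idf weighting, stemming, and stop-word removal, but this shows the shape of it:

    import math
    import re
    from collections import Counter

    def term_vector(text):
        """Simple term-frequency vector: word -> count."""
        return Counter(re.findall(r"\w+", text.lower()))

    def cosine_similarity(vec_a, vec_b):
        """Cosine of the angle between two sparse term vectors (1.0 = same direction)."""
        dot = sum(count * vec_b[term] for term, count in vec_a.items())
        norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
        norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Usage: compare an incoming message against text known to be about the topic.
    reference = term_vector("text of a document known to be about the topic ...")
    message = term_vector("text of an incoming Twitter message ...")
    print(cosine_similarity(reference, message))  # closer to 1.0 = more similar
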
This is a good project to work on, but it is not simple.

Ryan Rosario
Thanks, Ryan. I'll take a look at these. Whichever road I go, it seems like having a large number of pre-classified examples will be helpful. So I'm first going to put together a Mechanical Turk job. Regarding option 1, which looks promising, do you have any links or googleable phrases that would lead me in the right direction? "Document vector model" gets me plenty for #2, but I'm having trouble finding more on #1. Thanks again!
William Pietri
For #1, you could search for "word sense disambiguation" as a phrase, with "using neighboring words" as keywords. There are probably better ways to do it than #1 (lexical chains is one), but that's what came to mind quickest. Take a look here: http://www.scholarpedia.org/article/Word_sense_disambiguation
Ryan Rosario