I'd like to filter text (in particular, Twitter messages) by whether it relates to a particular topic. Have you been down that road? If so, I'd love to hear what approach you'd use.
In my case, just searching for topic keywords gets me on-topic text only about 7% of the time; the keywords have multiple meanings, some of which aren't on topic. Automatic filtering doesn't need to be perfect: I'd be happy if the extracted messages related to the topic 80% of the time, and I'm willing to lose 10-30% of the on-topic messages.
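In precision/recall terms, that's roughly a target of 0.8 precision at 0.7 to 0.9 recall. Here's a minimal sketch of how I'd check any candidate filter against a hand-labeled sample; the function and argument names are just illustrative.

```python
def precision_recall(predicted, actual):
    """Compare filter decisions against hand labels.

    predicted: list of booleans, True if the filter kept the message.
    actual:    list of booleans, True if the message is really on topic.
    Both lists are assumed to be parallel (same messages, same order).
    """
    true_positives = sum(1 for p, a in zip(predicted, actual) if p and a)
    kept = sum(1 for p in predicted if p)
    relevant = sum(1 for a in actual if a)

    # Guard against division by zero when nothing was kept or nothing is relevant.
    precision = true_positives / kept if kept else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall
```

A filter would meet my goal if precision came out at 0.8 or better while recall stayed at 0.7 or above.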
From a first pass by hand, I've found some characteristics that make messages pretty likely to be good, like certain English phrases. Other characteristics give a high likelihood of rejection, like URLs, multiple hashtags, and certain other phrases. Still others are harder to evaluate.
I could manually make a bunch of regexes and associated weights, and tweak things by hand until I got output I liked. That could well work. But I can name several other possible approaches, and I'm wondering which ones Stack Overflow readers have had good luck with.
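To make the regex-and-weights idea concrete, here's a minimal Python sketch of what I have in mind. The patterns, weights, and threshold are placeholders I made up for illustration, not tuned values, and the phrase pattern stands in for whatever on-topic phrases apply.

```python
import re

# Hypothetical (pattern, weight) rules. Positive weights suggest on-topic,
# negative weights suggest off-topic. All values here are placeholders.
RULES = [
    # An example "good" English phrase (made up for illustration).
    (re.compile(r"\bi (?:just )?(?:saw|watched)\b", re.IGNORECASE), 2.0),
    # Messages containing a URL tend to be off topic.
    (re.compile(r"https?://\S+"), -1.5),
    # Two or more hashtags anywhere in the message.
    (re.compile(r"#\w+.*#\w+", re.DOTALL), -1.0),
]


def score(message):
    """Sum the weights of every rule whose pattern appears in the message."""
    return sum(weight for pattern, weight in RULES if pattern.search(message))


def is_on_topic(message, threshold=1.0):
    """Keep a message when its total score reaches the (hand-tuned) threshold."""
    return score(message) >= threshold
```

Tweaking by hand would then mean editing RULES and the threshold, rerunning over a labeled sample, and seeing how many kept messages are actually on topic.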
Thanks!