Are there any easy ways to implement filtering a user's input (possibly a question) by extracting the meaningful data in the query?
I basically want to filter out any noise words so I can send a 'clean' query to Google's search API.
Jeff talked about "stop words" in one of the previous Stack Overflow podcasts. You might try searching for that phrase on Google. The Wikipedia page seems to have some overview and pointers to options.
You can try removing the top X most common English words, but you will always run into trouble with a naive approach like this.
This is because common English words can have special significance in the realm of Computer Science (or other areas). A recent SO podcast (#32) mentions this very issue.
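For what it's worth, here is a rough Python sketch of that naive approach, with an escape hatch for words that matter in your domain. The stop-word list is a tiny made-up sample, and `filter_query` and its `protected` parameter are hypothetical names for illustration, not part of any library or of Google's API.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "do", "does", "for", "how",
    "i", "in", "is", "of", "the", "to", "what",
})


def filter_query(query, stop_words=STOP_WORDS, protected=frozenset()):
    """Drop stop words from a query, keeping any word listed in `protected`."""
    # Keep word characters plus '#' and '+' so tokens like "C#" or "C++" survive.
    tokens = re.findall(r"[\w#+]+", query)
    kept = [
        t for t in tokens
        if t.lower() in protected or t.lower() not in stop_words
    ]
    return " ".join(kept)


# Naive filtering throws away "for", which is the whole point of the question.
print(filter_query("How do I write a for loop in Python?"))

# Protecting domain-significant words keeps it.
print(filter_query("How do I write a for loop in Python?", protected={"for"}))
```

The first call returns `write loop Python` and the second returns `write for loop Python`, which shows the problem: you have to know ahead of time which common words are significant, and that depends on the subject area.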
This is not as easy as it might seem:
http://www.google.com/support/bin/answer.py?hl=en&answer=981
Um, won't Google do this for you? Send all those dirty, filthy words to Google and let them clean them up for you.