I have thousands of sentences in a file. I want to extract only the valid, useful English words from them. Is this possible with Natural Language Processing?

Sample Sentence:

~@^.^@~ tic but sometimes world good famous tac Zorooooooooooo

I want to extract only the English words, like:

tic world good famous

Any advice on how I can achieve this? Thanks in advance.

+3  A: 

You can use the WordNet API for looking up words.
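
A minimal sketch of the lookup idea, assuming Python. The function names and the tiny `SAMPLE_VOCABULARY` set are hypothetical stand-ins so the example runs without any downloads; in practice the `is_known` check would be backed by a real WordNet lookup.

```python
import re
from typing import Callable, List

# Hypothetical stand-in for a dictionary; a real setup would query WordNet.
SAMPLE_VOCABULARY = {"tic", "but", "sometimes", "world", "good", "famous"}


def in_sample_vocabulary(word: str) -> bool:
    """Return True if the word is in the stand-in vocabulary (case-insensitive)."""
    return word.lower() in SAMPLE_VOCABULARY


def extract_known_words(
    sentence: str,
    is_known: Callable[[str], bool] = in_sample_vocabulary,
) -> List[str]:
    """Keep only the alphabetic tokens that the lookup function recognises."""
    # Pull out runs of ASCII letters; symbols such as "~@^.^@~" are dropped here.
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [token for token in tokens if is_known(token)]


sentence = "~@^.^@~ tic but sometimes world good famous tac Zorooooooooooo"
print(extract_known_words(sentence))
```

With NLTK installed and its WordNet corpus downloaded, one possible `is_known` is `lambda w: bool(wordnet.synsets(w))` after `from nltk.corpus import wordnet`; the words it accepts depend on the WordNet data.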

Mark Cidade
It's good, but I have text in different languages.
EarnWhileLearn
@Shahid: Your question says you're only interested in English...
Chris S
@Shahid Additionally, using WordNet (English), the valid words from that example are: {tic, but, sometimes, world, good, famous}. If there are certain words you want to avoid (i.e. non-"useful" for you), you need a stop-word list as @regexhacks described. If you want other languages, there are a handful of non-English WordNet-like libs available: http://en.wikipedia.org/wiki/WordNet#Other_languages
msbmsb
A: 

You need to compile a list of stop words (the ones you don't want to include in your search). Afterwards, you can filter your results using that stop-word list. For details, consider looking at these Wikipedia articles:

  1. http://en.wikipedia.org/wiki/Stop_words
  2. http://en.wikipedia.org/wiki/Natural_language_processing
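
The filtering step could look roughly like this in Python. This is only a sketch: `STOP_WORDS` here is a hypothetical two-word list chosen to match the sample sentence in the question, and a real list would be much longer.

```python
from typing import Iterable, List, Set

# Hypothetical stop-word list; compile your own for real data.
STOP_WORDS: Set[str] = {"but", "sometimes"}


def remove_stop_words(words: Iterable[str], stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Drop every word that appears in the stop-word list (case-insensitive)."""
    return [word for word in words if word.lower() not in stop_words]


candidates = ["tic", "but", "sometimes", "world", "good", "famous"]
print(" ".join(remove_stop_words(candidates)))
```

The stop-word list is kept as a set so each membership check is cheap even when the list grows to hundreds of entries.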
regexhacks