tags:

views:

51

answers:

3

I am doing an experimental project.

What I am trying to achieve is to find the keywords in a given text.

My current approach is to build a list of how many times each word appears in the text, sorted with the most frequently used words at the top.

The problem is that common words like "is", "was", and "were" are always at the top, and obviously these are not worth keeping as keywords.

Can you suggest some good logic for this, so that it always finds relevant keywords?

+1  A: 

Well, you could use preg_split to get the list of words and how often they occur; I'm assuming that's the part you've got working so far.
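For reference, a minimal sketch of that counting step (the sample $text is just a placeholder):

    <?php
    // Split on runs of non-letters, count each word, sort by frequency.
    $text = "This is the text. This text is just an example.";

    $words  = preg_split('/[^a-z]+/i', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words); // word => occurrences
    arsort($counts);                      // most frequent first, keys preserved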

The only thing I can think of for stripping the non-important words is to keep a dictionary of words you want to ignore, containing "a", "I", "the", "and", etc. Use this dictionary to filter out the unwanted words.
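Something along these lines (the stop-word list here is deliberately tiny; a real one would be much longer):

    <?php
    // $counts is the word => frequency map from the counting step above.
    $stopWords = array('a', 'i', 'the', 'and', 'is', 'was', 'were');

    // array_diff_key drops every entry whose word is in the stop list.
    $keywords = array_diff_key($counts, array_flip($stopWords));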

Why are you doing this? Is it for searching page content? If so, most back-end databases offer some kind of text search functionality; both MySQL and Postgres have a fulltext search engine, for example, that automatically discards the unimportant words. I'd recommend using the fulltext features of the backend database you're using, as chances are they already implement something that meets your requirements.
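As a hypothetical sketch, assuming MySQL with a pages table that has a FULLTEXT index on its content column (table, column, and connection details are all placeholders):

    <?php
    // Assumes: ALTER TABLE pages ADD FULLTEXT(content);
    $pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'password');

    $stmt = $pdo->prepare(
        'SELECT id, title FROM pages
         WHERE MATCH(content) AGAINST(? IN NATURAL LANGUAGE MODE)'
    );
    $stmt->execute(array('your search terms'));
    $results = $stmt->fetchAll(PDO::FETCH_ASSOC);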

Gordon
Yeah, I also thought of this, to ignore some known unworthy words. But the problem is I am not a native English speaker, so I am weak at basic grammar rules, and I think the list of unworthy words will be long. Can I have a list of those words (I don't know the name, maybe "first person"), like "he", "she", "I", "me"?
Arsheep
"I", "you", "he", "she", "it", "we", "they" are all personal pronouns
Mark Baker
+1  A: 

Use something like a Brill tagger to identify the different parts of speech, such as nouns. Then extract only the nouns and sort them by frequency.
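A rough sketch of that pipeline; posTag() below is a hypothetical stand-in for whatever tagger you wire in (a Brill tagger port, a web service, etc.), assumed to return (word, tag) pairs where noun tags start with "NN":

    <?php
    // $text is the input text; posTag() is a hypothetical helper,
    // returning array(array(word, tag), ...) pairs.
    $tagged = posTag($text);

    $nouns = array();
    foreach ($tagged as $pair) {
        list($word, $tag) = $pair;
        if (strpos($tag, 'NN') === 0) {  // NN, NNS, NNP... are noun tags
            $nouns[] = strtolower($word);
        }
    }

    $counts = array_count_values($nouns);
    arsort($counts); // most frequent nouns first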

Mark Baker
Very useful link
Arsheep
A: 

My first approach to something like this would be more mathematical modeling than pure programming.

There are two "simple" ways you can attack a problem like this:

a) an exclusion list (penalize a collection of words which you deem useless);

b) a weight function which, for example, builds on word length, so that small words such as prepositions ("in", "at", ...) and pronouns ("I", "you", "me", "his", ...) are penalized and hopefully fall mid-table (see the sketch below).
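A minimal sketch of option b); the log-of-length weighting is just one possible choice:

    <?php
    // $counts is a word => frequency map.
    $scores = array();
    foreach ($counts as $word => $freq) {
        // log(1) = 0, so one-letter words score nothing; short words sink.
        $scores[$word] = $freq * log(strlen($word));
    }
    arsort($scores); // highest-weighted words first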

I am not sure if this was what you were looking for, but I hope it helps. By the way, contextual text processing is a subject of active research, so you might find a number of projects that could be interesting.

posdef