I'm looking for ideas on a recommended approach.

I'm trying to scrape headlines and body text from articles on a few specific sites, similar to what Google does with Google News.

The problem is that different sites may have articles on exactly the same subject, worded slightly differently.

Can anyone point me to what I need to know in order to write a comparison algorithm that auto-detects similar articles? Is there a library available right now that can compare texts and return some kind of similarity rating?

Thanks very much in advance.

I use Python.

+3  A: 

http://en.wikipedia.org/wiki/Cosine_similarity
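As a rough sketch of the idea in Python, assuming plain word counts as the vectors (real systems often weight terms, e.g. with TF-IDF, and strip stop words): the function names `tokenize` and `cosine_similarity` are made up for illustration and are not from any particular library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts.

    Returns a float in [0.0, 1.0]; 0.0 if either text has no tokens.
    """
    vec_a = Counter(tokenize(text_a))
    vec_b = Counter(tokenize(text_b))

    # Dot product only needs the words that appear in both texts.
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[word] * vec_b[word] for word in shared)

    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A score near 1 means the two texts use nearly the same words in the same proportions; a score of 0 means they share no words at all. What counts as "similar enough" is a threshold you would have to tune on your own articles.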

Yaroslav
Thanks for the link. Is this the only way, or is it the recommended way? Would http://en.wikipedia.org/wiki/Levenshtein_distance be better or worse?
resopollution
Levenshtein distance is usually used for comparing words, not whole articles, for example in spell checkers or fuzzy search.
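To see why it fits words better than articles, here is a minimal sketch of the standard dynamic-programming version (the function name `levenshtein` is just illustrative). It counts single-character insertions, deletions, and substitutions, and its running time grows with the product of the two lengths, which gets expensive for full article bodies.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    # Keep the shorter string as b so each row stays as small as possible.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion.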
Yaroslav
I forgot to mention that finding the similarity between articles is only part of the problem. The second part is grouping similar articles. This is called clustering if you do not know in advance which groups you will produce, or classification if you do. You could look at the various Python machine learning libraries that can do this for you.
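Just to illustrate the grouping step, here is a naive single-pass sketch, not a proper clustering algorithm from any ML library. It assumes word-count cosine similarity as the measure, compares each new article only with the first article of each existing group, and uses a threshold of 0.5 that is an arbitrary placeholder. The names `word_counts`, `cosine`, and `group_similar` are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Bag-of-words vector: lowercase alphanumeric tokens and their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(vec_a, vec_b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(vec_a[word] * vec_b[word] for word in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar(texts, threshold=0.5):
    """Greedily group texts; returns a list of groups of indices into texts.

    Each text joins the first group whose representative (its first member)
    is at least `threshold` similar; otherwise it starts a new group.
    """
    vectors = [word_counts(text) for text in texts]
    groups = []
    for index, vector in enumerate(vectors):
        for group in groups:
            representative = vectors[group[0]]
            if cosine(vector, representative) >= threshold:
                group.append(index)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([index])
    return groups


headlines = [
    "Stock markets fall sharply on rate fears",
    "Markets fall sharply as rate fears grow",
    "Local team wins championship final",
]
print(group_similar(headlines))
```

The result depends on the order of the input and on the threshold, which is one reason a real clustering library is usually the better choice once you have more than a handful of articles.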
Yaroslav
thanks much :-)
resopollution