I'm looking for ideas on a recommended approach.
I'm trying to scrape headlines and body text from articles on a few specific sites, similar to what Google does with Google News.
The problem is that different sites may carry articles on exactly the same subject, worded slightly differently.
Can anyone point me to what I need to know in order to write a comparison algorithm that auto-detects similar articles? Is there an existing library that can compare two texts and return some kind of similarity rating?
Thanks very much in advance.
I use Python.
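To make the goal concrete, here is a rough sketch of the kind of similarity rating I have in mind. It uses `difflib.SequenceMatcher` from the standard library to compare word sequences. The function name `similarity` and the sample headlines are made up for illustration, and I'm not claiming this is the right technique for full articles, only that it shows the sort of score I'd like to get back.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the word sequences are identical."""
    # Compare lowercased word lists rather than raw characters,
    # so differences in casing do not affect the score.
    words_a = a.lower().split()
    words_b = b.lower().split()
    return SequenceMatcher(None, words_a, words_b).ratio()


# Hypothetical headlines, invented for this example.
headline_1 = "Storm knocks out power across the city"
headline_2 = "Storm knocks out power in much of the city"
headline_3 = "Local team wins the championship final"

print(similarity(headline_1, headline_2))  # same story, different wording
print(similarity(headline_1, headline_3))  # unrelated story
```

Ideally the first pair would score clearly higher than the second, and I could pick a threshold above which two articles count as "the same story".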