Interesting idea. I would guess this has been studied before, a look in some comp-sci journal should turn up a few good pointers. That said here are a few idea I have:
Method
You could find the most-unique key-phrases and see how well they match with the key phrases with the other articles. I would imagine the data published by google on the frequency of phrases on the web would give you baseline.
You somehow need to pickup on the fact that "in the" is a very common phrase but "Kingsley visits" is important. Once you have filtered down all the text to just the key phrases you could see how many of them match.
key phrases:
- set of all verbs, nouns, names, and novel (new/mis-spelt) words
- you could grab phrases that are say, between one and five words long
- remove all that are very common (could have classifier on common phrases)
- see how many of them match between articles.
- have a controllable slider to set the matching threshold
It's not going to be easy if you write this yourself but I would say it's a very interesting problem area.
Example
If we just using the titles and follow the method through by hand.
Ben Kingsley visits Taj Mahal with wife will create the following keywords:
- Ben Kingsley
- Kingsley
- Kingsley visits
- wife
- Mahal
- ... etc ...
but these should be removed as they are too common (hence don't help to uniquely identify the content)
once the same is done with the other title Kingsley romances wife in Taj's lawns then you can compare and find that quite a few key phrases match each other. Hence they are on the same subject.
Even though this is already a large undertaking there are many thing you could do to further the matching.
Extensions
These are all ways to trim the keyword set down once it is created.
WordNet would be a great start to looking into getting a match between say "longer" and "extend". This would be useful as articles wont use the same lexicon for their writing.
You could run a Bayesian Classfier on what counts as a key-phrase. It can be trained by having the set of all matching/non-matching articles and their key-phrases. You would have to be careful about how you deal with unseen phrases as these are likely to be the most important thing you come across. It might even be better to run it on what isn't a key-phrase.
It might even be an idea to calcluate the Levenshtein distance between some of the key-phrases if nothing else found a match. I'm guessing it is likely that there will always be some matches found.
I have a feeling that this is one of those things where a very good answer will get you a PhD. Than again, I suppose it has already been done before (google must have some automatic way to scrape all those news sites and fit them into categories and similar articles)
good luck with it.