I want to determine the similarity of the content of two news items, similar to Google News but different in the sense that I want to be able to determine what the basic topics are and then which topics are related.

So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business dealings in Iraq.

If you can just throw around keywords like k-nearest neighbours and a little explanation about why they work (if you can), I will do the rest of the research and tweak the algorithm. Just looking for a place to get started, since I know someone out there must have tried something similar before.

A: 

At the moment I am thinking of something like this.

Each non-noise-word is a dimension. Each article is represented by a vector where the words that don't appear are represented by zero and those that do appear get a value that is equal to the number of times they appear divided by the total words on the page. Then I can take Euclidean distance between each of the points in this space to get the similarity of any two articles.
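A minimal sketch of that idea (the thread names no language, so Python is assumed here; the noise-word list is illustrative only). Each article becomes a sparse term-frequency vector, and distance is taken only over the dimensions that appear in either article, since absent words contribute zero:

```python
import math
import re
from collections import Counter

# Illustrative stop-word list; a real one would be much larger.
NOISE_WORDS = {"the", "a", "is", "in", "and", "of", "to"}

def tf_vector(text):
    """Term-frequency vector: each non-noise word's count
    divided by the total number of words in the article."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in NOISE_WORDS)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def euclidean_distance(v1, v2):
    """Distance in word-dimension space; missing words count as 0."""
    dims = set(v1) | set(v2)
    return math.sqrt(sum((v1.get(d, 0.0) - v2.get(d, 0.0)) ** 2 for d in dims))

a = tf_vector("Saddam Hussein ruled Iraq")
b = tf_vector("Rumsfeld visited Iraq")
c = tf_vector("Bill Gates founded Microsoft")
assert euclidean_distance(a, b) < euclidean_distance(a, c)
```

Storing the vectors as dicts rather than full arrays keeps things manageable, since each article uses only a tiny fraction of the vocabulary's dimensions.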

The next step would be to determine clusters of the articles, and then determine a central point for each cluster. Then compute the Euclidean distance between any two clusters which gives the similarity of the topics.
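Assuming the clusters have already been found (for example by k-means), the centroid step is just a per-dimension average, and topic similarity is the same distance function applied to centroids. A toy sketch with made-up term-frequency values:

```python
import math

def centroid(vectors):
    """Central point of a cluster: average each word-dimension."""
    dims = set().union(*vectors)
    n = len(vectors)
    return {d: sum(v.get(d, 0.0) for v in vectors) / n for d in dims}

def distance(v1, v2):
    dims = set(v1) | set(v2)
    return math.sqrt(sum((v1.get(d, 0.0) - v2.get(d, 0.0)) ** 2 for d in dims))

# Two toy clusters of term-frequency vectors (values invented for illustration).
iraq_cluster = [{"iraq": 0.3, "saddam": 0.2}, {"iraq": 0.25, "rumsfeld": 0.1}]
tech_cluster = [{"microsoft": 0.3, "gates": 0.2}, {"gates": 0.25, "windows": 0.1}]

topic_distance = distance(centroid(iraq_cluster), centroid(tech_cluster))
```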

Baaah I think by typing it out I solved my own problem. Of course only in a very high level way, I am sure when I get down to it I will find problems ... the devil is always in the detail.

But comments and improvements still highly appreciated.

Ankur
+3  A: 

First thoughts:

  • toss away noise words (and, you, is, the, some, ...).
  • count all other words and sort by quantity.
  • for each word in the two articles, add a score depending on the sum (or product or some other formula) of the quantities.
  • the score represents the similarity.

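Those steps might look like this (Python assumed; the noise-word list and the product-of-counts formula are just one of the options mentioned above):

```python
import re
from collections import Counter

# Illustrative noise-word list, as in the first bullet.
NOISE = {"and", "you", "is", "the", "some", "a", "of", "in"}

def word_counts(text):
    """Count all non-noise words in an article."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in NOISE)

def similarity(text1, text2):
    """Sum the products of counts for words the two articles share."""
    c1, c2 = word_counts(text1), word_counts(text2)
    return sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())

a = "Rumsfeld met Rumsfeld critics in Iraq"
b = "Iraq policy under Rumsfeld"
c = "Bill Gates and Microsoft"
assert similarity(a, b) > similarity(a, c)
```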
It seems to me that an article primarily about Donald Rumsfeld would have those two words quite a bit, which is why I weight them highly.

However, there may be an article mentioning Warren Buffett many times with Bill Gates only once, and another mentioning both Bill Gates and Microsoft many times. The correlation there would be minimal.

Based on your comment:

So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business dealings in Iraq.

that wouldn't be the case unless the Saddam article also mentioned Iraq (or Donald).

That's where I'd start, and I can see potential holes in the theory already (an article about Bill Gates would match closely with an article about Bill Clinton if their first names are mentioned a lot). This may well be taken care of by all the other words (Microsoft for one Bill, Monica for the other :-).

I'd perhaps give it a test run before trying to introduce word-proximity functionality since that's going to make it very complicated (maybe unnecessarily).

One other possible improvement would be maintaining 'hard' associations (like always adding the word Afghanistan to articles with Osama bin Laden in them). But again, that requires extra maintenance for possibly dubious value since articles about Osama would almost certainly mention Afghanistan as well.

paxdiablo
The reason for plotting articles and taking Euclidean distance is so that you can recognise which are similar without them necessarily having the same keywords. Saddam Hussein articles and Donald Rumsfeld articles are likely to both have Baghdad in them, and so we can infer a relationship between them.
Ankur
Ankur, in the case you mention, they *would* have the same keyword (Baghdad) - they just wouldn't be Donald or Saddam in common. So that similarity would still be picked up with my "solution". Trouble is, an article about Rumsfeld and another about Donald Duck may be considered closer.
paxdiablo
Short of changing the keywords from Donald and Rumsfeld and Duck to "Donald Rumsfeld" and "Donald Duck", it won't get any better. That may be where the proximity detection becomes necessary.
paxdiablo
I think this is what you are saying: (provided enough computing power) instead of using just words, phrases can be used. So any string between two noise words can be considered a phrase, and perhaps the less common a phrase, the higher its value.
Ankur
@Ankur, that's one approach. Another (possibly easier) way would be to just use words AND groups of two words as the key values. So "Donald Duck died" contains the "words" {'donald', 'duck', 'died', 'donald duck', 'duck died'}. This would at least catch the two-word names with less effort.
paxdiablo
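A minimal sketch of that word-plus-pair tokenisation (Python assumed; the thread names no language), using the "Donald Duck died" example from the comment above:

```python
def tokens(text):
    """Unigrams plus adjacent word pairs, so two-word names
    like 'donald duck' become keys in their own right."""
    words = text.lower().split()
    pairs = [" ".join(p) for p in zip(words, words[1:])]
    return set(words) | set(pairs)

assert tokens("Donald Duck died") == {"donald", "duck", "died",
                                      "donald duck", "duck died"}
```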