First thoughts:
- toss away noise words (and, you, is, the, some, ...).
- count all other words and sort by quantity.
- for each word in the two articles, add a score depending on the sum (or product or some other formula) of the quantities.
- the score represents the similarity.
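As a rough sketch of the steps above (not a tuned implementation), something like the following Python would do. The noise-word list, the function names and the choice of "product of counts" as the formula are all assumptions made for illustration; the sum or some other formula could be swapped in just as easily.

```python
import re
from collections import Counter

# Hypothetical noise-word list; a real one would be much longer.
NOISE_WORDS = {"and", "you", "is", "the", "some", "a", "an", "of", "to", "in"}

def word_counts(text):
    """Count every non-noise word in the text, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in NOISE_WORDS)

def similarity(text_a, text_b):
    """Score two articles: for each word found in both, add the product
    of its counts in the two articles (one possible formula of many)."""
    counts_a = word_counts(text_a)
    counts_b = word_counts(text_b)
    shared = counts_a.keys() & counts_b.keys()
    return sum(counts_a[w] * counts_b[w] for w in shared)
```

With the product formula, a word that is frequent in both articles contributes far more than a word that is frequent in one and mentioned once in the other, which is the weighting behaviour I'm after.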
It seems to me that an article primarily about Donald Rumsfeld would contain those two words quite a bit, which is why I weight words by how often they appear in the article.
However, there may be an article mentioning Warren Buffett many times with Bill Gates once, and another mentioning both Bill Gates and Microsoft many times. The correlation there would be minimal.
Based on your comment:
So if an article was about Saddam Hussein, then the algorithm might recommend something about Donald Rumsfeld's business dealings in Iraq.
that wouldn't be the case unless the Saddam article also mentioned Iraq (or Donald).
That's where I'd start, and I can see potential holes in the theory already (an article about Bill Gates would match closely with an article about Bill Clinton if their first names are mentioned a lot). This may well be taken care of by all the other words (Microsoft for one Bill, Monica for the other :-).
I'd perhaps give it a test run before trying to introduce word-proximity functionality since that's going to make it very complicated (maybe unnecessarily).
One other possible improvement would be maintaining 'hard' associations (like always adding the word Afghanistan to articles with Osama bin Laden in them). But again, that requires extra maintenance for possibly dubious value since articles about Osama would almost certainly mention Afghanistan as well.
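If you did want to try the 'hard' associations, a minimal sketch might look like this. The table contents and the function name are hypothetical; the idea is just to append the associated words to the article text before the words are counted.

```python
# Hypothetical association table: trigger phrase -> words always added.
HARD_ASSOCIATIONS = {
    "osama bin laden": ["afghanistan"],
}

def add_hard_associations(text, associations=HARD_ASSOCIATIONS):
    """Return the text with associated words appended once for each
    trigger phrase that occurs in it (case-insensitive match)."""
    lowered = text.lower()
    extras = []
    for phrase, words in associations.items():
        if phrase in lowered:
            extras.extend(words)
    if not extras:
        return text
    return text + " " + " ".join(extras)
```

The table is the part that needs ongoing maintenance, which is why I'd only add it if the test run shows the plain word counts aren't good enough.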