Hi, I collect news for certain topics and then run a Bayesian classifier on them to mark them as interesting or non-interesting. I see that there are different articles which are essentially the same news story, e.g.:

  • Ben Kingsley visits Taj Mahal with wife
  • Kingsley romances wife in Taj's lawns

How do I teach the system to mark all these as duplicates?

Thanks Sanjay

+4  A: 

Interesting idea. I would guess this has been studied before; a look in some comp-sci journal should turn up a few good pointers. That said, here are a few ideas I have:

Method

You could find the most unique key phrases in each article and see how well they match the key phrases of the other articles. I would imagine the data published by Google on the frequency of phrases on the web would give you a baseline.

You somehow need to pick up on the fact that "in the" is a very common phrase but "Kingsley visits" is important. Once you have filtered all the text down to just the key phrases, you can see how many of them match.

key phrases:

  • set of all verbs, nouns, names, and novel (new/mis-spelt) words
  • you could grab phrases that are, say, between one and five words long
  • remove all that are very common (you could have a classifier for common phrases)
  • see how many of them match between articles.
  • have a controllable slider to set the matching threshold

It's not going to be easy if you write this yourself, but I would say it's a very interesting problem area.
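
A rough sketch of the method in Python, to make it concrete. The stop list, the overlap measure, and the threshold here are all toy placeholders for the corpus-derived versions described above:

    import re

    # Toy stop list of "very common" phrases; in practice this would come
    # from corpus statistics, e.g. published web n-gram frequency data.
    COMMON = {"in", "the", "with", "ben", "in the", "with wife"}

    def key_phrases(text, max_len=5):
        """All one-to-five-word n-grams of the text, minus the common ones."""
        words = re.findall(r"[a-z']+", text.lower())
        phrases = set()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrases.add(" ".join(words[i:i + n]))
        return phrases - COMMON

    def similarity(a, b):
        """Overlap coefficient: shared phrases over the smaller phrase set."""
        pa, pb = key_phrases(a), key_phrases(b)
        if not pa or not pb:
            return 0.0
        return len(pa & pb) / min(len(pa), len(pb))

    THRESHOLD = 0.1  # the controllable slider

    t1 = "Ben Kingsley visits Taj Mahal with wife"
    t2 = "Kingsley romances wife in Taj's lawns"
    score = similarity(t1, t2)
    print(score, score >= THRESHOLD)  # roughly 0.1 here, so a (weak) match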

Example

Suppose we use just the titles and follow the method through by hand.

"Ben Kingsley visits Taj Mahal with wife" will create the following key phrases:

  • Ben Kingsley
  • Kingsley
  • Kingsley visits
  • wife
  • Mahal
  • ... etc ...

but these should be removed as they are too common (hence they don't help to uniquely identify the content):

  • Ben
  • with wife

Once the same is done with the other title, "Kingsley romances wife in Taj's lawns", you can compare the two and find that quite a few key phrases match each other. Hence the articles are on the same subject.

Even though this is already a large undertaking, there are many things you could do to further improve the matching.

Extensions

These are all ways to trim the keyword set down once it is created.

  1. WordNet would be a great start for getting a match between, say, "longer" and "extend" (see the sketch after this list). This would be useful because articles won't use the same lexicon in their writing.

  2. You could run a Bayesian classifier on what counts as a key phrase (also sketched below). It could be trained on a set of matching/non-matching articles and their key phrases. You would have to be careful about how you deal with unseen phrases, as these are likely to be the most important things you come across. It might even be better to run it on what isn't a key phrase.

  3. It might even be an idea to calculate the Levenshtein distance between some of the key phrases if nothing else finds a match (see the sketch below). I'm guessing it is likely that there will always be some matches found.
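
For extensions 1 and 3, here is a rough sketch of a fallback matcher. It assumes NLTK with the WordNet corpus downloaded; the 0.8 similarity cut-off and the edit-distance limit of 2 are arbitrary choices:

    from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

    def wordnet_related(w1, w2, cutoff=0.8):
        """True if any pair of senses is close in the WordNet hierarchy.
        Phrases WordNet doesn't know simply yield no synsets."""
        for s1 in wn.synsets(w1):
            for s2 in wn.synsets(w2):
                sim = s1.wup_similarity(s2)  # may be None across parts of speech
                if sim is not None and sim >= cutoff:
                    return True
        return False

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def phrases_match(p1, p2):
        """Fallback chain: exact match, then WordNet, then edit distance."""
        if p1 == p2:
            return True
        if wordnet_related(p1, p2):
            return True
        return levenshtein(p1, p2) <= 2  # catches misspellings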
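
For extension 2, a sketch of a naive Bayes key-phrase classifier, assuming scikit-learn. The labelled phrases are invented; character n-grams are one way to give unseen or misspelt phrases some overlap with the training data:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Invented training data: 1 = helped identify the subject, 0 = filler.
    phrases = ["kingsley visits", "taj mahal", "in the", "with wife",
               "ben kingsley", "said on", "romances wife", "is a"]
    labels = [1, 1, 0, 0, 1, 0, 1, 0]

    # Character n-grams rather than whole words, so that a never-seen
    # (or misspelt) phrase still shares features with the training set.
    clf = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        MultinomialNB())
    clf.fit(phrases, labels)

    print(clf.predict(["kingsly visits taj"]))  # misspelt, but still scorable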

I have a feeling that this is one of those things where a very good answer will get you a PhD. Then again, I suppose it has already been done before (Google must have some automatic way to scrape all those news sites and fit them into categories and similar articles).

Good luck with it.

James Brooks
This is pretty good insight into the problem. I am guessing that the way Bayesian classification works, it would find the key phrases and then see how closely they resemble the already-trained data in order to classify the item. What I want is for the two strings to be matched on how many key phrases are similar between them, to mark them as duplicates. This looks similar to a Bayesian algorithm, only it should not need any training data. I would prefer to find an existing solution that can be tweaked rather than doing the PhD myself ;-)
Sanjay
I am unaware of a system that already exists but there must be one out there somewhere.
James Brooks
+2  A: 

This is a classification problem, but a harder one given the number of distinct classes you will have. One option might be to reduce the size of each document using feature selection (more info). Feature selection involves selecting the top n terms, excluding stop words and possibly applying stemming to each word as well. Do this by calculating, for each document, the mutual information (more info) of each term, ordering the terms by that score, and selecting the top n terms for each document. This reduced feature set of top n terms per document can then form the basis for your duplicate detection (for example, flag documents with more than x% common terms, with x again calibrated through backtesting).
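
A toy sketch of that pipeline, using pointwise mutual information between term and document as the scoring function (one way to instantiate the MI ranking; the stop list, n, and x here are invented):

    import math
    from collections import Counter

    STOP = {"the", "a", "an", "in", "of", "with", "and", "to", "on"}

    def top_terms(docs, n=10):
        """Reduce each document to its top-n terms by PMI(term, document)."""
        tokenised = [[w for w in doc.lower().split() if w not in STOP]
                     for doc in docs]
        corpus = Counter(w for doc in tokenised for w in doc)
        total = sum(corpus.values())
        features = []
        for doc in tokenised:
            counts = Counter(doc)
            # PMI(t, d) = log( P(t | d) / P(t) )
            pmi = {t: math.log((c / len(doc)) / (corpus[t] / total))
                   for t, c in counts.items()}
            features.append(set(sorted(pmi, key=pmi.get, reverse=True)[:n]))
        return features

    def duplicate_pairs(docs, x=0.5):
        """Document pairs sharing more than x of their reduced term sets."""
        feats = top_terms(docs)
        pairs = []
        for i in range(len(feats)):
            for j in range(i + 1, len(feats)):
                if not feats[i] or not feats[j]:
                    continue
                shared = (len(feats[i] & feats[j])
                          / min(len(feats[i]), len(feats[j])))
                if shared > x:
                    pairs.append((i, j))
        return pairs

    docs = ["Ben Kingsley visits Taj Mahal with wife",
            "Kingsley romances wife in Taj's lawns",
            "Stock markets fall on weak earnings"]
    print(duplicate_pairs(docs, x=0.2))  # [(0, 1)]: the two Kingsley headlines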

Most of this is covered in this free book on information retrieval.

Joel