How do news outlets like Google News automatically classify and rank documents about emerging topics, like "Obama's 2011 budget"?

I've got a pile of articles tagged with baseball data like player names and their relevance to the article (thanks, OpenCalais), and would love to create a Google News-style interface that ranks and displays new posts as they come in, especially for emerging topics. I suppose a naive Bayes classifier could be trained with some static categories, but that doesn't really allow for tracking trends like "this player was just traded to this team, and these other players were also involved."

+4  A: 

Google News no doubt uses other tricks (or a combination of them), but one computationally cheap way to infer topics from free text exploits the NLP notion that a word gets its meaning only from the words it is connected to.
An algorithm capable of discovering new topic categories from multiple documents could be outlined as follows:

  • POS (part-of-speech) tag the text
    We probably want to focus on nouns, and even more so on named entities (such as Obama or New England).
  • Normalize the text
    In particular, replace inflected words by their common stem. Maybe even replace some adjectives by a corresponding named entity or noun (e.g. Parisian ==> Paris, legal ==> law).
    Also, remove noise words and noise expressions.
  • Identify words from a manually maintained list of "current / recurring hot words" (Superbowl, Elections, scandal...)
    This can be used in subsequent steps to give more weight to some N-grams.
  • Enumerate all N-grams found in each document (where N ranges from 1 to, say, 4 or 5)
    Be sure to count, separately, the number of occurrences of each N-gram within a given document and the number of documents which cite a given N-gram.
  • The most frequently cited N-grams (i.e. the ones cited in the most documents) are probably the topics.
  • Identify the existing topics (by matching against a list of known topics); the remainder are candidate new topics
  • [optionally] Manually review the new topics
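
The counting steps above could be sketched roughly as follows in Python. This is only a minimal illustration, not what Google News actually does: it skips POS tagging, stemming, and hot-word weighting entirely, and the names `STOP_WORDS`, `tokenize`, `count_ngrams`, and `candidate_topics`, as well as the thresholds, are made up for the example.

```python
import re
from collections import Counter

# Hypothetical, tiny noise-word list; a real system would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "for", "on", "is", "was"}

def tokenize(text):
    """Lowercase, keep letter/digit runs, drop noise words (no stemming or POS tagging)."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]

def ngrams(tokens, max_n=3):
    """Yield every N-gram (as a tuple of words) for N = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def count_ngrams(documents, max_n=3):
    """Return (per_doc_counts, doc_freq).

    per_doc_counts[i][g] is the number of occurrences of N-gram g in document i;
    doc_freq[g] is the number of documents that contain g at least once.
    """
    per_doc_counts = []
    doc_freq = Counter()
    for text in documents:
        counts = Counter(ngrams(tokenize(text), max_n))
        per_doc_counts.append(counts)
        doc_freq.update(counts.keys())  # each document counts once per N-gram
    return per_doc_counts, doc_freq

def candidate_topics(doc_freq, min_docs=2, min_n=2):
    """N-grams of at least min_n words cited in at least min_docs documents, most cited first."""
    hits = [(g, df) for g, df in doc_freq.items() if len(g) >= min_n and df >= min_docs]
    return sorted(hits, key=lambda item: (-item[1], -len(item[0]), item[0]))

if __name__ == "__main__":
    docs = [
        "Obama unveils 2011 budget",
        "Congress debates Obama 2011 budget plan",
        "Yankees win again",
    ]
    _, doc_freq = count_ngrams(docs)
    for gram, df in candidate_topics(doc_freq):
        print(" ".join(gram), df)
```

Restricting candidates to N-grams of two or more words is an arbitrary choice here; single words tend to be too generic to serve as topics on their own, but they are still counted and could be used for weighting.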

This general recipe can also be altered to leverage other attributes of the documents and the text therein. For example, the document origin (say cnn/sports vs. cnn/politics ...) can be used to select domain-specific lexicons. As another example, the process can more or less heavily emphasize the words and expressions from the document title (or from other areas of the text with particular mark-up).
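
As a rough illustration of the title-emphasis idea, here is a small Python sketch that gives N-grams from the title a higher weight than those from the body. The function name `weighted_ngram_scores` and the default `title_weight` of 3.0 are assumptions made for the example; the right weight would have to be tuned on real data.

```python
from collections import Counter

def weighted_ngram_scores(title_tokens, body_tokens, title_weight=3.0, max_n=2):
    """Score N-grams in one document, counting title occurrences more heavily.

    Each body occurrence adds 1.0; each title occurrence adds title_weight.
    Tokens are assumed to be already normalized (lowercased, stemmed, etc.).
    """
    def grams(tokens):
        return [
            tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)
        ]

    scores = Counter()
    for g in grams(body_tokens):
        scores[g] += 1.0
    for g in grams(title_tokens):
        scores[g] += title_weight
    return scores

if __name__ == "__main__":
    scores = weighted_ngram_scores(
        ["budget", "vote"],
        ["senate", "schedules", "budget", "vote", "today"],
    )
    for gram, score in scores.most_common(3):
        print(" ".join(gram), score)
```

The same pattern extends to other marked-up regions (headings, captions) by adding one weighted pass per region.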

mjv
+2  A: 

The main algorithms behind Google News have been published in the academic literature by Google researchers:

Tristan