A webapp called StatSheet got funded today:

http://techcrunch.com/2010/08/04/former-crunchies-finalist-statsheet-recieves-1-3-million-in-series-a/

They are doing 'automated journalism': using computers to generate human-looking reports of sports games from the statistics.

http://www.guardian.co.uk/media/pda/2010/mar/30/digital-media-algorithms-reporting-journalism

Does anyone have any insight into what approach or algorithms are being used to do this, and how it might be replicated?

+5  A: 

The details for projects like this are a little sparse, but it looks like the baseball summarizer Stats Monkey consists of:

  1. Statistical model: They build a model of how baseball games typically unfold, most likely by looking at how certain variables (e.g. runs, at-bats, etc.) change during the course of a game or differ from what you'd expect to see going into the game (e.g. a no-name team scores more runs than a highly favored team). How well a given game fits (or doesn't fit) this model gives them an idea of what might be interesting about that game (e.g. key plays or players).

  2. Text generation: Given a library of pre-written narrative arcs (e.g. back-and-forth game, come-from-behind victory, etc.) they use the "interesting information" from the model of the game to construct a summary of the game. I'm not sure, but it looks like they use a decision tree -- conditioned on the information from the model -- to select one of these arcs.

  3. Miscellaneous glue: This isn't mentioned in their writeup, but I'd imagine that there are a fair number of hard-coded rules that "glue" the main narrative arcs into a single, cohesive story.
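
To make the three steps concrete, here is a minimal Python sketch of how such a pipeline could be wired together. It is not Stats Monkey's actual code: the `Game` class, the feature names, the thresholds (3-run deficit, 2 lead changes, 5-run margin, 0.35 pregame probability) and the templates are all made up for illustration, and the "statistical model" is reduced to a few hand-computed features from the inning-by-inning line score. It also assumes the game did not end in a tie.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Game:
    # Hypothetical input format: runs scored per inning for each team.
    home: str
    away: str
    home_runs_by_inning: List[int]
    away_runs_by_inning: List[int]
    home_win_prob_pregame: float  # assumed pregame expectation, 0..1


def running_totals(runs: List[int]) -> List[int]:
    totals, total = [], 0
    for r in runs:
        total += r
        totals.append(total)
    return totals


def extract_features(game: Game) -> Dict[str, object]:
    """Step 1 (stand-in for the statistical model): find what is unusual."""
    home = running_totals(game.home_runs_by_inning)
    away = running_totals(game.away_runs_by_inning)
    margins = [h - a for h, a in zip(home, away)]  # home lead after each inning
    home_won = margins[-1] > 0  # assumes no tie
    winner_margins = margins if home_won else [-m for m in margins]
    # A lead change is a sign flip between consecutive non-zero margins.
    signs = [1 if m > 0 else -1 for m in margins if m != 0]
    lead_changes = sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    p_home = game.home_win_prob_pregame
    return {
        "winner": game.home if home_won else game.away,
        "loser": game.away if home_won else game.home,
        "winner_score": max(home[-1], away[-1]),
        "loser_score": min(home[-1], away[-1]),
        "max_deficit": max(0, -min(winner_margins)),
        "lead_changes": lead_changes,
        "final_margin": abs(margins[-1]),
        "upset": (p_home if home_won else 1 - p_home) < 0.35,
    }


def select_arc(f: Dict[str, object]) -> str:
    """Step 2: a tiny hand-written decision tree over the features."""
    if f["max_deficit"] >= 3:
        return "comeback"
    if f["lead_changes"] >= 2:
        return "back_and_forth"
    if f["final_margin"] >= 5:
        return "blowout"
    return "routine"


TEMPLATES = {
    "comeback": "{winner} rallied from {max_deficit} runs down to beat "
                "{loser} {winner_score}-{loser_score}.",
    "back_and_forth": "{winner} edged {loser} {winner_score}-{loser_score} "
                      "in a game with {lead_changes} lead changes.",
    "blowout": "{winner} routed {loser} {winner_score}-{loser_score}.",
    "routine": "{winner} beat {loser} {winner_score}-{loser_score}.",
}

# Step 3: "glue" rules that append extra sentences to the chosen arc.
UPSET_GLUE = " Few expected {winner} to win going into the game."


def summarize(game: Game) -> str:
    f = extract_features(game)
    text = TEMPLATES[select_arc(f)].format(**f)
    if f["upset"]:
        text += UPSET_GLUE.format(**f)
    return text


if __name__ == "__main__":
    demo = Game("Hawks", "Owls",
                [0, 0, 0, 0, 0, 2, 0, 3, 1],
                [1, 2, 1, 0, 0, 0, 0, 0, 0],
                home_win_prob_pregame=0.6)
    print(summarize(demo))
```

A real system would presumably learn the expectations and thresholds from historical game data rather than hard-coding them, and would have many more arcs and glue rules, but the overall shape (features, then arc selection, then template filling) would likely be similar.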

The authors of Stats Monkey have done a fair amount of research in related areas, like website summarization and automatic content aggregation and generation. Here are a few papers that might be interesting:

Nate Kohl