I run a website that lets users write blog posts. I would like to summarize the written content automatically and use the summary to fill, for example, the <meta name="description".../> tag.

What methods can I employ to automatically summarize/describe the contents of user generated content?
Are there any (preferably free) methods out there that have solved this problem?

(I've seen other websites just copy the first 100 or so words but this strikes me as a sub-optimal solution.)

+1  A: 

I might try Amazon Mechanical Turk or one of the many other crowdsourcing options.

Mark P Neyer
A: 

This borders on artificial intelligence, so there's not going to be an "easy" solution out there, but there are products that target this problem.

Check out Copernic Summarizer, for one.

David
+1  A: 

Another item to check out is a SourceForge project, the AutoSummary Semantic Analysis Engine.

David
Looks promising
Jacco
+1  A: 

This is not a trivial task. You should look for articles or books on "extractive summarization".

A few starters could be:

Books:

Articles:

Fernando
The "how to identify the gist of a text" paper also has software available: http://www.icmc.usp.br/~taspardo/GistSumm.htm
Nate Kohl
Also, the MEAD project (http://www.summarization.com/mead/) by some folks at the University of Michigan looks like it has software available, although the link is down right now.
Nate Kohl
+4  A: 

Make it predictable.

From a user's perspective, simply using the first paragraph is not bad at all. Any automation is bound to fall flat in some cases. So I suggest displaying the first paragraph (maybe truncated at some point) as the summary and letting the author override it through an optional field.
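A minimal sketch of that idea in Python (the site in question is PHP, so treat this as pseudocode to port). The function name `meta_description`, the `override` parameter, and the 155-character limit are assumptions made for illustration; paragraphs are assumed to be separated by blank lines.

```python
import html
import re


def meta_description(body, override=None, max_len=155):
    """Return an HTML-escaped description for a post.

    Uses the author's override when one is given, otherwise the first
    non-empty paragraph, truncated at a word boundary.
    """
    text = (override or "").strip()
    if not text:
        # Assumption: paragraphs are separated by one or more blank lines.
        paragraphs = [p for p in re.split(r"\n\s*\n", body) if p.strip()]
        text = paragraphs[0] if paragraphs else ""

    # Collapse runs of whitespace (including newlines) into single spaces.
    text = " ".join(text.split())

    if len(text) > max_len:
        # Leave room for the ellipsis and cut at the last space before the limit.
        cut = text[: max_len - 3].rsplit(" ", 1)[0]
        text = cut.rstrip(",;:.") + "..."

    # Escape so the result is safe inside content="...".
    return html.escape(text, quote=True)
```

The result can be placed directly in the content attribute of the description tag. The length limit applies before escaping, so the escaped string may be slightly longer.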

phoku
+6  A: 

Think of the task of summarization as a challenge to 'select the most important sentences' from the document.

The Automatic Creation of Literature Abstracts by H.P. Luhn (1958) describes a naive method that actually performs quite well. Try giving it a shot.

If your website is written in Python, coding this algorithm with NLTK (the Natural Language Toolkit) is a fun task.
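Here is a rough, simplified sketch of a Luhn-style scorer using only the standard library, so it is not tied to NLTK. It is an approximation rather than a faithful reproduction of the paper: the stopword list, the `min_count` threshold, the naive sentence splitter, and all function names are assumptions made for illustration. Words that occur often (and are not stopwords) count as significant; each sentence is scored by its best cluster of significant words, as (significant words in cluster)^2 / cluster length.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; use a fuller one in practice).
STOPWORDS = frozenset(
    "a an and are as at be but by for from has have he in is it its of on or "
    "she that the their they this to was were which will with".split()
)


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    return re.findall(r"[a-z0-9']+", sentence.lower())


def significant_words(sentences, min_count=2):
    # A word is "significant" if it is not a stopword and occurs often enough.
    counts = Counter(
        w for s in sentences for w in tokenize(s) if w not in STOPWORDS
    )
    return {w for w, c in counts.items() if c >= min_count}


def luhn_score(words, significant, max_gap=4):
    # Group significant words into clusters in which neighbours are separated
    # by at most max_gap insignificant words; return the best cluster score.
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1
        else:
            best = max(best, count * count / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count * count / (prev - start + 1))


def summarize(text, max_sentences=2):
    sentences = split_sentences(text)
    significant = significant_words(sentences)
    scored = [
        (luhn_score(tokenize(s), significant), i)
        for i, s in enumerate(sentences)
    ]
    # Highest scores first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:max_sentences]
    # Emit the chosen sentences in their original order.
    return " ".join(sentences[i] for _, i in sorted(top, key=lambda t: t[1]))
```

Calling `summarize(post_text, max_sentences=1)` would give a one-sentence candidate for a meta description. Because whole sentences are selected, the output stays readable even when the scoring is crude.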

theycallmemorty
Unfortunately it is in PHP (+1)
Jacco
+1  A: 

Yahoo has a free API for this: http://developer.yahoo.com/search/content/V1/termExtraction.html

Eugene Osovetsky
This service extracts keywords from a given string. Nice, but it does not answer the question.
Jacco
A: 

Noun phrases tend to be the important elements of a sentence. Picking the sentence(s) with a high density of noun phrases could yield a good summary. You can find noun phrases using a part-of-speech (POS) tagger.

A good summary should consist of complete, meaningful sentences. Reading a broken sentence is slightly jarring.
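A small sketch of the density idea, assuming the sentences have already been tagged by some POS tagger and arrive as lists of (word, tag) pairs with Penn Treebank tags. In Python, `nltk.pos_tag` produces this shape once its models are downloaded. The chunking rule here (a run of determiner, adjective, and noun tags containing at least one noun) is a crude stand-in for a real noun-phrase chunker, and the function names are hypothetical.

```python
def noun_phrase_count(tagged):
    """Count maximal runs of DT/JJ*/NN* tags that contain at least one noun."""
    count = 0
    in_run = False
    has_noun = False
    for _, tag in tagged:
        if tag.startswith("NN") or tag.startswith("JJ") or tag == "DT":
            in_run = True
            has_noun = has_noun or tag.startswith("NN")
        else:
            # The run (if any) ends here; count it only if it held a noun.
            if in_run and has_noun:
                count += 1
            in_run = has_noun = False
    if in_run and has_noun:
        count += 1
    return count


def np_density(tagged):
    # Noun phrases per token; 0.0 for an empty sentence.
    return noun_phrase_count(tagged) / len(tagged) if tagged else 0.0


def best_sentence(tagged_sentences):
    """Return the complete sentence with the highest noun-phrase density."""
    if not tagged_sentences:
        return ""
    best = max(tagged_sentences, key=np_density)
    return " ".join(word for word, _ in best)
```

Since a whole sentence is returned, the result is never a broken fragment. A per-token density favors short sentences, so a minimum sentence length may be worth adding.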

Shashikant Kore
+1  A: 

Apple's patent 6424362 - Auto-summary of document content contains sample code which might be useful...

Stobor
Somehow, using code from a patent doesn't seem the best legal choice.
Jacco
A: 

Alternatively, when posting the article, the author can highlight the keywords that should appear in the description, and these can then be placed in the meta description tag automatically.

vikramjb
I've been thinking about this option, but I would like to keep the system as easy as possible for the user, so it is not feasible here. (It is great for paid contributions and the like, but not for my audience.)
Jacco