Hello!

The German website nandoo.net lets you shorten a news article: as you change the percentage value with a slider, the text changes and some sentences are left out.

You can see that in action here:

http://www.nandoo.net/read/article/299925/

The news article is on the left side and tags are marked. The slider is at the top of the second column. The further you move the slider to the left, the shorter the text becomes.

How can you offer something like that? Are there any algorithms which you can use to achieve that?

My idea was that their algorithm counts the number of tags and nouns in each sentence. Then the sentences with the fewest tags/nouns are left out.

Could that be true? Or do you have another idea?

I hope you can help me. Thanks in advance!

+2  A: 

Usually you want to keep the sentences that have words that are more unique to that article.

That is, the more "generic" the sentence is, the less it describes this particular article.

The normal way to do this is Bayesian analysis much like a spam-filter. First determine which words in the entire article appear more often than you'd expect, then find the sentences that feature those words.
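As a rough illustration of that idea, here is a minimal sketch in Python. It is not nandoo.net's actual algorithm, and it uses plain frequency counting rather than a full Bayesian analysis: each sentence is scored by the average frequency of its words within the article, and the highest-scoring fraction is kept in original order.

```python
import re
from collections import Counter

def summarize(text, keep_ratio=0.5):
    """Keep the sentences whose words are most characteristic of the article.

    Naive sketch: score each sentence by the mean in-article frequency of
    its longer words, then keep the top `keep_ratio` fraction of sentences,
    preserving their original order.
    """
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'\w+', text.lower())
    # Count only words longer than 3 letters as a crude stop-word filter.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        tokens = re.findall(r'\w+', sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    n_keep = max(1, round(len(sentences) * keep_ratio))
    keep = set(sorted(sentences, key=score, reverse=True)[:n_keep])
    return ' '.join(s for s in sentences if s in keep)
```

The `keep_ratio` parameter plays the role of the slider: moving it down drops the lowest-scoring sentences first.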

Jason Cohen
Thank you! Then you only have to store the number of occurrences of every word in your database. That's no problem. But why do you need a Bayesian analysis? You can go through the text, look up each word's frequency, and total them for every sentence. Right?
You shouldn't use pure counts because words that are naturally more abundant are *expected* to have high counts, whereas you're looking for words in which the counts are high *relative* to expected. Bayesian analysis does exactly that.
Jason Cohen
Thx! So I select the average number of occurrences of the words from the database. Then I determine which words appear more often in this text than on average. Finally, I select the sentences which contain these unexpectedly frequent words. Right?
That's better but still not exactly the right math. See this for details: http://en.wikipedia.org/wiki/Bayesian_probability
Jason Cohen
+2  A: 

This is a hot research topic in Computational Linguistics. The shallow approach, using Bayesian Filtering, is not likely to yield perfect results - but you probably don't need perfect results anyway.

In CL, the 80-20 rule quickly becomes the 95-5 rule, so if you are content with what you can achieve through shallow methods, skip this answer.

If you want to see whether you can improve on your results, you could try to find some better resources. The task you're referring to is called 'text summarization' in the research community, and it has its own web page, which is hopelessly outdated. Mani and Maybury (1999) is probably a good overview (I haven't read it myself), but also quite antiquated. More recent is Martin Hassel's dissertation on the topic, which is also quite exhaustive and covers language-independent (read: statistical, i.e. shallow) methods.

As always, Google will be able to help you, too. Just search for text summarization.

Aleksandar Dimitrov
Thanks, so I know what to do if Bayesian Filtering gives insufficient results.