views: 560
answers: 6

How to find out the summarized text for a given URL?

What do I mean by summarized text?

Merck $41.1 Billion Schering-Plough Bid Seeks Science

Link Description

Merck & Co.’s $41.1 billion purchase of Schering-Plough Corp. adds experimental drugs for blood clots, infections and schizophrenia and allows the companies to speed research on biotechnology drugs.

For the above URL, the three lines shown are the summary text: a short two-to-three-line description of the URL, which we usually obtain by fetching that page, examining its content, and then extracting a short description from the HTML markup.

Are there any good algorithms that do this? (or)
Are there any good libraries in Python/Django that do this?

+4  A: 

Text summarization is a fairly complicated topic. If you have a need to do this in a serious way, you may wish to look at projects like Lemur (http://www.lemurproject.org/).

However, what I suspect you really want here is a text abstract. If you know what part of the document contains the body text, locate it using an HTML parsing library like BeautifulSoup, strip out the HTML, and take the first sentence or first N characters (whichever suits best). Sort of a poor cousin's abstract-generator :-)
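A minimal sketch of that poor-cousin approach, using only the standard library's html.parser in place of BeautifulSoup (the choice of p tags and the 160-character limit are assumptions for illustration):

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text that appears inside <p> tags."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)

def abstract(html, max_chars=160):
    """Return the first max_chars characters of the page's paragraph text."""
    parser = ParagraphText()
    parser.feed(html)
    text = " ".join(" ".join(parser.chunks).split())  # normalize whitespace
    return text[:max_chars]

page = ("<html><body><div>site navigation</div>"
        "<p>Merck's purchase adds experimental drugs for blood clots.</p>"
        "<p>The deal speeds research on biotechnology drugs.</p></body></html>")
print(abstract(page))
```

Anything outside the paragraph tags (navigation, menus) is dropped, which is usually what you want for a short page description.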

Jarret Hardie
A: 

Your best bet in this case would be to use an HTML parsing library like BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/).

From there, you can fetch, for example, all of the page's p tags:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.bloomberg.com/apps/news?pid=20601103&sid=a8p0FQHnw.Yo&refer=us")
soup = BeautifulSoup(page)
soup.findAll('p')

And then do some parsing from there. It depends entirely on the page, as every site is structured differently. You may get lucky on some sites and simply find a p tag with an id of "summary", while others (like Bloomberg) might require a bit more playing around.

Bartek
A: 

You can also try PyQuery.

+3  A: 

Check out the Natural Language Toolkit (NLTK). It's a very useful Python library if you're doing any text processing.

Then look at this paper by H. P. Luhn (1958). It describes a naive but effective method of generating summaries of text.

Use the nltk.probability.FreqDist object to track how often words appear in the text, then score sentences according to how many of the most frequent words appear in them. Finally, select the sentences with the best scores and, voilà, you have a summary of the document.
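A toy version of that scoring scheme, using collections.Counter as a stand-in for nltk's FreqDist and a naive regex sentence split (the top-10 word cutoff is an arbitrary assumption, not Luhn's exact parameters):

```python
import re
from collections import Counter

def summarize(text, n=2):
    """Return the n sentences containing the most high-frequency words,
    in their original order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))
    top = {w for w, _ in freq.most_common(10)}  # "most frequent" cutoff: assumption
    # Score each sentence by how many of its words are in the frequent set.
    scored = [(sum(w in top for w in re.findall(r'\w+', s.lower())), i, s)
              for i, s in enumerate(sentences)]
    # Take the n best scores, then restore original sentence order.
    best = sorted(sorted(scored, reverse=True)[:n], key=lambda t: t[1])
    return ' '.join(s for _, _, s in best)

text = ("Python is great. Python powers many tools. "
        "Python is everywhere. Cats sleep.")
print(summarize(text, 2))
```

With real documents you would swap the regex tokenization for nltk's sentence and word tokenizers, and filter out stopwords before counting.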

I suspect NLTK has a means of loading documents from the web and stripping out HTML tags. I haven't done that kind of thing myself, but if you look at the corpus readers you might find something helpful.

theycallmemorty
A: 

Thanks for those pointers. I will look at them and try to solve this problem.

Rama Vadakattu
@Rama - Just so you know why you were downvoted: usually the best way to thank people is to up-vote or accept their answers. The answer section should be used only for answers to the question asked, although it's okay to answer your own question.
Tristan Havelick
+2  A: 

I had the same need, and although Lemur has summarization capabilities, I found it buggy to the point of being unusable. Over the weekend I used NLTK to code up a summarizer module in Python: http://tristanhavelick.com/summarize.zip

I took the algorithm from the Java library Classifier4J (http://classifier4j.sourceforge.net/), but used NLTK and idiomatic Python wherever possible.

Here is the basic usage:

>>> import summarize

A SimpleSummarizer (currently the only summarizer) makes a summary by using sentences with the most frequent words:

>>> ss = summarize.SimpleSummarizer()
>>> input = "NLTK is a python library for working with human-written text. Summarize is a package that uses NLTK to create summaries."
>>> ss.summarize(input, 1)
'NLTK is a python library for working with human-written text.'

You can specify as many sentences in the summary as you like.

>>> input = "NLTK is a python library for working with human-written text. Summarize is a package that uses NLTK to create summaries. A Summariser is really cool. I don't think there are any other python summarisers."
>>> ss.summarize(input, 2)
"NLTK is a python library for working with human-written text.  I don't think there are any other python summarisers."

Unlike the original algorithm from Classifier4J, this summarizer works correctly with punctuation other than periods:

>>> input = "NLTK is a python library for working with human-written text! Summarize is a package that uses NLTK to create summaries."
>>> ss.summarize(input, 1)
'NLTK is a python library for working with human-written text!'

I'm currently retaining copyright for this module (as much as is allowed, anyway), but I should be releasing it under some kind of open-source license very soon. If you want to use it in your software, let me know and I'll speed up the process.

Tristan Havelick