ansaurus

Question

Getting the (a) title, (b) summary, and (c) relevant images of a web page, à la Facebook status updates

Answer 1

+2 A:

BeautifulSoup is well suited to most of this process.

You can initialize the soup object, then do something like this to rip out the tags you are interested in:

title = soup.find('title')    # a page has one <title>, so find() is enough
images = soup.findAll('img')  # every <img> tag on the page

Then you could download each image from its URL using urllib2.
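To make the extraction step concrete, here is a minimal sketch. It swaps in Python's standard-library html.parser for BeautifulSoup so that it runs without extra installs; the class name TitleAndImages, the sample markup, and the example.com URLs are all made up for illustration. It collects the title text and resolves each img src against the page URL, which you would need before downloading anything.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleAndImages(HTMLParser):
    """Collects the <title> text and the src of every <img> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_title = False
        self.title_parts = []
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.image_urls.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)

    @property
    def title(self):
        return "".join(self.title_parts).strip()


# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """<html><head><title> Example page </title></head>
<body><img src="/img/photo.jpg"><img src="http://cdn.example.com/corner.gif">
<img alt="no source"></body></html>"""

parser = TitleAndImages("http://example.com/articles/1")
parser.feed(SAMPLE_HTML)
print(parser.title)
print(parser.image_urls)
```

Images without a src attribute are skipped, and the resulting absolute URLs are what you would hand to your download code.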

The title is pretty easy, but the images are a bit more difficult, since you have to download them to get stats on them. You could probably filter out the vast majority of irrelevant images based on pixel dimensions and number of colors. Decorative graphics such as rounded corners tend to be small and use only one or two colors.
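The size-and-color filter could look roughly like the sketch below. It assumes you have already downloaded each image and measured its width, height, and number of distinct colors (for example with an imaging library); the ImageStats class, the function names, and both thresholds are hypothetical and would need tuning against real pages.

```python
from dataclasses import dataclass


@dataclass
class ImageStats:
    url: str
    width: int       # pixels
    height: int      # pixels
    n_colors: int    # number of distinct colors in the image


# Hypothetical thresholds; tune them against real pages.
MIN_SIDE = 50
MIN_COLORS = 16


def looks_relevant(img, min_side=MIN_SIDE, min_colors=MIN_COLORS):
    """Return True if the image is large and colorful enough to be content."""
    big_enough = img.width >= min_side and img.height >= min_side
    colorful = img.n_colors >= min_colors
    return big_enough and colorful


def pick_relevant(images):
    """Keep likely content images, largest area first."""
    kept = [img for img in images if looks_relevant(img)]
    return sorted(kept, key=lambda img: img.width * img.height, reverse=True)


# Made-up measurements for illustration only.
candidates = [
    ImageStats("http://example.com/corner.gif", 8, 8, 2),
    ImageStats("http://example.com/photo.jpg", 640, 480, 50000),
    ImageStats("http://example.com/spacer.gif", 600, 1, 1),
    ImageStats("http://example.com/thumb.jpg", 120, 90, 9000),
]
relevant = pick_relevant(candidates)
print([img.url for img in relevant])
```

Requiring both sides to meet the minimum also drops thin spacer and divider images, and sorting by area puts the most prominent candidate first.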

The page summary may be a bit more difficult, but here is what I've been doing. I use BeautifulSoup to remove all style, script, form, and head blocks from the HTML, first finding them with .findAll and then calling .extract on each. Finally, I grab the text that is left with:

' '.join(soup.findAll(text = True))

I presume you could use this "text" content as part of the page summary.
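For reference, the same strip-then-join idea can be sketched with only the standard library. This uses html.parser rather than BeautifulSoup, so it is an analogue of the approach above and not the same code; VisibleText, summarize, the 200-character default, and the sample markup are assumptions made for illustration. It assumes the skipped tags are properly closed.

```python
from html.parser import HTMLParser

# Blocks whose contents should not end up in the summary.
SKIPPED_TAGS = {"style", "script", "form", "head"}


class VisibleText(HTMLParser):
    """Collects text that is not inside any of SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped block
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def summarize(html, max_chars=200):
    """Join the remaining text, collapse whitespace, and cut at a word boundary."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rsplit(" ", 1)[0] + " ..."


# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """<html><head><title>Ignored title</title><style>p {color: red}</style></head>
<body><script>var x = "<b>not text</b>";</script>
<p>First   paragraph.</p><form><input name="q">Search</form>
<p>Second paragraph.</p></body></html>"""

print(summarize(SAMPLE_HTML))
print(summarize(SAMPLE_HTML, max_chars=20))
```

Truncating at a fixed length is the crudest possible summary; taking the first paragraph or two of the cleaned text is another option.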

I hope this helps.

orangeoctopus 2010-07-21 11:57:22

BeautifulSoup is not well supported on Python 3.1, and its original author doesn't do much development on it anymore. You would probably be better off using lxml.html and/or html5lib (the latter is recommended by the BeautifulSoup author).

lunaryorn 2010-07-21 12:09:45

Good to know for future reference. Thanks!

orangeoctopus 2010-07-21 12:25:42
