Have you ever submitted a link in your Facebook status? When you do, Facebook does something very nice: it fetches a title, a summary, and a bunch of relevant images from that page, and you can choose one of them as a thumbnail.

I need something like that right now. Is there any open-source piece of code that does this? (It needs to be in Python because it's a Python app I'm working on.) Or maybe just a guide or a blog post about this? I would really like to learn from other people's experience about this.

Given the URL of a web page, I want to get:

  1. The title: Probably just the <title> tag but possibly the <h1>, not sure.
  2. A one-paragraph summary of the page.
  3. A bunch of relevant images that could be used as a thumbnail. (The tricky part is to filter out irrelevant images like banners or rounded corners.)

I may have to implement it myself, but I would at least want to know about how other people have been doing these kinds of tasks.

+2  A: 

BeautifulSoup is suited for most of this process.

You can initialize the soup object, then do something like this to rip out the tags you are interested in:

title = soup.findAll('title')
images = soup.findAll('img')

Then you could download each of the images from its URL (the src attribute of the img tag) using urllib2.
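
Here is a rough sketch of that extraction step. To keep it dependency-free it uses the standard library's html.parser instead of BeautifulSoup, so it is a stand-in for the soup calls above, not the same API. The names TitleAndImages and extract_title_and_images are made up for illustration, and the code assumes you have already fetched the page HTML as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleAndImages(HTMLParser):
    """Collects the <title> text and absolute URLs of all <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page URL.
                self.images.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title_and_images(html, base_url):
    """Hypothetical helper: returns (title, [image URLs]) for an HTML string."""
    parser = TitleAndImages(base_url)
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.images
```

The image URLs come back absolute, so they can be handed directly to whatever you use for downloading.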

The title is pretty easy, but the images could be a bit more difficult since you have to download them to get stats on them. I'm sure you could filter out the vast majority of images based on size and number of colors. Rounded corners are going to be small and only have 1-2 colors.
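
A minimal sketch of that kind of filter, using only the standard library. It reads the width and height from the header of PNG and GIF data, so only the first few bytes are needed for the size check. The thresholds (min_side, max_aspect, min_colors) are guesses to tune, not tested values, and the function names are hypothetical. Counting colors requires decoding the pixels with an imaging library, so here the count is simply an optional argument you would supply yourself.

```python
import struct


def image_size(data):
    """Return (width, height) for PNG or GIF bytes, or None if unrecognized."""
    # PNG: 8-byte signature, then the IHDR chunk with big-endian width/height.
    if len(data) >= 24 and data[:8] == b"\x89PNG\r\n\x1a\n" and data[12:16] == b"IHDR":
        return struct.unpack(">II", data[16:24])
    # GIF: 6-byte signature, then little-endian 16-bit width/height.
    if len(data) >= 10 and data[:6] in (b"GIF87a", b"GIF89a"):
        return struct.unpack("<HH", data[6:10])
    return None


def is_thumbnail_candidate(width, height, color_count=None,
                           min_side=50, max_aspect=3.0, min_colors=3):
    """Heuristic filter; all thresholds are assumptions."""
    # Tiny images are likely spacers, icons, or rounded corners.
    if width < min_side or height < min_side:
        return False
    # Very wide or very tall images are likely banners.
    if max(width, height) / min(width, height) > max_aspect:
        return False
    # Images with only one or two colors are likely decoration.
    if color_count is not None and color_count < min_colors:
        return False
    return True
```

For example, a 728x90 banner is rejected by the aspect-ratio check and an 8x8 corner image by the minimum-side check, while a 200x150 photo passes.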

As for the page summary, that may be a bit more difficult, but I've been doing something like this. I use BeautifulSoup to remove all style, script, form, and head blocks from the HTML, first finding them with .findAll and then calling .extract on each. Finally, I grab the text that is left by doing:

' '.join(soup.findAll(text = True))

I presume you could use this "text" content as part of the page summary.
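
As a sketch of how the remaining text could become a one-paragraph summary: the code below skips everything inside style, script, form, and head, joins the text that is left, and cuts it at a word boundary. It uses the standard library's html.parser in place of BeautifulSoup, and the names VisibleText, visible_text, and summarize, as well as the 200-character default, are assumptions for illustration.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"style", "script", "form", "head"}


class VisibleText(HTMLParser):
    """Collects text that is not inside any of SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


def summarize(html, max_chars=200):
    """Hypothetical helper: the leading visible text, cut at a word boundary."""
    text = visible_text(html)
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rsplit(" ", 1)[0] + "..."
```

Taking the leading text is crude, since navigation menus often come first in the markup, so you may want to prefer the longest paragraph or a meta description when the page has one.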

I hope this helps.

orangeoctopus
BeautifulSoup is not well supported on Python 3.1, and its original author doesn't do much development on it anymore. You would probably be better off using lxml.html and/or html5lib (the latter is recommended by the BeautifulSoup author).
lunaryorn
Good to know for future reference. Thanks!
orangeoctopus