views: 109

answers: 1

I posted a URL to a blog post (http://www.autoblog.com/2009/06/22/we-are-all-bumblebee-beijing-transformers-fans-gather-to-celebr/) in a Facebook message, and Facebook inlined the title and abbreviated text as if it had fetched them from the RSS feed (http://www.autoblog.com/rss.xml). But by the time I submitted the link, the blog post had already expired out of the feed - I checked.

see this screenshot: http://i43.tinypic.com/nwbu4m.jpg

Is it using a FeedBurner search? How can something similar be accomplished?

cheers

+3  A: 

I think they do some advanced scraping, looking for the most significant blocks of data in the HTML and using those. Basically, they analyze everything quickly, toss out ads and the like, and keep the big blobs of data.

Digg does similar things as well.

Here is how I would implement it:

  1. Scan for meta tags, rss feed tags, and the title tag.
  2. Find large "areas" with a lot of content, including p tags. Weight or grade them on the likelihood of their being content. Look for keywords in CSS classes/ids (e.g. rate "content" higher than "ads" or "navigation").
  3. Look for large images
  4. Store information about the site for future use and improved heuristics

All of this is likely done server-side and served to the browser via AJAX.
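The steps above could be sketched roughly like this, using only Python's standard library. The keyword lists, scoring weights, and class names are illustrative assumptions, not Facebook's actual heuristics:

```python
from html.parser import HTMLParser

# Hypothetical keywords that make a block look like content vs. chrome/ads.
BOOST = ("content", "article", "post", "story")
PENALTY = ("ad", "ads", "nav", "navigation", "sidebar", "footer")

class BlockScorer(HTMLParser):
    """Collects the text inside each <div>/<p> and grades the blocks."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False
        self._stack = []   # open blocks: (class/id hint, [text chunks])
        self.blocks = []   # closed blocks: (score, text)

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("div", "p"):
            hint = " ".join((v or "") for k, v in attrs if k in ("class", "id"))
            self._stack.append((hint.lower(), []))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("div", "p") and self._stack:
            hint, chunks = self._stack.pop()
            text = " ".join(c for c in chunks if c)
            score = len(text)              # bigger blobs of text score higher
            if any(word in hint for word in BOOST):
                score *= 2                 # e.g. class="content" -> promote
            if any(word in hint for word in PENALTY):
                score //= 10               # e.g. class="ads" -> demote
            if text:
                self.blocks.append((score, text))

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        for _, chunks in self._stack:      # text counts toward every open block
            chunks.append(data.strip())

def preview(html):
    """Return (title, abbreviated text of the best-scoring block)."""
    scorer = BlockScorer()
    scorer.feed(html)
    best = max(scorer.blocks, default=(0, ""))
    return scorer.title.strip(), best[1][:160]
```

Feeding this a page with an ads div and a `class="content"` div returns the title tag plus the first 160 characters of the content block, which is roughly the title-plus-snippet shape Facebook shows.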

Daniel A. White
I think you're right, it's definitely served to the browser via Ajax (confirmed using Firebug). Certainly the server-side stuff is pretty complicated. For some pages that don't have big "blobs" of textual data, their algorithm seems to fall back to some simpler things, like <meta> tags. For example, for this link http://www.theweathernetwork.com/weather/caon0493 the <meta name="description"> is used.
Peter
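A minimal sketch of that kind of <meta> fallback, using only Python's standard library. The regex is illustrative (it assumes the common `name`-before-`content` attribute order) and is not how Facebook actually parses pages:

```python
import re

def meta_description(html):
    """Pull the content attribute from <meta name="description" ...>,
    or return None if the page has no such tag. Assumes name appears
    before content, which covers most real pages."""
    match = re.search(
        r'<meta\s[^>]*name=["\']description["\'][^>]*'
        r'content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    return match.group(1) if match else None
```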
thanks for the suggestion. I was hoping screen scraping could be avoided - dang, that's not a fun thing to implement scalably.
john
I actually know someone that was working on something like this using part of WebKit.
Daniel A. White
