Suppose I've downloaded a page's HTML and can parse it. How do I get the "best" description of that website if it doesn't have a meta description tag?

+3  A: 

You could take the first few sentences returned by something like Readability.

Safari 5 uses it, so it must be alright :)
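
Once you have the main text out of Readability (or whatever extractor you end up using), trimming it to the first few sentences could look roughly like this. It's only a sketch: `first_sentences` is a made-up helper name, and the sentence splitting is a naive regex that will trip over abbreviations like "e.g.".

```python
import re


def first_sentences(text, count=2):
    """Return the first `count` sentences of `text` (hypothetical helper)."""
    # Collapse runs of whitespace and newlines into single spaces.
    text = " ".join(text.split())
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(sentences[:count])


if __name__ == "__main__":
    article = "Acme makes widgets. They are cheap! Buy now. Really."
    print(first_sentences(article, 2))
```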

alex
+1 for Readability. Neat tool.
Nick Presta
+1 for Readability :)
Kit
+1  A: 

It's very hard to come up with a rule that works 100% of the time, obviously, but as a starting point I'd look for the first <h1> tag (or <h2>, <h3>, etc., whichever is the highest level you can find); the text that follows it can then be used as the description. As long as the site is semantically marked up, that should give you a good description. (I guess you could also take the contents of the <h1> itself, but that's more like the "title".)
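
A rough sketch of that heuristic using only Python's standard library, in case it helps. The names `HeadingScanner` and `describe_by_heading` are made up for illustration, and it assumes the "bit of text after" the heading lives in a <p> tag, which won't hold for every site.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")
SKIP = ("script", "style")


class HeadingScanner(HTMLParser):
    """Collects (tag, text) pairs for headings and paragraphs in document order."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished (tag, text) pairs
        self._tag = None       # heading or paragraph currently being read
        self._parts = []       # text fragments of the current block
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in HEADINGS or tag == "p":
            self.finish_block()  # an unclosed <p> ends where the next block starts
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == self._tag:
            self.finish_block()

    def handle_data(self, data):
        if self._tag and not self._skip_depth:
            self._parts.append(data)

    def finish_block(self):
        # Normalise whitespace and store the block if it has any text.
        text = " ".join("".join(self._parts).split())
        if self._tag and text:
            self.blocks.append((self._tag, text))
        self._tag, self._parts = None, []


def describe_by_heading(html, max_chars=200):
    """Return the first paragraph after the highest-level heading, or None."""
    scanner = HeadingScanner()
    scanner.feed(html)
    scanner.close()
    scanner.finish_block()
    blocks = scanner.blocks
    levels = [tag for tag, _ in blocks if tag in HEADINGS]
    if not levels:
        return None
    top = min(levels)  # "h1" sorts before "h2", and so on
    start = next(i for i, (tag, _) in enumerate(blocks) if tag == top)
    for tag, text in blocks[start + 1:]:
        if tag == "p":
            return text[:max_chars]  # plain cut; may end mid-word
    return None


if __name__ == "__main__":
    page = ("<html><body><h2>Menu</h2><p>Nav stuff</p>"
            "<h1>Acme Widgets</h1><p>We build   widgets for everyone.</p>"
            "</body></html>")
    print(describe_by_heading(page))
```

When it returns None (no headings, or no paragraph after the top heading), you'd have to fall back to some other rule.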

It's interesting to note that Google (for example) uses a keyword-specific extract of the page contents to display as the description, rather than a static description. Not sure if that'll work for your situation, though.
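
For what it's worth, a keyword-specific extract can be approximated with a simple text window around the first matching search term. This is not how Google actually does it, just a minimal sketch; `keyword_snippet` and its `width` parameter are names I made up, and it assumes you've already stripped the HTML down to plain text.

```python
import re


def keyword_snippet(text, query, width=160):
    """Return about `width` characters of `text` around the earliest query term."""
    text = " ".join(text.split())
    terms = [re.escape(term) for term in query.split()]
    if not terms:
        return text[:width]
    # Case-insensitive search for whichever term appears first in the text.
    match = re.search("|".join(terms), text, flags=re.IGNORECASE)
    if match is None:
        return text[:width]  # no term found: fall back to the start of the text
    start = max(0, match.start() - width // 2)
    end = min(len(text), start + width)
    snippet = text[start:end]
    # Mark the sides where text was cut off.
    if start > 0:
        snippet = "..." + snippet
    if end < len(text):
        snippet = snippet + "..."
    return snippet


if __name__ == "__main__":
    body = ("Welcome to our site. " * 5
            + "We sell handmade oak furniture in Oslo. "
            + "Contact us today. " * 5)
    print(keyword_snippet(body, "oak furniture", width=60))
```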

Dean Harding
+1  A: 

To follow up on the "Readability" suggestion above (which itself is inspired by the website Instapaper), they have released the JavaScript: http://code.google.com/p/arc90labs-readability/. What's more, some guy took that and ported it to Python: http://github.com/gfxmonk/python-readability. Rejoice!

loevborg