Suppose I downloaded the HTML code, and I can parse it. How do I get the "best" description of that website, if that website does not have meta-description tag?
You could get the first few sentence returned from something like Readability.
Safari 5 uses it, so it must be alright :)
It's very hard to come up with a rule that works 100% of the time, obviously, but my suggestion as a starting point would be to look for the first <h1>
tag (or <h2>
, <h3>
, etc - the highest one you can find) then the bit of text after that can be used as the description. As long as the site is semantically marked-up, that should give you a good description (I guess you could also take the contents of the <h1>
itself, but that's more like the "title").
It's interesting to note that Google (for example) uses a keyword-specific extract of the page contents to display as the description, rather than a static description. Not sure if that'll work for your situation, though.
To follow up on the "Readability" suggestion above (which itself is inspired by the website InstaPaper), they have release the JavaScript: http://code.google.com/p/arc90labs-readability/. What's more, some guy took that and ported it to python: http://github.com/gfxmonk/python-readability. Rejoice!