Hello all.

I have a list of URLs and am trying to collect their "descriptions." By description I mean the text that comes up if you Google the link. For example, Googling http://stackoverflow.com shows the description as

A language-independent collaboratively edited question and answer site for programmers. Questions and answers displayed by user votes and tags.

This is the data I'm trying to accumulate for the URLs I have.

I tried parsing the pages' meta descriptions, but most of them lack one (yet Google and other search engines manage to get a description somehow).

Any ideas? Should I just "google" each link and scrape the data? I have a feeling Google wouldn't like this...

Thanks guys.

+1  A: 

Different search engines have different algorithms for extracting a description from the page when the description meta tag is missing. Some ignore the tag even if it's there.

If you want the description Google has, the most accurate way to get it would be to scrape it. Otherwise, you could write your own or look around on the web for code that does it.

JR Lawhorne
A: 

You may want to check AboutUs.org (e.g. http://www.aboutus.org/StackOverflow.com). However, there's little chance that a site will have an AboutUs page but no meta description.

dylanfm
A: 

Some info that might explain how Google does this:

toolkit
A: 

I am not familiar with Google APIs, but perhaps there is an official way to get such information.

PhiLho
I posted a message in their Group but haven't received a response.
A: 

Interesting. Some sources are better than others.

For "audiotuts.com", Google has a worse description than AboutUs.com.

Google:

Nov 18th in General by Joel Falconer · 1. Recently, an AUDIOTUTS reader asked me about creative process. While this is a topic that can’t be made into a ...

AboutUs.com:

AUDIOTUTS is a blog/tutorial site for musicians, producers and audio junkies! It is the sister site of the popular PSDTUTS, VECTORTUTS and NETTUTS.

I hate problems like these... they should be trivial but they aren't!

A: 

If you can assume English content, you can first look for Meta Description, and if that doesn't work, you can look for the first two or three sentence-like word sequences.

A product I worked on looked for the first P or DIV that contained more than one sequence of > n "words" delimited by periods. It would use the two or three sentence-like sequences, up to x total words, as a summary paragraph. It wasn't 100% accurate, but good enough for the average case. The number of words was adjusted a few times to eliminate things like navigation elements.
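The heuristic above can be sketched roughly as follows. This is a minimal illustration, not the product's actual code: the names `describe`, `sentence_like`, `min_words`, `max_sentences` and `max_words` are made up for the example, the thresholds are arbitrary, and it assumes reasonably well-formed HTML with closed P and DIV tags.

```python
import re
from html.parser import HTMLParser


class _BlockCollector(HTMLParser):
    """Collects the meta description and the text of each <p>/<div> block."""

    def __init__(self):
        super().__init__()
        self.meta_description = None
        self.blocks = []   # text of finished blocks, in closing order
        self._stack = []   # one text buffer per currently open <p>/<div>
        self._skip = 0     # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            content = (attrs.get("content") or "").strip()
            if content and self.meta_description is None:
                self.meta_description = content
        elif tag in ("script", "style"):
            self._skip += 1
        elif tag in ("p", "div"):
            self._stack.append([])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag in ("p", "div") and self._stack:
            # Collapse whitespace; text goes to the innermost open block only.
            text = " ".join("".join(self._stack.pop()).split())
            if text:
                self.blocks.append(text)

    def handle_data(self, data):
        if self._stack and not self._skip:
            self._stack[-1].append(data)


def sentence_like(text, min_words):
    """Return period-terminated chunks containing more than min_words words."""
    parts = re.split(r"(?<=\.)\s+", text)
    return [p for p in parts if p.endswith(".") and len(p.split()) > min_words]


def describe(html, min_words=4, max_sentences=3, max_words=50):
    """Meta description if present, else the first sentence-like block."""
    parser = _BlockCollector()
    parser.feed(html)
    parser.close()
    if parser.meta_description:
        return parser.meta_description
    for block in parser.blocks:
        sentences = sentence_like(block, min_words)
        if len(sentences) > 1:  # needs more than one sentence-like sequence
            words = " ".join(sentences[:max_sentences]).split()
            return " ".join(words[:max_words])
    return None
```

Raising `min_words` is the knob that filters out navigation bars and other short link text, at the cost of skipping pages whose first real paragraph is made of short sentences.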

JasonTrue
+1  A: 

These are called snippets.

Google use proprietary (and possibly patented) methods to garner this information, so there is no simple answer.

As you suggest, they will use meta-description information if it is there. (How to set the meta-information to help Google.)

They will also honour requests from the page authors to NOT include snippets. (How to prevent Google from displaying snippets) You should probably respect this too (as well as robots.txt, of course.)
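If you write your own scraper, the two checks mentioned above might look something like this. It is only a sketch using Python's standard library: `allows_snippet`, `allowed_by_robots_txt` and the `example-bot` user agent are hypothetical names for the example, and it only looks at the generic `robots` meta tag (not crawler-specific meta tags or HTTP headers).

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _RobotsMetaParser(HTMLParser):
    """Collects the directives listed in <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            for directive in (attrs.get("content") or "").split(","):
                self.directives.add(directive.strip().lower())


def allows_snippet(html):
    """False if the page author asked for no snippet via the robots meta tag."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return "nosnippet" not in parser.directives


def allowed_by_robots_txt(robots_txt, url, user_agent="example-bot"):
    """Check a URL against robots.txt content that was already downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)
```

Both functions take text you have already fetched, so they can be run before any description is extracted or stored.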

You may have some luck with existing auto-summary packages, such as OTS.

Oddthinking