views:

50

answers:

1

Dear all, I am now using a web tool

http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=

to parse a webpage.

For example, to parse the New York Times homepage, we enter:

http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=http%3A//www.nytimes.com/pages/world/index.html

in the address bar of our browser, and it parses everything nicely for us.

However, it just fails for Google pages. For example, if I want to parse the Google News homepage:

http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=http%3A//news.google.com/nwshp?hl=en&tab=wn

I always get a 500 Internal Server Error.

I am sure this is something to do with the Google website. I think we probably need some API for Google. Does anyone have any idea how to sort this out for Google pages? Many thanks.
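For reference, here is how the percent-encoding of the target URL can be produced before it is appended as the `url=` query parameter. This is only a sketch using Python's standard `urllib.parse` (the scrape tool itself is unrelated to Python); note that it fully encodes every reserved character, including the `&`, whereas the examples above only encode the `:`.

```python
from urllib.parse import quote

# Fully percent-encode the target URL so that its own query string
# (hl=en&tab=wn) is not mistaken for parameters of the scrape endpoint.
target = "http://news.google.com/nwshp?hl=en&tab=wn"
encoded = quote(target, safe="")  # ":" -> %3A, "/" -> %2F, "&" -> %26

scrape_url = "http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=" + encoded
print(scrape_url)
```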

+2  A: 

Per the google.com robots.txt file, you are explicitly requested not to scrape their content. Google does not provide an API for machine-readable search results; they want to control the presentation of their content via widgets and embedding strategies.

Jonathan Feinberg
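You can check what a site's robots.txt permits programmatically before trying to scrape it. A minimal sketch using Python's standard `urllib.robotparser` (the rules below are illustrative placeholders, not Google's actual file):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules -- not Google's actual file.
rules = [
    "User-agent: *",
    "Disallow: /search",
    "Allow: /news",
]

rp = RobotFileParser()
rp.parse(rules)

# A path under a Disallow rule is off-limits to scrapers...
print(rp.can_fetch("*", "http://www.google.com/search?q=test"))  # False
# ...while an explicitly allowed path may be fetched.
print(rp.can_fetch("*", "http://www.google.com/news"))           # True
```

In practice you would point `RobotFileParser.set_url()` at the site's real `/robots.txt` and call `read()` instead of `parse()`.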
Thanks, Jonathan, that helps. How about Yahoo! or Bing?
Robert
Actually, Robert should read the robots.txt file. Some parts of Google -are- explicitly allowed for scraping.
Chip Uni
Not the search results, no.
Jonathan Feinberg