views:

50

answers:

1

Dear all, I am now using a web tool

http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=

to parse a webpage.

For example, to parse the New York Times homepage, we enter:

http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=http%3A//www.nytimes.com/pages/world/index.html

in the address bar of our browser, and it parses everything nicely for us.

However, it just fails for Google pages. For example, if I want to parse the Google News homepage:

http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=http%3A//news.google.com/nwshp?hl=en&tab=wn

I always get a 500 Internal Server Error.

I am sure this is something to do with the Google website. I think we probably need some API for Google. Does anyone have any idea how to sort this out for Google pages? Many thanks.
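For reference, here is how the percent-encoding of the target URL can be produced before it is appended as the `url=` query parameter. This is only a sketch using Python's standard `urllib.parse` (the scrape tool itself is unrelated to Python); note that it fully encodes every reserved character, including the `&`, whereas the examples above only encode the `:`.

```python
from urllib.parse import quote

# Fully percent-encode the target URL so that its own query string
# (hl=en&tab=wn) is not mistaken for parameters of the scrape endpoint.
target = "http://news.google.com/nwshp?hl=en&tab=wn"
encoded = quote(target, safe="")  # ":" -> %3A, "/" -> %2F, "&" -> %26

scrape_url = "http://fiddesktop.cs.northwestern.edu/mmp/scrape?url=" + encoded
print(scrape_url)
```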

+2  A: 

Per the google.com robots.txt file, you are explicitly requested not to scrape their content. Google does not provide an API for machine-readable search results; they want to control the presentation of their content via widgets and embedding strategies.

Jonathan Feinberg
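You can check what a site's robots.txt permits programmatically before trying to scrape it. A minimal sketch using Python's standard `urllib.robotparser` (the rules below are illustrative placeholders, not Google's actual file):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules -- not Google's actual file.
rules = [
    "User-agent: *",
    "Disallow: /search",
    "Allow: /news",
]

rp = RobotFileParser()
rp.parse(rules)

# A path under a Disallow rule is off-limits to scrapers...
print(rp.can_fetch("*", "http://www.google.com/search?q=test"))  # False
# ...while an explicitly allowed path may be fetched.
print(rp.can_fetch("*", "http://www.google.com/news"))           # True
```

In practice you would point `RobotFileParser.set_url()` at the site's real `/robots.txt` and call `read()` instead of `parse()`.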
Thanks, Jonathan, that helps. How about Yahoo! or Bing?
Robert
Actually, Robert should read the robots.txt file. Some parts of Google -are- explicitly allowed for scraping.
Chip Uni
Not the search results, no.
Jonathan Feinberg