views: 1327 · answers: 4

Hey guys,

Consider the following Python code:

    import urllib.request

    url = "http://www.google.com/search?hl=en&safe=off&q=Monkey"
    url_object = urllib.request.urlopen(url)
    print(url_object.read())

When this is run, an Exception is thrown:

File "/usr/local/lib/python3.0/urllib/request.py", line 485, in http_error_default raise HTTPError(req.get_full_url(), code, msg, hdrs, fp) urllib.error.HTTPError: HTTP Error 403: Forbidden

However, when this is put into a browser, the search returns as expected. What's going on here? How can I overcome this so I can search Google programmatically?

Any thoughts? --Shafik

A: 

You're doing it too often. Google has limits in place to prevent getting swamped by search bots. You can also try setting the user-agent to something that more closely resembles a normal browser.

Joel Coehoorn
I have only tried twice today.
AgentLiquid
Wrong answer. It blocks on the first attempt.
nosklo
That's right, the user-agent makes all the difference.
Evgeny
+14  A: 

If you want to do Google searches "properly" through a programming interface, take a look at Google APIs. Not only are these the official way of searching Google, they are also not likely to change if Google changes their result page layout.
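
For example, a minimal sketch of querying the AJAX Search API over plain HTTP (assuming the ajax.googleapis.com web search endpoint and its v=1.0/q parameters; the exact shape of the JSON response may differ):

    import json
    import urllib.parse
    import urllib.request

    # Build the query string for the (assumed) web search endpoint
    query = urllib.parse.urlencode({'v': '1.0', 'q': 'Monkey'})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?' + query

    # The API returns JSON rather than an HTML result page
    response = urllib.request.urlopen(url)
    results = json.loads(response.read().decode('utf-8'))
    print(results)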

lacqui
Do you have any idea what's going on under the hood, though? I'm curious ... why doesn't url.read() look like a standard browser read?
AgentLiquid
what sort of moron would vote this post "offensive"?
Paul Tomblin
Instead of going through the web interface, these APIs directly access the search XML. They connect to a different page at Google, which gives you data in a different format. Basically, you were getting 403 because you weren't allowed to access the data the way you were, and Google knew it (...)
lacqui
(...) because your app either (a) didn't send a User-Agent string or (b) sent a default one that Google recognized as a robot (see http://google.com/robots.txt)
lacqui
Awesome explanation, thank you.
AgentLiquid
The problem with their APIs is that they don't return the same results as google.com. See http://code.google.com/p/google-ajax-apis/issues/detail?id=43
Anders Rune Jensen
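
To see the robots.txt point from the comments above in practice, the standard library can check whether a given user-agent is allowed to fetch /search (a quick sketch using urllib.robotparser):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url('http://www.google.com/robots.txt')
    rp.read()

    # Google's robots.txt disallows /search for generic crawlers,
    # so this will typically print False
    print(rp.can_fetch('Python-urllib/3.0', 'http://www.google.com/search?q=Monkey'))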
+4  A: 

This should do the trick:

    import urllib2

    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    url = "http://www.google.com/search?hl=en&safe=off&q=Monkey"
    headers = {'User-Agent': user_agent}

    request = urllib2.Request(url, None, headers)  # the assembled request
    response = urllib2.urlopen(request)
    data = response.read()  # the data you need
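
In Python 3, where urllib2 has become urllib.request (as in the question's code), the same idea looks roughly like this (a sketch; only the module name changes):

    import urllib.request

    user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
    url = "http://www.google.com/search?hl=en&safe=off&q=Monkey"
    headers = {'User-Agent': user_agent}

    # Build the request with a browser-like User-Agent so Google serves the page
    request = urllib.request.Request(url, None, headers)
    response = urllib.request.urlopen(request)
    data = response.read()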
Could you please format your code? (Just select it and press ctrl-k.)
Stephan202