views: 54

answers: 1
Hello, I'm trying to query blackle.com for searches, but I get a 403 HTTP error. Can somebody point out what is wrong here?

#!/usr/bin/env python

import urllib
import urllib2

ss = raw_input('Please enter search string: ')
# URL-escape the search string so spaces and special characters survive
url = ("http://www.google.com/cse?cx=013269018370076798483:gg7jrrhpsy4"
       "&cof=FORID:1&q=" + urllib.quote_plus(ss) + "&sa=Search")
response = urllib2.urlopen(url)   # this call is what raises the 403
html = response.read()

print html
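For reference, wrapping the call shows the status code explicitly (url is the same string as above):

try:
    html = urllib2.urlopen(url).read()
except urllib2.HTTPError, e:
    print 'HTTP error:', e.code    # prints: HTTP error: 403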
+2  A: 

HTTP 403 means "forbidden": google.com doesn't want to let you access that resource. Since it does let browsers access it, it's presumably identifying you as a robot (automated code rather than an interactive browser user), through User-Agent checking and the like. Have you checked robots.txt to see whether you SHOULD be allowed to access such URLs? In http://www.google.com/robots.txt I see the line:

Disallow: /cse?

which means robots are NOT allowed there. The robots.txt standard documents the format, and the standard Python library module robotparser makes it easy for a Python program to understand a robots.txt file.
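For example, here's a minimal sketch (Python 2, standard robotparser module) that checks the question's URL against Google's robots.txt before trying to fetch it:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.google.com/robots.txt")
rp.read()    # download and parse robots.txt

url = "http://www.google.com/cse?cx=013269018370076798483:gg7jrrhpsy4&q=test"
# can_fetch(useragent, url) applies the Disallow/Allow rules for that agent
print rp.can_fetch("*", url)    # False, because of the "Disallow: /cse?" line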

You could try to fool Google's detection of "robots" vs humans, e.g. by faking your User-Agent header and so on, and maybe you'd get away with it for a while, but do you really want to deliberately violate the terms of use and get into a fight about it with Google...?
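For completeness, faking the header with urllib2 just means building the Request yourself; this is only a sketch, and the User-Agent value here is an arbitrary browser-like string, nothing special:

import urllib2

url = "http://www.google.com/cse?cx=013269018370076798483:gg7jrrhpsy4&q=test"
# Supply a browser-like User-Agent instead of urllib2's default "Python-urllib/x.y"
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()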

Alex Martelli