views: 57
answers: 3

I am working on a project that requires me to collect a large list of URLs to websites about certain topics. I would like to write a script that will use Google to search for specific terms, then save the URLs from the results to a file. How would I go about doing this? I have used a module called xgoogle, but it always returned no results.

I am using Python 2.6 on Windows 7.

+1  A: 

Google has an API for this. I'd recommend you use that: http://code.google.com/apis/ajaxsearch/

It's a RESTful API, which means it's easy to grab results via Python/JS. You're limited to 32 results, I think, but that should be enough. It'll return a nicely structured object that you can work with without having to do any HTML parsing.
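For example, here is a minimal sketch of querying that API from Python 2.6 with just the standard library. The endpoint, parameters, and response fields below follow the AJAX Search API documentation, so treat them as assumptions to check against the docs; the query string and output file name are placeholders:

    import urllib
    import urllib2
    import json

    def google_search(query, start=0):
        # 'rsz=large' asks for 8 results per page; 'start' pages through them
        params = urllib.urlencode({'v': '1.0', 'q': query, 'rsz': 'large', 'start': start})
        response = urllib2.urlopen('http://ajax.googleapis.com/ajax/services/search/web?' + params)
        data = json.load(response)
        # each result dict is expected to carry the page address under 'url'
        return [result['url'] for result in data['responseData']['results']]

    if __name__ == '__main__':
        urls = []
        for start in range(0, 32, 8):   # the API caps out at roughly 32 results
            urls.extend(google_search('your search terms here', start=start))
        open('urls.txt', 'w').write('\n'.join(urls))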

If you wanted to 'crawl', you could then use urllib to grab each of those URLs, fetch their contents, extract the URLs they refer to, and so on.

Oren Mazor
How would I use urllib to do that? That is really what I want to do: crawl each page I find and follow its links, storing each link I find before crawling it. I have looked into Google's API, but it is no longer supported.
Zachary Brown
Well, the basic way would be to grab the page contents and then use a regular expression to find all of the links, but that gets messy fast. Instead, check out Beautiful Soup; it's good for processing HTML (see the sketch below).
Oren Mazor
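Following up on the Beautiful Soup suggestion, here is a minimal sketch of a crawler built on urllib2 and BeautifulSoup 3 (the version that works with Python 2.6). The seed URL, page limit, and output file name are placeholders you would replace:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    def extract_links(url):
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)
        # href=True skips anchor tags that have no href attribute
        return [tag['href'] for tag in soup.findAll('a', href=True)]

    seen = set()
    to_visit = ['http://example.com/']        # placeholder seed URL
    while to_visit and len(seen) < 100:       # arbitrary page limit for the sketch
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            for link in extract_links(url):
                if link.startswith('http'):   # ignore relative and javascript: links
                    to_visit.append(link)
        except Exception:
            pass                              # skip pages that fail to download or parse

    open('crawled_urls.txt', 'w').write('\n'.join(seen))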
A: 

Make sure that you change the User-Agent of urllib2; the default one tends to get blocked by Google. Also make sure that you obey the terms of use of the search engine you're scripting against.
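A minimal sketch of what that looks like; the URL and the User-Agent string here are only examples, so use something that honestly identifies your own script:

    import urllib2

    request = urllib2.Request(
        'http://www.example.com/',
        headers={'User-Agent': 'Mozilla/5.0 (compatible; MyLinkCollector/0.1)'}
    )
    html = urllib2.urlopen(request).read()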

Tim McNamara
+1  A: 

Maybe this will help: http://developer.yahoo.com/search/boss/ :)

They even have a Python interface: http://developer.yahoo.com/search/boss/mashup.html
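If you'd rather skip the mashup framework, here is a minimal sketch of calling BOSS over its plain REST interface from Python 2.6. The endpoint and JSON field names follow the BOSS v1 documentation as I recall it, so verify them against the current docs; YOUR_APP_ID is a placeholder for the application ID Yahoo issues you:

    import urllib
    import urllib2
    import json

    APP_ID = 'YOUR_APP_ID'   # placeholder -- register with Yahoo to get one

    def boss_search(query, count=10):
        url = 'http://boss.yahooapis.com/ysearch/web/v1/%s?%s' % (
            urllib.quote(query),
            urllib.urlencode({'appid': APP_ID, 'format': 'json', 'count': count}))
        data = json.load(urllib2.urlopen(url))
        # each entry in resultset_web should carry the result's address under 'url'
        return [hit['url'] for hit in data['ysearchresponse']['resultset_web']]

    print '\n'.join(boss_search('your search terms here'))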

bronzebeard
+1 for BOSS. Looks interesting.
Mark