views: 57
answers: 3

I am working on a project that requires me to collect a large list of URLs to websites about certain topics. I would like to write a script that will use Google to search for specific terms, then save the URLs from the results to a file. How would I go about doing this? I have used a module called xgoogle, but it always returned no results.

I am using Python 2.6 on Windows 7.

+1  A: 

Google has an API for this. I'd recommend you use that: http://code.google.com/apis/ajaxsearch/

It's a RESTful API, which means it's easy to grab results via Python/JS. You're limited to 32 results, I think, but that should be enough. It'll return a nicely structured object that you can work with without having to do any HTML parsing.
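For example, here is a minimal sketch of querying that API from Python 2.6 with just the standard library. The endpoint, parameters, and response fields below follow the AJAX Search API documentation, so treat them as assumptions to check against the docs; the query string and output file name are placeholders:

    import urllib
    import urllib2
    import json

    def google_search(query, start=0):
        # 'rsz=large' asks for 8 results per page; 'start' pages through them
        params = urllib.urlencode({'v': '1.0', 'q': query, 'rsz': 'large', 'start': start})
        response = urllib2.urlopen('http://ajax.googleapis.com/ajax/services/search/web?' + params)
        data = json.load(response)
        # each result dict is expected to carry the page address under 'url'
        return [result['url'] for result in data['responseData']['results']]

    if __name__ == '__main__':
        urls = []
        for start in range(0, 32, 8):   # the API caps out at roughly 32 results
            urls.extend(google_search('your search terms here', start=start))
        open('urls.txt', 'w').write('\n'.join(urls))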

If you wanted to 'crawl', you could then use urllib to grab each of those URLs, fetch their contents, extract the URLs they refer to, and so on.

Oren Mazor
How would I use urllib to do that? That is really what I want to do: crawl each page I find and follow its links, storing each link I find before crawling it. I have looked into Google's API, but it is no longer supported.
Zachary Brown
Well, the basic way would be to grab the page contents and then use a regular expression to find all of the links, but that gets messy fast. Instead, check out Beautiful Soup; it's good for processing HTML (see the sketch below).
Oren Mazor
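Following up on the Beautiful Soup suggestion, here is a minimal sketch of a crawler built on urllib2 and BeautifulSoup 3 (the version that works with Python 2.6). The seed URL, page limit, and output file name are placeholders you would replace:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    def extract_links(url):
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)
        # href=True skips anchor tags that have no href attribute
        return [tag['href'] for tag in soup.findAll('a', href=True)]

    seen = set()
    to_visit = ['http://example.com/']        # placeholder seed URL
    while to_visit and len(seen) < 100:       # arbitrary page limit for the sketch
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            for link in extract_links(url):
                if link.startswith('http'):   # ignore relative and javascript: links
                    to_visit.append(link)
        except Exception:
            pass                              # skip pages that fail to download or parse

    open('crawled_urls.txt', 'w').write('\n'.join(seen))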
A: 

Make sure that you change the User-Agent of urllib2; the default one tends to get blocked by Google. Also make sure that you obey the terms of use of the search engine you're scripting against.
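A minimal sketch of what that looks like; the URL and the User-Agent string here are only examples, so use something that honestly identifies your own script:

    import urllib2

    request = urllib2.Request(
        'http://www.example.com/',
        headers={'User-Agent': 'Mozilla/5.0 (compatible; MyLinkCollector/0.1)'}
    )
    html = urllib2.urlopen(request).read()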

Tim McNamara
+1  A: 

Maybe this will help: http://developer.yahoo.com/search/boss/ :)

They even have a Python interface: http://developer.yahoo.com/search/boss/mashup.html
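If you'd rather skip the mashup framework, here is a minimal sketch of calling BOSS over its plain REST interface from Python 2.6. The endpoint and JSON field names follow the BOSS v1 documentation as I recall it, so verify them against the current docs; YOUR_APP_ID is a placeholder for the application ID Yahoo issues you:

    import urllib
    import urllib2
    import json

    APP_ID = 'YOUR_APP_ID'   # placeholder -- register with Yahoo to get one

    def boss_search(query, count=10):
        url = 'http://boss.yahooapis.com/ysearch/web/v1/%s?%s' % (
            urllib.quote(query),
            urllib.urlencode({'appid': APP_ID, 'format': 'json', 'count': count}))
        data = json.load(urllib2.urlopen(url))
        # each entry in resultset_web should carry the result's address under 'url'
        return [hit['url'] for hit in data['ysearchresponse']['resultset_web']]

    print '\n'.join(boss_search('your search terms here'))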

bronzebeard
+1 for BOSS. Looks interesting.
Mark