Hello, I'm writing a spider in Python to crawl a site. The trouble is, I need to examine about 2.5 million pages, so I could really use some help optimizing it for speed.

What I need to do is examine each page for a certain number and, if it is found, record the link to that page. The spider itself is very simple; it just needs to sort through a lot of pages.

I'm completely new to Python, but have used Java and C++ before. I have yet to start coding it, so any recommendations on libraries or frameworks to include would be great. Any optimization tips are also greatly appreciated.

Thanks for all the help :)

+4  A: 

You could use MapReduce like Google does, either via Hadoop (specifically with Python: 1 and 2), Disco, or Happy.

The traditional line of thought is to write your program in standard Python; if you find it is too slow, profile it and optimize the specific slow spots. You can make these slow spots faster by dropping down to C, using C/C++ extensions or even ctypes.
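If it helps, here is a minimal profiling sketch using the standard library's cProfile; `fetch_and_scan` and the URLs are placeholders for whatever your spider actually does:

```python
# Minimal profiling sketch; fetch_and_scan() and the URLs are placeholders.
import cProfile
import pstats

def fetch_and_scan(url):
    # Hypothetical: fetch the page and search it for the target number.
    pass

profiler = cProfile.Profile()
profiler.enable()
for url in ["http://example.com/page1", "http://example.com/page2"]:
    fetch_and_scan(url)
profiler.disable()
# Show the 10 biggest time sinks by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```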

If you are spidering just one site, consider using `wget -r` (an example).

John Paulett
Does wget support fetching only the HTML? I don't want to be more of a drain than I have to.
MMag
Yep, the `-A` flag on wget lets you specify the accepted extensions, and `-R` lets you specify the rejected extensions. So you could do something like `wget -R.gif,.jpg,.png -r example.com`
John Paulett
+1 for wget, why reinvent the wheel?
extraneon
Would wget with C++/C# be faster (if multi-threaded) than the equivalent in Python?
MMag
The slowest part of the program is going to be the network connections. Each connection is going to be at least 100ms, and switching from Python to C/C++ is only going to shave a few milliseconds. In the grand scheme of things, Python will be just as fast.
John Paulett
Thanks, I'm going to use wget with C#, as I already know C
MMag
+2  A: 

Spidering somebody's site with millions of requests isn't very polite. Can you instead ask the webmaster for an archive of the site? Once you have that, it's a simple matter of text searching.

Greg Hewgill
The website isn't in English, and I don't speak the language, so I can't ask. Also, it is a very high-traffic image site (another reason I can't really ask), and I only need the HTML of each page, so hopefully I won't be a drain on their servers. I only plan to spider each page once and never come back for updates.
MMag
+2  A: 

Where are you storing the results? You can use PiCloud's cloud library to parallelize your scraping easily across a cluster of servers.
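For illustration, a rough sketch of how that might look with PiCloud's `cloud` module; the `cloud.map`/`cloud.result` calls are assumptions based on PiCloud's documentation, and `scan_page`, the URLs, and the target number are placeholders (Python 2, which PiCloud targeted):

```python
# Rough, untested sketch: push scan_page out to PiCloud's workers and
# collect the URLs whose pages contain the target number.
import urllib2  # Python 2 standard library
import cloud    # PiCloud client library

def scan_page(url):
    html = urllib2.urlopen(url).read()
    return url if "12345" in html else None   # "12345" is a placeholder number

urls = ["http://example.com/page%d" % i for i in xrange(1000)]  # placeholders
job_ids = cloud.map(scan_page, urls)                 # run remotely in parallel
matches = [u for u in cloud.result(job_ids) if u is not None]
```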

BrainCore
I just have one workstation with a small RAID array, plus the computers of a few volunteers. I only plan to store the pages that contain the relevant search term, which should be <1% of them.
MMag
The less data to store the better. If you find yourself needing more computational power (parallelism will greatly speed up what you're doing), definitely try out PiCloud.
BrainCore
+2  A: 

You waste a lot of time waiting for network requests when spidering, so you'll definitely want to make your requests in parallel. I would probably save the result data to disk and then have a second process looping over the files searching for the term. That phase could easily be distributed across multiple machines if you needed extra performance.
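A minimal sketch of that two-phase approach, assuming Python 2 (current when this was written); the URL list, output directory, and target number below are all placeholders:

```python
# Phase 1: worker threads download pages to disk.
# Phase 2: a separate pass searches the saved files for the target number.
import os
import threading
import Queue     # "queue" in Python 3
import urllib2   # "urllib.request" in Python 3

URLS = ["http://example.com/page%d.html" % i for i in range(100)]  # placeholders
OUT_DIR = "pages"
TARGET = "12345"                                                    # placeholder
NUM_THREADS = 10

def fetch_worker(q):
    while True:
        try:
            index, url = q.get_nowait()
        except Queue.Empty:
            return
        try:
            html = urllib2.urlopen(url, timeout=30).read()
            with open(os.path.join(OUT_DIR, "%d.html" % index), "wb") as f:
                f.write(html)
        except Exception:
            pass  # a real spider would log and retry failures

if not os.path.isdir(OUT_DIR):
    os.makedirs(OUT_DIR)
q = Queue.Queue()
for item in enumerate(URLS):
    q.put(item)
threads = [threading.Thread(target=fetch_worker, args=(q,))
           for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Phase 2: this loop could just as easily run later, or on another machine.
matches = [name for name in os.listdir(OUT_DIR)
           if TARGET in open(os.path.join(OUT_DIR, name)).read()]
print "%d matching pages" % len(matches)
```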

Adam Pope
+1  A: 

As you are new to Python, I think the following may be helpful for you :)

  • If you are writing a regex to search for a certain pattern in the pages, compile the regex once and reuse the compiled object wherever you can (see the sketch after this list).
  • BeautifulSoup is an HTML/XML parser that may be of some use for your project.
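A small sketch of both tips, assuming BeautifulSoup 3 (current at the time; newer installs would use `from bs4 import BeautifulSoup`) and a placeholder target number:

```python
import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3-style import

# Compile the pattern once, outside the per-page loop, and reuse it.
TARGET_RE = re.compile(r"\b12345\b")      # placeholder for the number you want

def page_matches(html):
    return TARGET_RE.search(html) is not None

def extract_links(html):
    # Pull the hrefs out of the page so the spider knows where to go next.
    soup = BeautifulSoup(html)
    return [a["href"] for a in soup.findAll("a", href=True)]
```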
Satoru.Logic
Fixed some of the formatting of your post, hope you don't mind :) Welcome to SO!
onnodb
thx for your editing :)
Satoru.Logic
A: 

What Adam said. I did this once to map out Xanga's network. The way I made it faster was by keeping a thread-safe set containing all the usernames I had to look up, and then having 5 or so threads making requests at the same time and processing them. You're (most likely) going to spend far more time waiting for a page to download than you will processing any of the text, so just find ways to increase the number of requests you can have in flight at the same time.
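For what it's worth, a rough sketch of that pattern: a lock-protected set of pages still to fetch, shared by a handful of threads (Python 2 style; the URLs and the target number are placeholders).

```python
import threading
import urllib2

pending = set("http://example.com/user%d" % i for i in range(1000))  # placeholders
matches = []
lock = threading.Lock()

def worker():
    while True:
        with lock:
            if not pending:
                return
            url = pending.pop()
        try:
            html = urllib2.urlopen(url, timeout=30).read()
        except Exception:
            continue                      # a real spider would log/retry
        if "12345" in html:               # placeholder target number
            with lock:
                matches.append(url)

threads = [threading.Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print "%d matching pages" % len(matches)
```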

Claudiu