Hello, I'm writing a spider in Python to crawl a site. The trouble is, I need to examine about 2.5 million pages, so I could really use some help optimizing it for speed.

What I need to do is examine each page for a certain number and, if it is found, record the link to that page. The spider itself is very simple; it just needs to sort through a lot of pages.

I'm completely new to Python, but have used Java and C++ before. I have yet to start coding it, so any recommendations on libraries or frameworks to include would be great. Any optimization tips are also greatly appreciated.

Thanks for all the help :)

+4  A: 

You could use MapReduce like Google does, either via Hadoop (specifically with Python: 1 and 2), Disco, or Happy.

The traditional line of thought is to write your program in standard Python; if you find it is too slow, profile it and optimize the specific slow spots. You can make these slow spots faster by dropping down to C, using C/C++ extensions or even ctypes.
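If it helps, here is a minimal profiling sketch using the standard library's cProfile; `fetch_and_scan` and the URLs are placeholders for whatever your spider actually does:

```python
# Minimal profiling sketch; fetch_and_scan() and the URLs are placeholders.
import cProfile
import pstats

def fetch_and_scan(url):
    # Hypothetical: fetch the page and search it for the target number.
    pass

profiler = cProfile.Profile()
profiler.enable()
for url in ["http://example.com/page1", "http://example.com/page2"]:
    fetch_and_scan(url)
profiler.disable()
# Show the 10 biggest time sinks by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```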

If you are spidering just one site, consider using `wget -r` (an example).

John Paulett
Does wget support fetching only the HTML? I don't want to be more of a drain than I have to.
MMag
Yep, the `-A` flag on wget lets you specify the accepted extensions, and `-R` lets you specify the rejected extensions. So you could do something like `wget -R.gif,.jpg,.png -r example.com`
John Paulett
+1 for wget, why reinvent the wheel?
extraneon
Would wget with C++/C# be faster (if multi-threaded) than the equivalent in Python?
MMag
The slowest part of the program is going to be the network connections. Each connection is going to be at least 100ms, and switching from Python to C/C++ is only going to shave a few milliseconds. In the grand scheme of things, Python will be just as fast.
John Paulett
Thanks, I'm going to use wget with C#, as I already know C
MMag
+2  A: 

Spidering somebody's site with millions of requests isn't very polite. Can you instead ask the webmaster for an archive of the site? Once you have that, it's a simple matter of text searching.

Greg Hewgill
The website isn't in English, and I don't speak the language, so I can't ask. Also, it is a very high-traffic image site (another reason I can't really ask), and I only need the HTML of each page, so hopefully I won't be a drain on their servers. I only plan to spider each page once and never come back for updates.
MMag
+2  A: 

Where are you storing the results? You can use PiCloud's cloud library to parallelize your scraping easily across a cluster of servers.
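For illustration, a rough sketch of how that might look with PiCloud's `cloud` module; the `cloud.map`/`cloud.result` calls are assumptions based on PiCloud's documentation, and `scan_page`, the URLs, and the target number are placeholders (Python 2, which PiCloud targeted):

```python
# Rough, untested sketch: push scan_page out to PiCloud's workers and
# collect the URLs whose pages contain the target number.
import urllib2  # Python 2 standard library
import cloud    # PiCloud client library

def scan_page(url):
    html = urllib2.urlopen(url).read()
    return url if "12345" in html else None   # "12345" is a placeholder number

urls = ["http://example.com/page%d" % i for i in xrange(1000)]  # placeholders
job_ids = cloud.map(scan_page, urls)                 # run remotely in parallel
matches = [u for u in cloud.result(job_ids) if u is not None]
```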

BrainCore
I just have one workstation with a small RAID array, plus the computers of a few volunteers. I only plan to store the pages that contain the relevant search term, which should be <1% of them.
MMag
The less data to store the better. If you find yourself needing more computational power (parallelism will greatly speed up what you're doing), definitely try out PiCloud.
BrainCore
+2  A: 

You waste a lot of time waiting for network requests when spidering, so you'll definitely want to make your requests in parallel. I would probably save the result data to disk and then have a second process looping over the files searching for the term. That phase could easily be distributed across multiple machines if you needed extra performance.
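A minimal sketch of that two-phase approach, assuming Python 2 (current when this was written); the URL list, output directory, and target number below are all placeholders:

```python
# Phase 1: worker threads download pages to disk.
# Phase 2: a separate pass searches the saved files for the target number.
import os
import threading
import Queue     # "queue" in Python 3
import urllib2   # "urllib.request" in Python 3

URLS = ["http://example.com/page%d.html" % i for i in range(100)]  # placeholders
OUT_DIR = "pages"
TARGET = "12345"                                                    # placeholder
NUM_THREADS = 10

def fetch_worker(q):
    while True:
        try:
            index, url = q.get_nowait()
        except Queue.Empty:
            return
        try:
            html = urllib2.urlopen(url, timeout=30).read()
            with open(os.path.join(OUT_DIR, "%d.html" % index), "wb") as f:
                f.write(html)
        except Exception:
            pass  # a real spider would log and retry failures

if not os.path.isdir(OUT_DIR):
    os.makedirs(OUT_DIR)
q = Queue.Queue()
for item in enumerate(URLS):
    q.put(item)
threads = [threading.Thread(target=fetch_worker, args=(q,))
           for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Phase 2: this loop could just as easily run later, or on another machine.
matches = [name for name in os.listdir(OUT_DIR)
           if TARGET in open(os.path.join(OUT_DIR, name)).read()]
print "%d matching pages" % len(matches)
```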

Adam Pope
+1  A: 

As you are new to Python, I think the following may be helpful for you :)

  • If you are writing a regex to search for a certain pattern in the pages, compile the regex once and reuse the compiled object wherever you can (see the sketch after this list).
  • BeautifulSoup is an HTML/XML parser that may be of some use for your project.
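A small sketch of both tips, assuming BeautifulSoup 3 (current at the time; newer installs would use `from bs4 import BeautifulSoup`) and a placeholder target number:

```python
import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3-style import

# Compile the pattern once, outside the per-page loop, and reuse it.
TARGET_RE = re.compile(r"\b12345\b")      # placeholder for the number you want

def page_matches(html):
    return TARGET_RE.search(html) is not None

def extract_links(html):
    # Pull the hrefs out of the page so the spider knows where to go next.
    soup = BeautifulSoup(html)
    return [a["href"] for a in soup.findAll("a", href=True)]
```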
Satoru.Logic
Fixed some of the formatting of your post, hope you don't mind :) Welcome to SO!
onnodb
thx for your editing :)
Satoru.Logic
A: 

What Adam said. I did this once to map out Xanga's network. The way I made it faster was by keeping a thread-safe set containing all the usernames I had to look up, and then having 5 or so threads making requests at the same time and processing them. You're (most likely) going to spend far more time waiting for a page to download than you will processing any of the text, so just find ways to increase the number of requests you can have in flight at the same time.
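For what it's worth, a rough sketch of that pattern: a lock-protected set of pages still to fetch, shared by a handful of threads (Python 2 style; the URLs and the target number are placeholders).

```python
import threading
import urllib2

pending = set("http://example.com/user%d" % i for i in range(1000))  # placeholders
matches = []
lock = threading.Lock()

def worker():
    while True:
        with lock:
            if not pending:
                return
            url = pending.pop()
        try:
            html = urllib2.urlopen(url, timeout=30).read()
        except Exception:
            continue                      # a real spider would log/retry
        if "12345" in html:               # placeholder target number
            with lock:
                matches.append(url)

threads = [threading.Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print "%d matching pages" % len(matches)
```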

Claudiu