views:

57

answers:

1

As the title says, I'm using the BeautifulSoup module in Python to parse XML pages that I fetch from the Amazon API (I create the signed URL, load it with urllib2, and then parse it with BeautifulSoup).

It takes about 4 seconds to process two pages, but there has to be a faster way.

Would PHP be faster? What's making it slow, the BeautifulSoup parsing or the urllib2 loading?

+2  A: 

If you want to find out what's making it slow, use one of the profilers. I suspect it's the network access (and the underlying database retrieval on Amazon's side) that's slower than the rest.

Jason R. Coombs
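A minimal sketch of what that profiling could look like with the standard-library cProfile module; fetch_and_parse here is a hypothetical stand-in for the real fetch-and-parse code, not anything from the thread:

```python
import cProfile
import io
import pstats

def fetch_and_parse():
    """Stand-in for the real fetch-and-parse code (hypothetical)."""
    return sum(i * i for i in range(1000))

# Profile a single run of the function under test.
profiler = cProfile.Profile()
profiler.enable()
fetch_and_parse()
profiler.disable()

# Sort by cumulative time so slow calls (like socket.recv) float to the top.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Calls that dominate the "cumtime" column are where the program actually spends its wall-clock time.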
Thanks, Jason, that's pretty handy! I'm guessing "Total Time" is what I'm looking for, and I noticed this: 1488 13.265 0.009 13.265 0.009 {method 'recv' of '_socket.socket' objects}. At 13.265 seconds, I'd bet this is the culprit. May I ask what it means, though?
Mike J
@Mike, recv = receive. Usually a blocking call to the socket. Meaning that it's simply waiting for either amazon to respond or for all the data to arrive (depends on how much data and how much bandwidth is available).
wds
@wds thanks! So I guess I don't really have a choice about the speed?
Mike J
@Mike: not really; cache aggressively
wds
@Mike You could retrieve multiple pages in parallel.
Fabian
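A minimal sketch of parallel fetching with a thread pool (threads suit I/O-bound work like downloads); the fetch function and URLs are hypothetical placeholders for the real urllib2 calls so the example runs without network access:

```python
from multiprocessing.dummy import Pool  # thread pool; fine for I/O-bound work

def fetch(url):
    """Stand-in for a real urllib2 / urllib.request download (hypothetical)."""
    return "<xml>%s</xml>" % url

urls = [
    "http://example.com/page1",
    "http://example.com/page2",
]

# Fetch all pages concurrently instead of one after another; while one
# request is waiting on the network, the others can make progress.
pool = Pool(4)
pages = pool.map(fetch, urls)  # results come back in the same order as urls
pool.close()
pool.join()
```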
@wds and @Fabian, thanks! Would you know of a good place to help me get started learning about caching and parallel fetching?
Mike J
@Mike caching -> depends on how many files you have and how many times you need them. If it's not too many, consider just holding them in a dictionary. For a more robust solution you might consider memcached. As for parallel fetching, I'd probably go with multiprocessing to handle the different fetchers and their I/O: http://docs.python.org/library/multiprocessing.html
wds
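The dictionary cache suggested above can be sketched in a few lines; fetch is again a hypothetical stand-in for the real download call:

```python
# Minimal in-memory cache: keep fetched pages in a dict keyed by URL, so
# repeated requests for the same page skip the network entirely.
cache = {}

def fetch(url):
    """Stand-in for the real download call (hypothetical)."""
    return "<xml>%s</xml>" % url

def fetch_cached(url):
    if url not in cache:
        cache[url] = fetch(url)  # slow path: hit the network once
    return cache[url]            # fast path: served from memory

first = fetch_cached("http://example.com/page1")   # goes to the network
second = fetch_cached("http://example.com/page1")  # cache hit
```

This only helps within one process run; for a cache shared across processes or machines, something like memcached is the usual next step.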