I'm consuming (via urllib/urllib2) an API that returns XML results. The API always returns the total_hit_count for my query, but only allows me to retrieve results in batches of, say, 100 or 1000. The API stipulates I need to specify a start_pos and end_pos for offsetting this, in order to walk through the results.

Say the urllib request looks like "http://someservice?query='test'&start_pos=X&end_pos=Y".

If I send an initial 'taster' query with minimal data transfer, such as http://someservice?query='test'&start_pos=1&end_pos=1, and get back a result of, say, total_hits = 1234, I'd like to work out the cleanest approach to requesting those 1234 results in batches of, again say, 100 or 1000 or...

This is what I came up with so far, and it seems to work, but I'd like to know if you would have done things differently or if I could improve upon this:

hits_per_page = 100  # or 1000 or 200 or whatever, adjustable
total_hits = 1234    # retrieved with BSoup from the 'taster' query
base_url = "http://someservice?query='test'"
startdoc_positions = range(1, total_hits, hits_per_page)
enddoc_positions = [start + hits_per_page - 1 for start in startdoc_positions]
for start, end in zip(startdoc_positions, enddoc_positions):
    if end > total_hits:
        end = total_hits  # clamp the final batch to the total
    print "url to request is:\n ",
    print "%s&start_pos=%s&end_pos=%s" % (base_url, start, end)

p.s. I'm a long time consumer of StackOverflow, especially the Python questions, but this is my first question posted. You guys are just brilliant.

+1  A: 

I'd suggest using

positions = ((n, n + hits_per_page - 1) for n in xrange(1, total_hits, hits_per_page))
for start, end in positions:

and then not worry about whether end exceeds total_hits, unless the API you're using really cares whether you request something out of range; most will handle this case gracefully.
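If it turns out the API does care, the clamp folds into the same expression; a sketch along those lines:

positions = ((n, min(n + hits_per_page - 1, total_hits))
             for n in xrange(1, total_hits, hits_per_page))
for start, end in positions:
    # build and request the URL for this batch here
    print "%s&start_pos=%s&end_pos=%s" % (base_url, start, end)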

P.S. Check out httplib2 as a replacement for the urllib/urllib2 combo.
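
Basic usage looks like this; passing a directory name to Http() turns on its built-in response caching:

import httplib2

h = httplib2.Http(".cache")
response, content = h.request("%s&start_pos=1&end_pos=1" % base_url)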

Hank Gay
A slice of fried gold, thank you. I doff my hat. Now, how do I 'virtually' doff my hat to your excellent input?
craigs
The upvote was a nice start ;-)
Hank Gay
+1  A: 

It might be worth using a generator for this scenario, so the caller can simply iterate over the results without managing the batching itself.

import urllib2

def getitems(base_url, per_page=100):
    # get_total_hits and parse_items are placeholders for your own
    # XML parsing (e.g. with BeautifulSoup)
    content = urllib2.urlopen("%s&start_pos=1&end_pos=1" % base_url).read()
    total_hits = get_total_hits(content)
    sofar = 0
    while sofar < total_hits:
        url = "%s&start_pos=%d&end_pos=%d" % (base_url, sofar + 1, sofar + per_page)
        items_from_next_query = parse_items(urllib2.urlopen(url).read())
        for item in items_from_next_query:
            sofar += 1
            yield item

The parsing helpers are still placeholders, but this could prove quite useful if you need to do this many times, since it simplifies the logic needed to get the items, and yielding them one at a time is quite natural in Python.

It also saves you quite a bit of duplicate code.
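
Consuming it then looks like an ordinary loop (process being whatever you do with each result):

for item in getitems("http://someservice?query='test'"):
    process(item)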

xyld