I'm consuming (via urllib/urllib2) an API that returns XML results. The API always returns the total_hit_count for my query, but only allows me to retrieve results in batches of, say, 100 or 1000. The API stipulates that I need to specify a start_pos and end_pos as offsets in order to walk through the results.
Say the urllib request looks like "http://someservice?query='test'&start_pos=X&end_pos=Y".
If I send an initial 'taster' query with the lowest possible data transfer, such as http://someservice?query='test'&start_pos=1&end_pos=1, and get back a result of, for the sake of argument, total_hits = 1234, I'd like to work out the cleanest way to request those 1234 results in batches of, again say, 100 or 1000 or...
This is what I came up with so far, and it seems to work, but I'd like to know if you would have done things differently or if I could improve upon this:
hits_per_page = 100  # or 1000 or 200 or whatever, adjustable
total_hits = 1234  # retrieved with BSoup from the 'taster' query
base_url = "http://someservice?query='test'"
# Positions are 1-based, so stop at total_hits + 1; otherwise a final
# one-item batch (e.g. total_hits = 1201) would be skipped.
startdoc_positions = range(1, total_hits + 1, hits_per_page)
enddoc_positions = [start + hits_per_page - 1 for start in startdoc_positions]
for start, end in zip(startdoc_positions, enddoc_positions):
    # Clamp the last batch so it does not run past the final hit.
    if end > total_hits:
        end = total_hits
    print("url to request is:\n  %s&start_pos=%s&end_pos=%s" % (base_url, start, end))
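For comparison, here is a minimal sketch of the same batching idea written as a small generator, so the start/end arithmetic lives in one place. The names batch_ranges and batch_urls are hypothetical helpers made up for illustration, and it assumes the API's positions are 1-based and inclusive, as in the taster query above. It only builds the URLs; nothing is actually fetched.

```python
def batch_ranges(total_hits, hits_per_page):
    """Yield 1-based, inclusive (start_pos, end_pos) pairs covering total_hits."""
    # Stop at total_hits + 1 so a final single-item batch is still included.
    for start in range(1, total_hits + 1, hits_per_page):
        # Clamp the end so the last batch does not run past the final hit.
        yield start, min(start + hits_per_page - 1, total_hits)


def batch_urls(base_url, total_hits, hits_per_page):
    """Build one request URL per batch (hypothetical helper, no network access)."""
    return ["%s&start_pos=%d&end_pos=%d" % (base_url, start, end)
            for start, end in batch_ranges(total_hits, hits_per_page)]


urls = batch_urls("http://someservice?query='test'", 1234, 100)
print(len(urls))   # 13
print(urls[-1])    # http://someservice?query='test'&start_pos=1201&end_pos=1234
```

Keeping the ranges in a generator means the actual request loop can just iterate over the pairs, and changing hits_per_page only touches one call.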
P.S. I'm a long-time consumer of StackOverflow, especially the Python questions, but this is my first question posted. You guys are just brilliant.