So say I'm using BeautifulSoup to parse pages and my code figures out that there are at least 7 pages to a query.

The pagination looks like

 1 2 3 4 5 6 7 Next

If I paginate all the way to 7, sometimes there are more than 7 pages, so that if I am on page 7, the pagination looks like

 1 2 3 ... 7 8 9 10 Next

So now I know there are at least 3 more pages. I use an initial pass to figure out how many pages there are, i.e. get_num_pages returns 7.

What I am doing is iterating over items on each page so I have something like

for page in range(1,num_pages + 1):
  # do some stuff here

Is there a way to dynamically update the range if the script figures out there are more than 7 pages? I guess another approach is to keep a count and as I get to page 7, handle that separately. I'm looking for suggestions and solutions for the best way to approach this.

+6  A: 

You could probably create a generator that has mutable state that determines when it terminates... but what about something simple like this?

page = 1
while page < num_pages + 1:
    # do stuff that possibly updates num_pages here
    page += 1
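
A runnable sketch of the same loop, under stated assumptions: `visible_last_page` is a hypothetical stand-in for re-parsing the pagination bar of the current page (it is not from the question), and the 10-page site that reveals three pages ahead is made up purely for the demonstration.

```python
TOTAL_PAGES = 10  # made-up size of the simulated site


def visible_last_page(page):
    """Hypothetical helper: highest page number shown in the pagination bar of `page`.

    Simulated rule: before page 7 the bar ends at 7; from page 7 onwards it
    shows up to three pages ahead. A real scraper would parse the page with
    BeautifulSoup here instead.
    """
    if page < 7:
        return min(7, TOTAL_PAGES)
    return min(page + 3, TOTAL_PAGES)


def scrape_all(num_pages):
    """Visit pages 1..num_pages, raising num_pages whenever a page reveals more."""
    visited = []
    page = 1
    while page <= num_pages:
        visited.append(page)  # "do stuff" with the page here
        # The bound is re-read on every iteration, so growing it extends the loop.
        num_pages = max(num_pages, visible_last_page(page))
        page += 1
    return visited


pages_seen = scrape_all(num_pages=7)
print(pages_seen)
```

The difference from `for page in range(...)` is that the range is built once up front, while the `while` condition is evaluated again on every pass.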
John
+1: It was never a range to begin with.
S.Lott
+2  A: 

Here's a code-free answer, but I think it's simple if you take advantage of what Beautiful Soup lets you do:

To start with, on the first page you have somewhere the page numbers & links; from your question they look like this:

1 2 3 4 5 6 7 [next]

Different sites handle paging differently, some give a link to jump to beginning/end, but on yours you say it looks like this after the first 7 pages:

1 2 3 ... 7 8 9 10 [next]

Now, at some point, you will get to the end, it's going to look like this:

1 2 3 ... 20 21 22 23

Notice there's no [next] link.

So forget about generators and ranges and keeping track of intermediate ranges, etc. Just do this:

  1. Use Beautiful Soup to identify the page number links on a given page, along with the [next] link.
  2. Every time you see a [next] link, follow it and reparse with Beautiful Soup.
  3. When you hit a page where there is no [next] link, the last numbered page link is the total number of pages.
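
The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the names `find_next_href`, `count_pages`, and `FAKE_SITE` are made up here, the next link is assumed to be an `<a>` tag whose text is exactly "Next", and its `href` is assumed to be directly usable by `fetch` (a real site may need `urllib.parse.urljoin`). The sketch counts the pages it visits rather than reading the last page number link.

```python
def find_next_href(html):
    """Return the href of the link whose text is 'Next', or None on the last page."""
    # Imported here so the crawl loop below has no hard dependency on bs4.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a"):
        if link.get_text(strip=True) == "Next":
            return link.get("href")
    return None


def count_pages(first_url, fetch, find_next=find_next_href):
    """Follow 'Next' links from first_url and return how many pages were visited.

    `fetch` is a caller-supplied function mapping a URL to its HTML text.
    """
    url, pages, seen = first_url, 0, set()
    while url is not None and url not in seen:  # `seen` guards against link cycles
        seen.add(url)
        html = fetch(url)
        pages += 1  # process the items on this page here
        url = find_next(html)
    return pages


# Made-up three-page site for demonstration; real code would fetch over HTTP.
FAKE_SITE = {
    "/q?page=1": '<a href="/q?page=1">1</a> <a href="/q?page=2">Next</a>',
    "/q?page=2": '<a href="/q?page=2">2</a> <a href="/q?page=3">Next</a>',
    "/q?page=3": '<a href="/q?page=3">3</a>',
}
```

With Beautiful Soup installed, `count_pages("/q?page=1", FAKE_SITE.get)` walks the fake site until a page has no "Next" link. No page count is needed in advance, so the pagination bar changing between pages stops being a problem.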
+1  A: 

I like John's while-based solution, but to use a for you could do something like:

pages = list(range(1, num_pages + 1))  # list() is needed on Python 3, where range has no extend()
for p in pages:
    # ...possibly pages.extend(range(something, something)) here...

That is, you have to give a name to the list you're looping over, so you can extend it when needed. Changing the container you're iterating on is normally frowned upon, but in this specific and highly constrained case it can actually be a useful idiom.
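
A minimal runnable sketch of this idiom, with made-up pieces: `extra_pages_found_on` is a hypothetical helper standing in for whatever parsing discovers new page numbers, and the 10-page site is simulated.

```python
def extra_pages_found_on(page, known_last):
    """Hypothetical helper: page numbers newly revealed while processing `page`.

    Simulated rule for a made-up 10-page site: reaching the last known page
    reveals up to three more.
    """
    total = 10
    if page == known_last and known_last < total:
        return list(range(known_last + 1, min(known_last + 3, total) + 1))
    return []


num_pages = 7
pages = list(range(1, num_pages + 1))  # a real list, so it can grow
visited = []
for p in pages:
    visited.append(p)  # process page p here
    # Appending during iteration is safe here because the for loop walks the
    # list by index and we only ever add items past the current position.
    pages.extend(extra_pages_found_on(p, known_last=pages[-1]))

print(visited)
```

This only works when items are appended at the end; inserting or removing earlier elements while looping would skip or repeat pages.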

Alex Martelli