Hello,

I'm trying to create a pagination index view in CouchDB that lists the doc._id for every Nth document found.

I wrote the following map function, but the pageIndex variable doesn't reliably start at 1; in fact it seems to change arbitrarily depending on the emitted value or the index length (e.g. 50, 55, 10, 25, each starting with a different document), though I seem to get the correct number of documents emitted.

function(doc) {
  if (doc.type == 'log') {
    if (!pageIndex || pageIndex > 50) {
      pageIndex = 1;
      emit(doc.timestamp, null);
    }
    pageIndex++;
  }
}

What am I doing wrong here? How would a CouchDB expert build this view?

Note that I don't want to use the "startkey + count + 1" method that's been mentioned elsewhere, since I'd like to be able to jump to a particular page or the last page (user expectations and all), I'd like to have a friendly "?page=5" URI instead of "?startkey=348ca1829328edefe3c5b38b3a1f36d1e988084b", and I'd rather CouchDB did this work instead of bulking up my application, if I can help it.

Thanks!

+2  A: 

View functions (map and reduce) are purely functional. Side-effects such as setting a global variable are not supported. (When you move your application to BigCouch, how could multiple independent servers with arbitrary subsets of the data know what pageIndex is?)

Therefore the answer will have to involve a traditional map function, perhaps keyed by timestamp.

function(doc) {
  if (doc.type == 'log') {
    emit(doc.timestamp, null);
  }
}

How can you get every 50th document? The simplest way is to add a skip parameter (skip=0, skip=50, skip=100, and so on). However, that is not ideal (see below).
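For example, a page number translates into skip and limit values like this (a sketch; the page size of 50 matches the grouping discussed here, and the view name in the comment is made up):

```javascript
// Convert a 1-based page number into CouchDB view query parameters.
function pageParams(page, pageSize) {
  return {
    skip: (page - 1) * pageSize,  // rows to pass over
    limit: pageSize               // rows to return
  };
}

// pageParams(3, 50) -> { skip: 100, limit: 50 }
// which would map to a query such as:
// GET /mydb/_design/app/_view/by_timestamp?skip=100&limit=50
```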

A way to pre-fetch the exact IDs of every 50th document is a _list function which only outputs every 50th row. (In practice you could use Mustache.JS or another template library to build HTML.)

function(head, req) {
  var pageIndex = 0,
      firstRow = true,
      row;

  send("[");
  while ((row = getRow())) {
    if (pageIndex % 50 === 0) {
      // A comma between elements keeps the output valid JSON.
      if (!firstRow) send(",");
      send(JSON.stringify(row));
      firstRow = false;
    }
    pageIndex += 1;
  }
  send("]");
}

This will work for many situations; however, it is not perfect. Here are some considerations to keep in mind: not necessarily showstoppers, but it depends on your specific situation.

There is a reason the pretty URLs are discouraged. What does it mean if I load page 1, then a bunch of documents are inserted within the first 50, and then I click to page 2? If the data is changing a lot, there is no perfect user experience; the user must somehow feel the data changing.

The skip parameter and example _list function have the same problem: they do not scale. With skip you are still touching every row in the view starting from the beginning: finding it in the database file, reading it from disk, and then ignoring it, over and over, row by row, until you hit the skip value. For small values that's quite convenient, but since you are grouping pages into sets of 50, I have to imagine that you will have thousands or more rows. That could make page views slow as the database is spinning its wheels most of the time.

The _list example has a similar problem; however, you front-load all the work, running through the entire view from start to finish, and (presumably) sending the relevant document IDs to the client so it can quickly jump around the pages. But with hundreds of thousands of documents (you call them "log" so I assume you will have a ton) that will be an extremely slow query, and its result is not cached.
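On the client side, the rows that _list query returns can be turned into a page lookup table, so jumping to "?page=5" becomes a direct keyed view query instead of a deep skip (a sketch; the row fields follow the map function above, which emits doc.timestamp as the key):

```javascript
// rows: the JSON array produced by the _list function, i.e. every
// 50th view row, each with a "key" (timestamp) and "id" (doc._id).
function buildPageTable(rows) {
  var table = {};
  for (var i = 0; i < rows.length; i++) {
    // Row i marks the first document of page i+1 (pages are 1-based).
    table[i + 1] = {
      startkey: rows[i].key,
      startkey_docid: rows[i].id
    };
  }
  return table;
}
```

For instance, buildPageTable(rows)[5] yields the startkey and startkey_docid parameters to fetch page 5 with limit=50, touching only the rows on that page.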

In summary, for small data sets, you can get away with the page=1, page=2 form; however, you will bump into problems as your data set grows. With the release of BigCouch, CouchDB is even better for log storage and analysis, so (if that is what you are doing) you will definitely want to consider how high to scale.

jhs
Excellent answer, thanks! I assume it would be less scalable than your _list method to take a cached index of all doc._ids (as in your first map function) and basically make the application do the heavy lifting (and maybe write it back into a cache document)?
Andrew
A cache document is reasonable. Perhaps it could have a timestamp value so you know when it's too old. You could also fetch and build that cache document yourself (i.e. don't require that users make the document, they only read it).
jhs
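The cache document suggested above might look something like this (a sketch only; every field name here is made up, and the timestamps are placeholder values):

```javascript
// Hypothetical cache document holding the startkey for each page,
// plus a timestamp so the application can tell when it is stale.
var pageCache = {
  _id: "page-index-cache",          // fixed ID so it is easy to fetch
  updated_at: "2010-08-01T12:00:00Z",
  page_size: 50,
  page_keys: [                      // startkey for page 1, page 2, ...
    "2010-07-01T00:00:00Z",
    "2010-07-03T09:30:00Z"
  ]
};
```

The application would rebuild and re-save this document periodically (or when updated_at is too old), and readers would only ever fetch it by its fixed _id.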