CouchDB makes it convenient to develop CouchApps locally and then push them to a remote production server. Unfortunately, with production-sized data sets, working on views can be cumbersome.

What are good ways to take samples of a CouchDB database for use in local development?

+2  A: 

The answer is filtered replication. I like to do this in two parts:

  1. Replicate the production database, example_db, to my local server as example_db_full.
  2. Perform filtered replication from example_db_full to example_db, where the filter cuts out enough data that builds are fast but keeps enough that I can confirm my code works.
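
As a rough sketch, the two steps above could be expressed as JSON bodies for CouchDB's `_replicate` endpoint on the local server. The production URL, the design document name `sample`, and the filter name `random` are hypothetical placeholders, not names from this answer. The sketch also assumes the filter function is stored under `filters.random` in `_design/sample` inside example_db_full, since a replication filter is read from the source database.

```javascript
// Hypothetical production URL; substitute your own server.
var PRODUCTION_URL = "http://production.example.com:5984/example_db";

// Step 1: copy everything from production into a local full copy.
function fullReplication(productionUrl) {
  return {
    source: productionUrl,
    target: "example_db_full",
    create_target: true
  };
}

// Step 2: copy a sample from the full copy into the working database.
// "sample/random" means the "random" filter in _design/sample (assumed names).
function sampledReplication(fraction) {
  return {
    source: "example_db_full",
    target: "example_db",
    create_target: true,
    filter: "sample/random",
    query_params: { p: String(fraction) } // reaches the filter as req.query.p
  };
}

// Each body would be POSTed as application/json to /_replicate.
console.log(JSON.stringify(fullReplication(PRODUCTION_URL)));
console.log(JSON.stringify(sampledReplication(0.01)));
```

Keeping example_db_full around means the sample can be regenerated with a different fraction without pulling from production again.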

Which documents to select can be application-specific. At this time, I am satisfied with a simple random pass/fail with a percentage that I can specify. The randomness is consistent (i.e., the same document revision always passes or always fails).

My technique is to normalize the content checksum in the document's _rev field to the range [0.0, 1.0). I then specify a fraction (e.g., 0.01), and a document passes if its normalized checksum is <= that fraction.

function(doc, req) {
  if(/^_design\//.test(doc._id))
    return true;

  if(!req.query.p)
    throw {error: "Must supply a 'p' parameter with the fraction"
                  + " of documents to pass [0.0-1.0]"};

  var p = parseFloat(req.query.p);
  if(!(p >= 0.0 && p <= 1.0)) // Also catches NaN
    throw {error: "Must supply a 'p' parameter with the fraction of documents"
                  + " to pass [0.0-1.0]"};

  // Consider the first 8 characters of the doc checksum (for now, taken
  // from _rev) as a real number on the range [0.0, 1.0), i.e.
  // ["00000000", "ffffffff").
  var ONE = 4294967295; // parseInt("ffffffff", 16);
  var doc_val = parseInt(doc._rev.match(/^\d+-([0-9a-f]{8})/)[1], 16);

  return doc_val <= (ONE * p);
}
jhs
It dawns on me that my final implementation does not have the [0.0, 1.0) property of my initial idea. It really compares hex integers in the range 00... to ff.... But the principle is the same.
jhs
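
For comparison with the integer-based filter, here is a minimal sketch of the normalized variant that the comment refers to. It divides the first 8 hex digits of the checksum by 2^32 so the value falls on [0.0, 1.0). The helper names `revFraction` and `passes` are made up for illustration, and the null guard for non-matching revision strings is an added assumption that the posted filter does not have.

```javascript
// Map the first 8 hex digits of the _rev checksum to a number in [0.0, 1.0).
function revFraction(rev) {
  var m = /^\d+-([0-9a-f]{8})/.exec(rev);
  if (!m) return null;                    // not a "N-<hex>" revision string
  return parseInt(m[1], 16) / 4294967296; // divide by 2^32
}

// A document passes when its normalized checksum is <= the fraction p.
function passes(rev, p) {
  var f = revFraction(rev);
  return f !== null && f <= p;
}

// Made-up revision strings, chosen so the arithmetic is easy to follow.
console.log(revFraction("1-80000000000000000000000000000000")); // 0.5
console.log(passes("1-00000000000000000000000000000000", 0.01)); // true
console.log(passes("1-80000000000000000000000000000000", 0.01)); // false
```

Because the value depends only on `_rev`, a given revision gives the same result on every run, but updating a document changes its `_rev` and may change whether it passes.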