The answer is filtered replication. I like to do this in two parts:
- Replicate the production database, `example_db`, to my local server as `example_db_full`
- Perform filtered replication from `example_db_full` to `example_db`, where the filter cuts out enough data so builds are fast, but keeps enough data so I can confirm my code works.
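The first step is an ordinary unfiltered replication. As a sketch, the `_replicate` request body might look like this (the production hostname is a placeholder; `continuous` is optional but keeps `example_db_full` up to date):

```json
{
  "source": "https://production.example.com/example_db",
  "target": "example_db_full",
  "continuous": true
}
```

POST that to your local CouchDB's `/_replicate` endpoint and the full copy stays current in the background.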
Which documents to select is application-specific. For now, I am satisfied with a simple random pass/fail at a percentage I can specify. The randomness is consistent (i.e., the same document always passes or always fails).
My technique is to normalize the content checksum in the document's `_rev` field onto the range [0.0, 1.0). Then I simply specify some fraction (e.g. `0.01`), and if the normalized checksum value is <= my fraction, the document passes.
```javascript
function(doc, req) {
  // Always replicate design documents.
  if (/^_design\//.test(doc._id))
    return true;

  if (!req.query.p)
    throw {error: "Must supply a 'p' parameter with the fraction"
                + " of documents to pass [0.0-1.0]"};

  var p = parseFloat(req.query.p);
  if (!(p >= 0.0 && p <= 1.0)) // Also catches NaN
    throw {error: "Must supply a 'p' parameter with the fraction of documents"
                + " to pass [0.0-1.0]"};

  // Consider the first 8 characters of the doc checksum (for now, taken
  // from _rev) as a real number on the range [0.0, 1.0), i.e.
  // ["00000000", "ffffffff").
  var ONE = 4294967295; // parseInt("ffffffff", 16)
  var doc_val = parseInt(doc._rev.match(/^\d+-([0-9a-f]{8})/)[1], 16);
  return doc_val <= (ONE * p);
}
```
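To see the determinism in isolation, here is a standalone sketch of the same pass/fail decision that runs in plain Node with no CouchDB involved; `passes` is a hypothetical helper name, not part of the filter above:

```javascript
// Standalone sketch of the filter's decision: take the first 8 hex
// digits of the _rev checksum, scale onto [0.0, 1.0) by dividing by
// 0xffffffff, and compare against the fraction p.
function passes(rev, p) {
  var ONE = 4294967295; // parseInt("ffffffff", 16)
  var val = parseInt(rev.match(/^\d+-([0-9a-f]{8})/)[1], 16);
  return val <= ONE * p;
}

var rev = "3-9f1a2b3c4d5e6f708192a3b4c5d6e7f8";
// The verdict depends only on the checksum, so it is the same on
// every run: a given document always passes or always fails.
console.log(passes(rev, 0.01));
```

To wire the real filter in, store it in a design document (say `_design/repl` under `filters.random`; both names are placeholders) and reference it in the replication request with `"filter": "repl/random"` and `"query_params": {"p": "0.01"}`, which is how `req.query.p` gets its value.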