tags:

views:

62

answers:

2

I've been wondering about this for some time. In CouchDB we have some fairly log IDs...eg:

"000ab56cb24aef9b817ac98d55695c6a"

Now if we're searching for this item and going through the tree structure created by the view. It seems a simple integer as an id would be much faster. If we used 64bit integers it would be a simple CMP followed by a JMP (assuming that the Erlang code was using JIT, but you get my point).

For strings, I assume we generate a hash off the ID or something, but at some point we have to do a character compare on all 33 characters...won't that affect performance?

+1  A: 

From the CouchDB: The definitive guide book:

I need to draw a picture of this at some point, but the reason is if you think of the idealized btree, when you use UUID’s you might be hitting any number of root nodes in that tree, so with the append only nature you have to write each of those nodes and everything above it in the tree. but if you use monotonically increasing id’s then you’re invalidating the same path down the right hand side of the tree thus minimizing the number of nodes that need to be rewritten. would be just the same for monotonically decreasing as well. and it should technically work if you’re updates can be guaranteed to hit one or two nodes in the inside of the tree, though that’s much harder to prove.

So sequential IDs offer a performance benefit, however, you must remember this isn't maintainable when you have more than one database, as the IDs will collide.

cloudhead
+1  A: 

The short answer is, yes, of course it will affect performance, because the key length will directly impact the time it takes to walk down the tree.

It also affects storage, as longer keys take more space, space takes time.

However, the nuance you are missing is that while Couch CAN (and does) allocated new IDs for you, it is not required to. It will be more than happy to accept your own IDs rather than generate it's own. So, if the key length bothers you, you are free to use shorter keys.

However, given the "json" nature of couch, it's pretty much a "text" based database. There's isn't a lot of binary data stored in a normal Couch instance (attachments not withstanding, but even those I think are stored in BASE64, I may be wrong).

So, while, yes an 64-bit would be the most efficient, the simple fact is that Couch is designed to work for any key, and "any key" is most readily expressed in text.

Finally, truth be told, the cost of the key compare is dwarfed by the disk I/O fetch times, and the JSON marshaling of data (especially on writes). Any real gain achieved by converting to such a system would likely have no "real world" impact on overall performance.

If you want to really speed up the Couch key system, code the key routine to block the key in to 64Bit longs, and comapre those (like you said). 8 bytes of text is the same as a 64 bit "long int". That would give you, in theory, an 8x performance boost on key compares. Whether erlang can create such code, I can't say.

Will Hartung