views:

458

answers:

3

Hi, I'm writing a simple content management system. I need to store SHA1 hash values that are computed externally as the primary key for my biggest table.

I can obviously use a sequence as a primary key and index the SHA1 hex-string for look-up... However, I'm looking for a more elegant solution, where I will simply use the 20-byte SHA1 computed values as the given key to the rows I am about to insert/delete/update in the database table. Is there an efficient storage type that I can use to store and later on use the SHA1 keys as primary keys?

I will obviously need postgres to support using 20-byte values as keys to get this done.

Anyone with any ideas?

+1  A: 

You could either convert to hex or base64 and use a varchar column, or try just storing it in a bytea-typed column. I'd try making tables with a bunch of random values in both formats and see how they perform.

See the PostgreSQL docs on bytea for info on that type.

Walter Mundt
A: 

Be careful with what this can do to your index btrees. Since the SHA1 won't be sequential, your writes will be very slow due to all the jumping around in the btree.

If a sequence won't work, I usually would recommend a sequential GUID/UUID (see SQL Server's NEWSEQUENTIALID() for example) of some sort.

If you want to make the SHA1 your primary key after knowing this, you can convert it to a standard hex format that SHA1 is usually shown in (makes it easy to type). I wouldn't recommend a binary format as you won't be able to type it for debugging, etc.

wojo
Writes to a `B-Tree` will be sequential anyway, it's the searching for the page to link with that will jump around. However, even distribution of the values will make the tree more balanced and the searching faster, not slower.
Quassnoi
I guess I was referring to the way some database servers order pages according to the clustered index, but that's SQL Server, I don't know if it applies to pgsql. Hmm! But you're right, the tree will be balanced very well (almost perfectly)
wojo
`@wojo`: Even with clustered tables, `SQL Server` keeps a `B-Tree` order, not the physical order. The rows are not necessarily ordered physically, only logically. http://msdn.microsoft.com/en-us/library/ms177443(SQL.90).aspx
Quassnoi
@Quassnoi hmm, it was my understanding that with a clustered index that the pages on disk are ordered according to the index, and with random values likes GUIDs or SHA1s, you'll end up with both logical and physical fragmentation from page splits on inserts. is this not true? Similar questions for GUIDs (which would apply to any hash): http://stackoverflow.com/search?q=clustered+index+guid
wojo
`@wojo`: Inserts of sequential values won't result in page splits but will result in an unbalanced `B-Tree` (actually, a `B+Tree`). There is no such thing as a "logical fragmentation" in a `B+Tree` index (it's page-fragmented by design so that you should always traverse the linked list to get to the next page). This linked list will point to the next page (in physical order) when the values are sequential, and to some random page (probably far away) if they are not.
Quassnoi
But you should only get to the next page when searching for more then one value: if you are doing an `EQ-REF` search (`id = value`), one page is enough. A physical fragmentation is, therefore, only bad when you do range queries (`id BETWEEN start AND end`) which will be never used for `SHA1` hashes. When you are not doing the searches like that (and you will never do for `SHA1` hashes), having a balanced tree is more important than having rows contiguous logically being contiguous physically as well (which is the point of sequential values).
Quassnoi
A: 

Particularly if you will do binary parameters into the db (through libpq for example), use bytea. If you want to do lots of manipulation through simple text queries, convert to hext and store in a text or varchar column.

PostgreSQL will of course have no problems in general with 20 byte keys, other than that the performance overhead is of course greater than with a sequence.

Magnus Hagander