I am planning a database to store lots of text (blog posts, news articles, etc.). The database needs to have title, content (50k characters max), date, link and language fields. The same content can't appear twice for the same link. Old content (older than 30 days, for example) will be deleted.

Now, the problem is the primary key. I could just set an automatically incrementing (SERIAL type) field and use it as a primary key. But it seems stupid and a waste of disk space, because the field wouldn't serve any purpose but to be a primary key (and the field could eventually run out, or not?). And there's always the other performance issue: the content of each new row inserted needs to be checked for duplicates. So the other solution for the primary key I've come up with would be to calculate a SHA-256 hash of the content+link value, put that in a new 'hash' column and use it as the primary key. Two birds with one stone. Of course, the problem with that is hash collisions. Are they a big threat?

I don't have any experience with PostgreSQL, and very little experience with DBMSs generally, so I would appreciate a second opinion before I create a database with the performance characteristics of a snail on the highway (horrible comparison).

Please help me out here if you have any experience with large databases. Is setting a 64-character string as the primary key field a good idea in my situation? (Because I'm under the impression that this is generally avoided.)
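
To make this concrete, here's a rough sketch of the table as I'm currently imagining it, with the hash as the primary key (the table name and column types are just placeholders, nothing is settled):

    -- Sketch only; table name and types are placeholders.
    CREATE TABLE articles (
        hash      char(64)       PRIMARY KEY,   -- hex-encoded sha256(content || link)
        title     text           NOT NULL,
        content   varchar(50000) NOT NULL,      -- 50k character cap
        date      timestamptz    NOT NULL DEFAULT now(),
        link      text           NOT NULL,
        language  varchar(8)     NOT NULL
    );

    -- Old content would be purged periodically, e.g.:
    DELETE FROM articles WHERE date < now() - interval '30 days';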

+1  A: 

Some suggestions:

  • The disk storage of a 64-bit primary-key integer is negligible no matter how much content you have.
  • You will not realistically ever see a SHA-256 collision, so using the hash as a unique ID isn't a bad idea.

One nice thing about the hash method is that you don't have a single sequence source to generate new primary keys. This can be useful if your database needs to be segmented in some manner (say geographical distribution) for future scaling, as you don't have to worry about collisions, or a single-point-of-failure that generates sequences.
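
For example, the hash can be computed wherever the insert happens, with no central sequence to coordinate. Here is a sketch of doing it inside PostgreSQL itself, assuming the pgcrypto extension is available (you could just as well compute the SHA-256 in application code); the values and the articles layout follow the placeholder sketch in the question:

    -- Sketch only; assumes the 'articles' layout sketched in the question
    -- and the pgcrypto extension for digest().
    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    INSERT INTO articles (hash, title, content, date, link, language)
    VALUES (
        encode(digest('some content' || 'http://example.com/post', 'sha256'), 'hex'),
        'Some title',
        'some content',
        now(),
        'http://example.com/post',
        'en'
    );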

From a coding perspective, having a single primary key can be vital for joining on extra tables of data you may add in the future. I highly recommend you use one. There are benefits to either of your proposed approaches, but the hash method might be the preferred one, just because autoincrement/sequence values can cause scalability issues sometimes.

bmdhacks
If you use SHA-256 as a primary key, doesn't it have to be immutable? What happens if the content+link value changes?
Robert Harvey
Harvey, good point, but if I were to change the content value of a row, I would first check whether a hash of the new content+link already exists. If so, no change would happen.
KRTac
So if the content changes, you create a new record?
Robert Harvey
No, I was going to say that the hash would be changed, but that was before reading Lars' answer.
KRTac
Lars makes a good point. While I still think an algorithmically generated key is better than a sequence, having to change the primary key is bad.
bmdhacks
A: 

Hashes are bad ideas for primary keys. They make inserts land in random order in the table, which gets very costly when things have to be reallocated (though Postgres doesn't really suffer from that the way other databases do). I suggest a sequential primary key (for example a fine-grained timestamp, or a timestamp followed by a sequence number, which kills two birds with one stone), plus a second unique index that contains your hash codes. Keep in mind that you want your primary key to be a smaller (64-bit or less) column.
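
A minimal sketch of that layout, using bigserial as a stand-in for the timestamp-plus-sequence idea (names are placeholders, not from the question):

    -- Sketch only: small sequential surrogate key plus a unique index on the hash.
    CREATE TABLE articles (
        id        bigserial      PRIMARY KEY,   -- small, sequential, 64-bit
        hash      char(64)       NOT NULL,      -- hex-encoded sha256(content || link)
        title     text           NOT NULL,
        content   varchar(50000) NOT NULL,
        date      timestamptz    NOT NULL DEFAULT now(),
        link      text           NOT NULL,
        language  varchar(8)     NOT NULL,
        CONSTRAINT articles_hash_key UNIQUE (hash)  -- duplicate check happens here
    );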

See the table at http://en.wikipedia.org/wiki/Birthday_attack#The_mathematics so you can be confident you won't have a collision.

Don't forget to vacuum.

Autocracy
+1  A: 

You'd have to have an awful lot of records before your primary key integer ran out.

The integer will be faster for joins than a 64-character string primary key would be. It is also much easier for people writing queries to deal with.

If a collision is ever possible, you can't use the hash as your primary key. Primary keys must be guaranteed to be unique by definition.

I've seen hundreds of production databases for different corporations and government entities and not one used a hash primary key. Think there might be a reason?

But it seems stupid and a waste of disk space, because the field wouldn't serve any purpose but to be a primary key.

Since a surrogate primary key should always be meaningless except as a primary key, I'm not sure what your objection would be.

HLGEM
I'm planning on having a LOT of rows in the table. Old rows will be deleted, but the sequence will keep incrementing, so when the serial field reaches 2147483647 I can't get any more rows into the table, even though the table could be half empty.
KRTac
You're referring to a 32-bit signed integer. For a 64-bit signed integer, the maximum value is 9,223,372,036,854,775,807. I doubt you would ever exhaust that.
Robert Harvey
Would it be faster than the hash, even though it's a 64-bit int?
KRTac
Almost certainly. The hash is four times the size (256 bits).
Robert Harvey
My 2 cents: make the primary key a 64-bit int, and add another column for your hash.
Robert Harvey
Yeah, I think it'll be that way. I'll wait to see if a friend of a friend gives some useful advice, and then decide what's the best thing to do. Thanks.
KRTac
+1  A: 

I would choose to use a surrogate key, i.e. a key that isn't part of the business data of your application. The additional space requirement of an extra 64-bit integer is negligible when you're dealing with up to 50 kilobytes of text per record. You will actually use less space as soon as you start using this key as a foreign key in other tables.

A hash of the data stored in a record is a very bad candidate for a primary key if the data on which the hash is based can ever change. You would then have changed the primary key as well, resulting in updates all over the place if other tables have relations to this one.
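
To illustrate the update problem, here's a hypothetical child table (article_tags is made up for this example) referencing a bigint surrogate key like the one sketched above:

    -- Sketch only: with a surrogate key, editing an article's content never
    -- touches the value other tables reference.
    CREATE TABLE article_tags (
        article_id bigint NOT NULL REFERENCES articles (id) ON DELETE CASCADE,
        tag        text   NOT NULL,
        PRIMARY KEY (article_id, tag)
    );
    -- Had the primary key been sha256(content || link), any content edit would
    -- change the key and force matching updates in article_tags as well.

As a side benefit, the ON DELETE CASCADE means the periodic 30-day cleanup of old articles would take the dependent rows with it.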

PS. A similar question has been asked and answered here before.

Here's another nice write-up about the topic: http://www.agiledata.org/essays/keys.html

Lars Haugseth
Great advice, and I agree with most of it. But, as I mentioned earlier, what happens when the surrogate key reaches 2147483647? I could use bigserial, but what is the performance hit on that?
KRTac
In this context (storing blog posts and news articles) I find it very unlikely that you'd ever reach 2^31 items. Even so, if you want to plan ahead and play it safe and use bigserial, I'd say the performance hit will be minimal, especially compared to using a CHAR(32) as the PK.
Lars Haugseth
A: 

I just did this exact test for a rather medium-large DB (200GB+), and bigserial won by quite a large margin. It was faster to generate, faster to join, less code, smaller footprint. Because of the way Postgres stores it, a bigint's extra size is negligible compared to a normal int. You'll run out of storage space for your content long before you ever have to worry about overflowing the bigint. Having done the computed hash vs. bigint comparison: surrogate bigint all the way.

rfusca