I'm designing a new system to store short text messages [sic].

I'm going to identify each message by a unique identifier in the database, and use an AUTO_INCREMENT column to generate these identifiers.

Conventional wisdom says that it's okay to start with 0 and number my messages from there, but I'm concerned about the longevity of my service. If I make an external API, and make it to 2^31 messages, some people who use the API may have improperly stored my identifier in a signed 32-bit integer. At this point, they would overflow or crash or something horrible would happen. I'd like to avoid this kind of foo-pocalypse if possible.

Should I "UPDATE message SET id=2^32+1;" before I launch my service, forcing everyone to store my identifiers as signed 64-bit numbers from the start?

+3  A: 

Actually 0 can be problematic with many persistence libraries. That's because they use it as some sort of sentinel value (a substitute for NULL). Rightly or wrongly, I would avoid using 0 as a primary key value. Convention is to start at 1 and go up. With negative numbers you're likely just to confuse people for no good reason.

cletus
How about starting with 1+2^32 then? The only problem I can see there is that it reduces my available id range by 2^32 values, which likely isn't an issue in a 64-bit id space.
slacy
What cletus said. Just make the docs and wsdl (or what have you) clear that ID is an int64 and call it a day.
Wyatt Barnett
The problem is that documentation may not be enough -- I can never guarantee that every API user "does the right thing" with these numbers, and they may end up breaking past 2^31-1 if they use a signed int32, or past 2^32-1 if they use an unsigned int. I can never tell what they're doing, since I don't control their code. Twitpocalypse!
slacy
+5  A: 

If you want to achieve your goal and avoid the problems that cletus mentioned, the solution is to set your starting value to 2^32+1. There are still plenty of IDs left, and the value won't fit in a 32-bit integer, signed or otherwise.

Of course, documenting the value's range and providing guidance to your API or data customers is the only right solution. Someone's always going to try to stick a long into a char and wonder why it doesn't (always) work.
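
To illustrate why 2^32+1 cannot be held in any 32-bit field, here is a minimal Python sketch. It uses `ctypes` fixed-width integers as a stand-in for a client's 32-bit variable. That is an assumption about how a careless client behaves; real languages differ in whether they wrap silently, raise an error, or do something worse.

```python
import ctypes
import struct

START_ID = 2**32 + 1  # the proposed first identifier

# ctypes integer types do no overflow checking, so the high bits are silently dropped.
as_signed32 = ctypes.c_int32(START_ID).value
as_unsigned32 = ctypes.c_uint32(START_ID).value
as_signed64 = ctypes.c_int64(START_ID).value


def fits(fmt, value):
    """Return True if value can be packed with the given struct format."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False


if __name__ == "__main__":
    print("signed 32-bit sees:  ", as_signed32)
    print("unsigned 32-bit sees:", as_unsigned32)
    print("signed 64-bit sees:  ", as_signed64)
    print("fits in '<i':", fits("<i", START_ID))
    print("fits in '<I':", fits("<I", START_ID))
    print("fits in '<q':", fits("<q", START_ID))
```

The silent case is the dangerous one: a client that truncates would see the first message as ID 1 instead of failing loudly.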

caskey
+3  A: 

What if you provided a set of test suites or a test service that used message IDs in the "high but still valid" range, and persuaded your service users to run it to validate that their code is correct? Starting at an arbitrary value for defensive reasons is a little weird to me; providing sanity tests rubs me right.
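
A rough sketch of what such a sanity test might look like, in Python. The list of boundary IDs and the two example clients are hypothetical; a real suite would call the API user's actual parsing and storage code instead of `careless_client`.

```python
# IDs around the 32-bit and 64-bit limits (illustrative choice, not an official list).
BOUNDARY_IDS = [
    2**31 - 1,  # largest signed 32-bit value
    2**31,      # first value a signed 32-bit int cannot hold
    2**32 - 1,  # largest unsigned 32-bit value
    2**32,      # first value an unsigned 32-bit int cannot hold
    2**32 + 1,
    2**63 - 1,  # largest signed 64-bit value
]


def mangled_ids(client_roundtrip):
    """Return the boundary IDs that the client fails to hand back unchanged."""
    return [i for i in BOUNDARY_IDS if client_roundtrip(i) != i]


def careless_client(message_id):
    # Hypothetical buggy client: keeps only the low 32 bits of the ID.
    return message_id & 0xFFFFFFFF


def careful_client(message_id):
    # Hypothetical correct client: treats the ID as an arbitrary-size value.
    return int(str(message_id))


if __name__ == "__main__":
    print("careless client mangles:", mangled_ids(careless_client))
    print("careful client mangles: ", mangled_ids(careful_client))
```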

Talljoe
What if I chose the largest round number above 2^32? Say, 5000000000? Does that make you feel better?
slacy
I like the idea of a regression suite. Can someone please suggest this to the people at http://apiwiki.twitter.com
slacy
A little better. :) I once chose a large non-round number to seed a publicly-viewable ID so it looked like we had more traffic than we did. Doing it to pander to people that can't read the documentation -- less interested. ;)
Talljoe
+1  A: 

If everyone alive on the planet sent one message per second, every second, non-stop, your counter wouldn't wrap until the year 2050 using 64-bit integers.

Probably just starting at 1 would be sufficient.

(But if you did start at the lower bound, it would extend into the start of 2092.)
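
The arithmetic behind these dates can be reproduced with a few lines of Python. The figures of 7 billion people and 2009 as the starting year are taken from the comments in this thread and are assumptions of the estimate, not measured values.

```python
PEOPLE = 7_000_000_000                    # one message per person per second
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
START_YEAR = 2009                         # year used in the thread's estimate


def years_to_exhaust(id_count):
    """Years needed to use up id_count identifiers at PEOPLE messages per second."""
    return id_count / PEOPLE / SECONDS_PER_YEAR


years_signed = years_to_exhaust(2**63)    # starting at 1 in a signed 64-bit column
years_full = years_to_exhaust(2**64)      # starting at the lower bound, -2^63
seconds_to_2_31 = 2**31 / PEOPLE          # time to pass the signed 32-bit limit

if __name__ == "__main__":
    print("signed 64-bit lasts until about", START_YEAR + int(years_signed))
    print("full 64-bit range lasts until about", START_YEAR + int(years_full))
    print("2^31 is reached after", seconds_to_2_31, "seconds")
```

At the same hypothetical rate, the 2^31 limit is passed in well under a second, which is why the 32-bit boundary is the practical concern.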

lavinio
Your math sucks, lavinio. ;) 32-bit signed integers max out at ~2 billion, unsigned at ~4. There are more than 7 billion people on earth. And Twitter has already seen this happen just this week. http://www.twitpocalypse.com/
richardtallent
Using 64-bit signed integers: 2^63 ÷ 7,000,000,000 ÷ 365.25 ÷ 24 ÷ 60 ÷ 60 ≈ 41, and 2009 + 41 = 2050. (The original post mentioned 64-bit integers; that's what I went with. My math doesn't suck, my English does ;).)
lavinio
Exceeding 2^63 isn't the issue; exceeding 2^31 is. So although the 2050 math is right, 2 billion (i.e. 2^31) isn't that big of a number anymore, especially when you've got scripts generating messages, not just people.
slacy
+1  A: 

Don't want to be the next Twitter, eh? lol

If you're worried about scalability, consider using a GUID (uniqueidentifier) instead.

They are only 16 bytes (twice that of a bigint), but they can be assigned independently on multiple database or BL servers without worrying about collisions.

Since they are random, use NEWSEQUENTIALID() (in SQL Server) or a COMB technique (in your business logic or pre-MSSQL 2005 database) to ensure that each GUID is "higher" than the last one (speeds inserts into your table).
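
For the business-logic route, a COMB-style GUID might be sketched in Python as below. This is only an illustration: the `comb_guid` helper is hypothetical, and it assumes the common description of the technique, where the last six bytes hold a millisecond timestamp because SQL Server is usually described as comparing those bytes first when ordering `uniqueidentifier` values. Other databases order GUIDs differently.

```python
import os
import time
import uuid


def comb_guid(now_ms=None):
    """Build a 16-byte GUID: 10 random bytes followed by a 6-byte timestamp."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    # Keep the low 48 bits of the millisecond clock, most significant byte first.
    timestamp = (now_ms & ((1 << 48) - 1)).to_bytes(6, "big")
    return uuid.UUID(bytes=os.urandom(10) + timestamp)


if __name__ == "__main__":
    earlier = comb_guid(1_000)
    later = comb_guid(2_000)
    print(earlier, later)
    # Compared on the trailing six bytes, the later GUID sorts higher.
    print(earlier.bytes[-6:] < later.bytes[-6:])
```

Two GUIDs generated in the same millisecond share a timestamp, so their relative order is arbitrary; uniqueness still comes from the ten random bytes.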

If you start with a number that high, some "genius" programmer will either subtract 2^32 to squeeze it in an int, or will just ignore the first digit (which is "always the same" until you pass your first billion or so messages).

richardtallent
The temptation of an AUTO_INCREMENT value is very high, although I'm thinking now that maybe I'll just use a random 128-bit value for each entry. I'm not really sure that I need something as sophisticated (and heavyweight) as a GUID. The identifiers are private to my system. The problem is that what I'd really like to do is to have my RDBMS engine (MySQL) auto-assign these values.
slacy
+2  A: 

Why use incrementing IDs? These require locking and will kill any plans for distributing your service over multiple machines. I would use UUIDs. API users will likely store these as opaque character strings, which means you can probably change the scheme later if you like.

If you want to ensure that messages have an order, implement the ordering like a linked list:

---
id: 61746144-3A3A-5555-4944-3D5343414C41
msg: "Hello, world"
next: 006F6F66-0000-0000-655F-444E53000000
prev: null
posted_by: jrockway
---
id: 006F6F66-0000-0000-655F-444E53000000
msg: "This is my second message EVER!"
next: 00726162-0000-0000-655F-444E53000000
prev: 61746144-3A3A-5555-4944-3D5343414C41
posted_by: jrockway
---
id: 00726162-0000-0000-655F-444E53000000
msg: "OH HAI"
next: null
prev: 006F6F66-0000-0000-655F-444E53000000
posted_by: jrockway

(As an aside, if you are actually returning the results as YAML, you can use & and * references instead of just using the IDs as data. Then the client will get the linked-list structure "for free".)
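
As a rough sketch of how a client could recover message order from such records, the Python below follows the `next` pointers starting from the first message. The in-memory dictionary is a stand-in for whatever the API would return; the `walk` helper is hypothetical.

```python
HEAD = "61746144-3A3A-5555-4944-3D5343414C41"

# Records keyed by id, mirroring the example above (prev pointers omitted for brevity).
messages = {
    "61746144-3A3A-5555-4944-3D5343414C41": {
        "msg": "Hello, world",
        "next": "006F6F66-0000-0000-655F-444E53000000",
    },
    "006F6F66-0000-0000-655F-444E53000000": {
        "msg": "This is my second message EVER!",
        "next": "00726162-0000-0000-655F-444E53000000",
    },
    "00726162-0000-0000-655F-444E53000000": {
        "msg": "OH HAI",
        "next": None,
    },
}


def walk(records, head_id):
    """Yield message texts in posting order by following the next pointers."""
    current = head_id
    while current is not None:
        record = records[current]
        yield record["msg"]
        current = record["next"]


if __name__ == "__main__":
    for text in walk(messages, HEAD):
        print(text)
```

Because the IDs are only ever used as dictionary keys, nothing in the client depends on their numeric size or ordering.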

jrockway
A: 

One thing I don't understand is why developers don't grasp that they don't need to expose their AUTO_INCREMENT field. For example, richardtallent mentioned using GUIDs as the primary key. I say do one better. Use a 64-bit Int for your table ID/Primary Key, but also use a GUID, or something similar, as your publicly exposed ID.

An example Message table:

Name           | Data Type
-------------------------------------
Id             | BigInt - Primary Key
Code           | Guid
Message        | Text
DateCreated    | DateTime

Then your data looks like:

Id | Code                                 | Message | DateCreated
-------------------------------------------------------------------------------
1  | 81e3ab7e-dde8-4c43-b9eb-4915966cf2c4 | ....... | 2008-09-25T19:07:32-07:00
2  | c69a5ca7-f984-43dd-8884-c24c7e01720d | ....... | 2007-07-22T18:00:02-07:00
3  | dc17db92-a62a-4571-b5bf-d1619210245a | ....... | 2001-01-09T06:04:22-08:00
4  | 700910f9-a191-4f63-9e80-bdc691b0c67f | ....... | 2004-08-06T15:44:04-07:00
5  | 3b094cf9-f6ab-458e-965d-8bda6afeb54d | ....... | 2005-07-16T18:10:51-07:00

Where Code is what you would expose to the public whether it be a URL, Service, CSV, Xml, etc.
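
A minimal sketch of this two-key layout, using Python's built-in `sqlite3` module as a stand-in for MySQL (an assumption made so the example is self-contained; column types and auto-increment syntax differ between engines). The helper names `post_message` and `fetch_message` are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE message (
        id           INTEGER PRIMARY KEY,   -- internal key, never exposed
        code         TEXT NOT NULL UNIQUE,  -- public identifier
        message      TEXT NOT NULL,
        date_created TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)


def post_message(text):
    """Store a message and return only its public code."""
    code = str(uuid.uuid4())
    conn.execute("INSERT INTO message (code, message) VALUES (?, ?)", (code, text))
    return code


def fetch_message(code):
    """Look a message up by its public code; the numeric id is never accepted."""
    row = conn.execute(
        "SELECT message FROM message WHERE code = ?", (code,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    public_code = post_message("Hello, world")
    print(public_code, "->", fetch_message(public_code))
```

Since API users only ever see the code, the width of the internal `id` column can change later without breaking them.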

Jordan S. Jones
What is the point of an ID column in your example?
jrockway
It's still used internally for foreign keys, and it could still be used in internal applications. The whole point is that you don't have to expose it publicly.
Jordan S. Jones
The thing I don't like about the GUID-based ideas is that I would likely be exposing these large and unwieldy numbers in my URLs, i.e. http://mysite/message/3b094cf9-f6ab-458e-965d-8bda6afeb54d instead of http://mysite/message/5. I like the latter, although once you do get into the billions, there's not a huge difference between the two schemes.
slacy
Personally, I'm not much of a GUID fan, for the same reasons as you. In the past I've used a zero-padded string that combined a random number and a sequential number with a character prefix. For example, MSG003240001 or MSG008290002.
Jordan S. Jones
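
One possible reading of that last scheme, sketched in Python. The split into a five-digit random part followed by a four-digit sequential part is a guess based on the two sample codes, and `make_public_code` is a hypothetical helper, not the commenter's actual implementation.

```python
import secrets


def make_public_code(sequence_number, random_part=None, prefix="MSG"):
    """Build a code like MSG003240001: prefix, 5 random digits, 4 sequence digits."""
    if random_part is None:
        random_part = secrets.randbelow(100_000)
    return f"{prefix}{random_part:05d}{sequence_number:04d}"


if __name__ == "__main__":
    print(make_public_code(1, random_part=324))
    print(make_public_code(2, random_part=829))
    print(make_public_code(3))
```

With this layout, sequence numbers above 9999 make the string longer, so clients should treat the code as an opaque string and not assume a fixed width.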