views:

104

answers:

6

I have this application that uses natural primary keys. The database uses the WE8ISO8859P15 character set. So in my table City whe have primary keys like 'MEDELLÍN' and 'MÜNCHEN'. I have a hunch we are going to have a lot of trouble with this.

The problems I see

  • Interfacing this data to databases with another character set. I don't want character set conversion on my primary key
  • Dumping the data to files and processing these files we always have to very aware of the special characters and the client settings

Should we allow diacritics in the PK? Please feel free to give your opinion.

+2  A: 

Like you, I feel it would be really looking for problems to allow them.

In addition to the problems you mention, it could be:

  • Imagine switching to another database vendor ...


I don't know if introducing a surrogate primary key is an option for you, but that could be the correct timing to do so ;-) ...

If not, you could duplicate the column :

  1. the pk column would not be case sensitive, not have special characters and so on ...
  2. an additional column would preserve what was entered by the user, to show it nicely in some UI...
KLE
This is an application not devloped in house so I can't alter the data model. I used to be on the 'natural key' camp. But recent problems made me switch to the 'surrogate key' camp. For instance : the same application allowed users to alter fields that are part of the primary key, breaking the link to records interfaced to other systems.
Robert Merkwürdigeliebe
+1 Never use natural keys as PK unless you want to be in trouble later.
Aaron Digulla
Switching to another DB vendor is not an issue. All relevant DBs today support Unicode, and Unciode is a superset of all ISO8859 character sets.
MSalters
The character code used isn't Unicode (BTW Unicode seem a good option, I voted for your answer). I am concerned that, for another encoding, some database mendors have some differences... Even though software vendors say they support everything, I have often got bitten, once you look a bit deeper than the surface. "The Devil lies in the details".
KLE
+4  A: 

Why not? You DB model is broken beyond repair already, so why not introduce another source of problems? ;)

More seriously, databases are getting better at supporting Unicode, so there is no problem with storing natural text (with all it's oddities). Your issue is "primary key". There are several ways in which the same text can be encoded (for example, you can have accented characters or diacritics with plain characters). This means you can get two different keys for the same text.

There are a lot of wrong reasons to use business keys as PK and no good ones. Don't do it. Bite the bullet and fix it. Fix it now. It will cost you less (even if it costs a lot) than not fixing it.

Aaron Digulla
+1 thanks for useful additionnal details
KLE
>Why not? You DB model is broken beyond repair already, so why not introduce another source of problems? ;)I don't see what's funny ;)>There are several ways in which the same text can be encoded (for example, you can have accented characters or diacritics with plain characters).I didn't know that, good point.>Fix it now. It will cost you less (even if it costs a lot) than not fixing it.I'll ask the software company for a fix. But the whle application is full of it :(
Robert Merkwürdigeliebe
My comment is meant to be funny; your situation certainly isn.t As for the being afraid to touch it: This will hurt you until you fix it. And the hurting will get worse every day. It will cost you a lot of money and tears to maintain it. Don't do it. Sit down for a week and fix it. It's worth it.
Aaron Digulla
+2  A: 

Yes you will have problems with those characters. Leaving ASCII always causes problems. But when you do business not only in britain and the US, you don't have a choice.

I don't see special character set related problems for the Primary Key. If you export, import, interface or migrate you'll have to take these characters into account no matter if they are part of your PK or not.

But they do emphasize the problem of a natural key as primary key. It seems to be extremely likely that someone will write e.g. Muenchen just to later change it to München, which of course will cause the well known problem of updates on PK.

Jens Schauder
+4  A: 

Trying to ignore diacritics is just delaying the inevitable. Yes, you could save some issues in Eastern Europe. But you still can't deal with Greek city names. You'd need Unicode, and then there's no point anymore in misspelling Munchen/Muenchen; it's München.

That said, the entire notion that there's a single name for a city already breaks in Brussel aka Bruxelles, and that's Western Europe. So, they're fundamentally unsuitable for primary keys, no matter how you'd spell them.

MSalters
+1  A: 

Whether your attribute is (part of) a key or not has nothing to do with the issue.

You have issues of character set conversion with ANY data traffic to/from this attribute anyway, regardless of whether it's a key or not.

Yes, in order to encode "correctly", and have the best possible guarantee that your data will never get corrupted because of character set conversion issues, you need the Unicode character set and one of its encodings.

I do have some serious doubts about the table itself, incidentally. What do you do with Heidelberg, Germany and Heidelberg, South Africa ? Oxford, UK and Oxford, US, where there's even hardly a state without one ?

What kind of information depends on that key ? If there is none at all, then your table is more of a "variable type" than it is a "genuine table". In that case, you might just as well forget the table and make your cityname attributes just plain String.

If you are really required to produce some "canonical spellings" for citynames when exporting data from the database, then I'd advise to try and set up a "phonetic search table" in which "commonly used spellings" are linked to the "canonical spelling" you are required to produce. Expect a serious effort in getting such tables populated, however.

In that case, then in addition to the already mentioned München/Muenchen and Western/Greek alphabet issues, don't forget about the Liège/Luik/Lüttich (München/Munich) kind of issues.

Erwin Smout
I understand your doubts about the table itself. But the example I provided is a simplification of reality. The issue is that (a part of) the PK field contains (part) of the name of a university. Sorry for the confusion.
Robert Merkwürdigeliebe
A: 

Things change their names, or have their names changed for them. Cities, Universities, Parks, People .. all unsuitable as Primary Keys. Unique Key, maybe? Or part of a Unique Key?

David Aldridge