Let's say I've got a database full of music artists. Consider the following artists:

The Beatles - "The" is officially part of the name, but we don't want to sort it with the "T"s if we are alphabetizing. We can't easily store it as "Beatles, The" because then we can't search for it properly.

Beyoncé - We need to let the user search for "Beyonce" (without the diacritical mark) and get the proper results back. No user is going to know how, or take the time, to type the special diacritical character on the last "e" when searching, yet we obviously want to display it correctly when we output it.

What is the best way around these problems? It seems wasteful to keep an "official name", a "search name", and a "sort name" in the database since a very large majority of entries will all be exactly the same, but I can't think of any other options.

+2  A: 

The library science folks have a standard answer for this. The ALA Filing Rules cover all of these cases in a perfectly standard way.

You're talking about the grammatical sort order. This is a debatable topic. Some folks would take issue with your position.

Generally, you transform the title to a normalized filing form, such as "Beatles, The", and leave it that way. Then you sort on that form.
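As a rough illustration of that transformation, here is a minimal Python sketch that derives a filing form from the display name instead of storing it by hand. The function name `sort_name` and the `LEADING_ARTICLES` list are made up for this example, and the article list covers English only; real filing rules handle far more cases than this.

```python
# Hypothetical list of leading articles to move to the end (English only).
LEADING_ARTICLES = ("the", "a", "an")

def sort_name(name: str) -> str:
    """Return a filing form such as 'Beatles, The' for 'The Beatles'."""
    first, sep, rest = name.partition(" ")
    # Only move the first word if it is an article and something follows it.
    if sep and rest and first.lower() in LEADING_ARTICLES:
        return f"{rest}, {first}"
    return name

artists = ["The Beatles", "Beyoncé", "A Tribe Called Quest", "Adele"]

# Sort on the filing form, but keep the official name for display.
sorted_artists = sorted(artists, key=lambda n: sort_name(n).casefold())
```

Because the filing form is computed from the official name, only the exceptions (names the rule gets wrong) would need a manually stored override.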

You can read about cataloging rules here: http://en.wikipedia.org/wiki/Library_catalog#Cataloging_rules

For "extended" characters, you have several choices. For some folks, é is a first-class letter and the diacritical is part of it. They aren't confused. For other folks, all of the diacritical characters map onto unadorned characters. This mapping is a feature of some Unicode processing tools.
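One common way to do that mapping, sketched here in Python with the standard `unicodedata` module: decompose each character (NFD) and drop the combining marks. The function name `search_name` is invented for this example. This is only an approximation, since letters with no Unicode decomposition (for example "ø") pass through unchanged.

```python
import unicodedata

def search_name(name: str) -> str:
    """Return a diacritic-free, case-folded form for matching search input."""
    # NFD splits 'é' into 'e' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", name)
    # Drop the combining marks and keep the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

# Apply the same function to the stored name and to the user's query.
matches = search_name("Beyoncé") == search_name("Beyonce")
```

The official name stays untouched for display; the stripped form is used only for comparison or as an indexed search column.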

You can read about Unicode diacritical stripping here: http://lexsrv3.nlm.nih.gov/SPECIALIST/Projects/lvg/current/docs/designDoc/UDF/unicode/NormOperations/stripDiacritics.html

http://blogs.msdn.com/michkap/archive/2005/02/19/376617.aspx

S.Lott