Normalization of Strings With String.ToUpperInvariant()

views:

1517

answers:

+5 Q:

Normalization of Strings With String.ToUpperInvariant()

I am currently storing normalized versions of strings in my SQL Server database in lower case. For example, in my Users table, I have a UserName and a LoweredUserName field. Depending on the context, I either use T-SQL's LOWER() function or C#'s String.ToLower() method to generate the lower case version of the user name to fill the LoweredUserName field. According to Microsoft's guidelines and Visual Studio's code analysis rule CA1308, I should be using C#'s String.ToUpperInvariant() instead of ToLower(). According to Microsoft, this is both a performance and globalization issue: converting to upper case is safe, while converting to lower case can cause a loss of information (for example, the Turkish 'I' problem).

If I move to using ToUpperInvariant for string normalization, I will have to change my database schema as well, since my schema is based on Microsoft's ASP.NET Membership framework (see this related question), which normalizes strings to lower case.

Isn't Microsoft contradicting itself by telling us to use upper case normalization in C#, while it's own code in the Membership tables and procedures is using lower case normalization? Should I switch everything to upper case normalization, or just continue using lower case normalization?

+3 A:

To answer your first question, yes Microsoft is a bit inconsistent. To answer your second question, no do not switch anything until you have confirmed that this is causing a bottleneck in your application.

Think how much forward progress you can make on you project instead of wasting time switching everything. Your development time is much more valuable than the savings you would get from such a change.

Remember:

Premature optimization is the root of all evil (or at least most of it) in programming. - Donald Knuth

Andrew Hare 2009-04-21 17:38:03

This isn't just a performance issue, it is also a globalization issue. According to Microsoft, converting to upper case is safe, while converting to lower case can cause a loss of information (for example, in Turkish 'I' problem).

Kevin Albrecht 2009-04-21 17:43:19

@Kevin, the Turkish/Azeri dotless I problem remains a special case whichever approach is used (they uppercase i to İ and ı to I), though lowercasing is ambiguous for SS (should it be ss or ß) but that's imperfect too (some orthographies still uppercase ß to SZ). It is still better though. Better still is to use Unicode case-folding rules with a Turkic switch for i and ı, but it still won't be perfect, that can only be per locale :(

Jon Hanna 2010-10-27 23:26:05

+3 A:

According to CA1308, the reason to do this is that some characters cannot be roundtrip converted from upper to lower case. The important thing is that you always move in one direction, so if your standard is to always move to lower case then there is no reason to change it.

JoshBerke 2009-04-21 17:44:09

I like this approach. If starting from scratch, following the recommended standard is always the best practise in light of no other motivation to do otherwise, but when working on existing maintenance, it's often folly just to switch because it says so. You need compelling evidence that your project will benefit from the change before embarking on such an overhaul - maybe when you start processing Turkish and hit a problem?

Jeff Yates 2009-04-21 17:50:48

I totally agree, Jeff, there are some guidance which you should follow and I would say might be worth upgrading existing code to follow (making sure you dispose your data reader for example). This however isn't one of those rules nor is it one that is even close.

JoshBerke 2009-04-21 18:08:25

Continue using lower case normalization. Only change to conform to Microsoft standards if a large issue develops.

This is unfortunate, but worthwhile. Sadly, Microsoft "standards" tend to be poorly considered and somewhat less than consistent; experience with them has shown that unless there is a compelling reason, it's best to simply stick with what works while it works. Note that this is generally NOT true of non-Microsoft technologies; but the arbitrariness of the Microsoft "standards" makes them worth avoiding.

Edit: I should clarify here; my opinion of Microsoft is very low, from long experience with their standards. As was pointed out in the comments, I don't have particular references to point out about "everybody else other than Microsoft"; this just comes from my personal experience. Your Mileage May Vary widely. This answer should be considered really just my opinion. Sorry for not making that more clear earlier.

McWafflestix 2009-04-21 17:44:14

I think you need to cite some sources before making claims of "everyone but Microsoft" when it comes to standards. In recent years, Microsoft seem to take great care in researching the motivations behind their standards and while their implementation of web standards in IE has been far from ideal, the standards they define for us to work within their products are often excellent. Please, back up your statements, lest they be interpreted as bitter opinion.

Jeff Yates 2009-04-21 17:49:00

I agree Jeff, their standards are very consistent their adoption of their standards are less so but this is expected, code that was written prior to a standard being adopted won't be updated just to bring it into adherance, Imagine if they changed all their name spaces to reflect their new approach for picking namespaces and all the developers who would scream bloody murder.

JoshBerke 2009-04-21 17:52:02

Your points are both good; my position does in fact come from pretty much bitter opinion and lots and lots of crappy experience with Microsoft. I'll update to reflect that.

McWafflestix 2009-04-21 18:53:23

ansaurus

tags:

views:

answers:

Normalization of Strings With String.ToUpperInvariant()

related questions