tags:

views:

7139

answers:

12

I have a table that was imported as all UPPER CASE and I would like to turn it into Proper Case. What script have any of you used to complete this?

Thank you.

+3  A: 

Here's a UDF that will do the trick...

create function ProperCase(@Text as varchar(8000))
returns varchar(8000)
as
begin
   declare @Reset bit;
   declare @Ret varchar(8000);
   declare @i int;
   declare @c char(1);

   select @Reset = 1, @i=1, @Ret = '';

   while (@i <= len(@Text))
    select @c= substring(@Text,@i,1),
               @Ret = @Ret + case when @Reset=1 then UPPER(@c) else LOWER(@c) end,
               @Reset = case when @c like '[a-zA-Z]' then 0 else 1 end,
               @i = @i +1
   return @Ret
end

You will still have to use it to update your data though.

Galwegian
This will blow up for non-english input.
Tomalak
+2  A: 

Just keep in mind that properly changing upper-case text to proper-case text may require manual corrections in some, well, cases. With names, for example: I do not appreciate applications that misspell my name.

Dave DuPlantis
There will be no function on earth that would get 'DAVE DUPLANTIS' right. Uppercasing data is a big WTF itself, because most of the time it is merely a presentational issue.
Tomalak
I know a Macdonald who becomes furious when he's styled as MacDonald. And I'd appreciate the proper casing of O'Keefe, too, please.
DOK
@Tomalak: quite right, which is why you should accept mixed-case data and preserve it so when the choice is yours. Totally agree about the WTF part ... particularly if you accept "international" characters.
Dave DuPlantis
It is a cultural issue, too. In my experience, the British and the French are into the habit of uppercasing surnames wherever they get a chance to. I just don't get it, it's adding no value.
Tomalak
+8  A: 

This function:

  • "Proper Cases" all "UPPER CASE" words that are delimited by white space
  • leaves "lower case words" alone
  • is working properly even for non-English alphabets
  • is portable in that it does not use fancy features of recent SQL server versions
  • you can easily change it to use NCHAR and NVARCHAR for unicode support,as well as any parameter length you see fit
  • white space definition is configurable

.

CREATE FUNCTION ToProperCase(@string VARCHAR(255)) RETURNS VARCHAR(255)
AS
BEGIN
  DECLARE @i INT           -- index
  DECLARE @l INT           -- input length
  DECLARE @c NCHAR(1)      -- current char
  DECLARE @f INT           -- first letter flag (1/0)
  DECLARE @o VARCHAR(255)  -- output string
  DECLARE @w VARCHAR(10)   -- characters considered as white space

  SET @w = '[' + CHAR(13) + CHAR(10) + CHAR(9) + CHAR(160) + ' ' + ']'
  SET @i = 0
  SET @l = LEN(@string)
  SET @f = 1
  SET @o = ''

  WHILE @i <= @l
  BEGIN
    SET @c = SUBSTRING(@string, @i, 1)
    IF @f = 1 
    BEGIN
     SET @o = @o + @c
     SET @f = 0
    END
    ELSE
    BEGIN
     SET @o = @o + LOWER(@c)
    END

    IF @c LIKE @w SET @f = 1

    SET @i = @i + 1
  END

  RETURN @o
END

Result:

dbo.ToProperCase('ALL UPPER CASE and    SOME lower ÄÄ ÖÖ ÜÜ ÉÉ ØØ ĈĈ ÆÆ')
-----------------------------------------------------------------
All Upper Case and   Some lower Ää Öö Üü Éé Øø Cc Ææ
Tomalak
This is definitely the most international friendly solution. It has my vote. The only assumption here is that a space separates the words.
Cervo
True. The question author has not been specific on that, but it would be easy to allow other white space as well. Line breaks should also be accounted for, but this also depends on the input. Thanks for the vote. :-)
Tomalak
Could it be the index should start at 1? The first substring( ,0,1) returns <empty>. I'm running sqlserver2005
jan
+1  A: 

I think you will find that the following is more efficient:

IF OBJECT_ID('dbo.ProperCase') IS NOT NULL
    DROP FUNCTION dbo.ProperCase
GO
CREATE FUNCTION dbo.PROPERCASE (
    @str VARCHAR(8000))
RETURNS VARCHAR(8000)
AS
BEGIN
    SET @str = ' ' + @str
    SET @str = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE( @str, ' a', ' A'), ' b', ' B'), ' c', ' C'), ' d', ' D'), ' e', ' E'), ' f', ' F'), ' g', ' G'), ' h', ' H'), ' i', ' I'), ' j', ' J'), ' k', ' K'), ' l', ' L'), ' m', ' M'), ' n', ' N'), ' o', ' O'), ' p', ' P'), ' q', ' Q'), ' r', ' R'), ' s', ' S'), ' t', ' T'), ' u', ' U'), ' v', ' V'), ' w', ' W'), ' x', ' X'), ' y', ' Y'), ' z', ' Z')
    RETURN RIGHT(@str, LEN(@str) - 1)
END
GO

The replace statement could be cut and pasted directly into a SQL query. It is ultra ugly, however by replacing @str with the column you are interested in, you will not pay a price for an implicit cursor like you will with the udfs thus posted. I find that even using my UDF it is much more efficient.

Oh and instead of generating the replace statement by hand use this:

-- Code Generator for expression
DECLARE @x  INT,
    @c CHAR(1),
    @sql VARCHAR(8000)
SET @x = 0
SET @sql = '@str' -- actual variable/column you want to replace
WHILE @x < 26
BEGIN
    SET @c = CHAR(ASCII('a') + @x)
    SET @sql = 'REPLACE(' + @sql + ', '' ' + @c+  ''', '' ' + UPPER(@c) + ''')'
    SET @x = @x + 1
END
PRINT @sql

Anyway it depends on the number of rows. I wish you could just do s/\b([a-z])/uc $1/, but oh well we work with the tools we have.

NOTE you would have to use this as you would have to use it as....SELECT dbo.ProperCase(LOWER(column)) since the column is in uppercase. It actually works pretty fast on my table of 5,000 entries (not even one second) even with the lower.

In response to the flurry of comments regarding internationalization I present the following implementation that handles every ascii character relying only on SQL Server's Implementation of upper and lower. Remember, the variables we are using here are VARCHAR which means that they can only hold ASCII values. In order to use further international alphabets, you have to use NVARCHAR. The logic would be similar but you would need to use UNICODE and NCHAR in place of ASCII AND CHAR and the replace statement would be much more huge....

-- Code Generator for expression
DECLARE @x  INT,
    @c CHAR(1),
    @sql VARCHAR(8000),
    @count INT
SEt @x = 0
SET @count = 0
SET @sql = '@str' -- actual variable you want to replace
WHILE @x < 256
BEGIN
    SET @c = CHAR(@x)
    -- Only generate replacement expression for characters where upper and lowercase differ
    IF @x = ASCII(LOWER(@c)) AND @x != ASCII(UPPER(@c))
    BEGIN
     SET @sql = 'REPLACE(' + @sql + ', '' ' + @c+  ''', '' ' + UPPER(@c) + ''')'
     SET @count = @count + 1
    END
    SET @x = @x + 1
END
PRINT @sql
PRINT 'Total characters substituted: ' + CONVERT(VARCHAR(255), @count)

Basically the premise of the my method is trading pre-computing for efficiency. The full ASCII implementation is as follows:

IF OBJECT_ID('dbo.ProperCase') IS NOT NULL
    DROP FUNCTION dbo.ProperCase
GO
CREATE FUNCTION dbo.PROPERCASE (
    @str VARCHAR(8000))
RETURNS VARCHAR(8000)
AS
BEGIN
    SET @str = ' ' + @str
SET @str =     REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@str, ' a', ' A'), ' b', ' B'), ' c', ' C'), ' d', ' D'), ' e', ' E'), ' f', ' F'), ' g', ' G'), ' h', ' H'), ' i', ' I'), ' j', ' J'), ' k', ' K'), ' l', ' L'), ' m', ' M'), ' n', ' N'), ' o', ' O'), ' p', ' P'), ' q', ' Q'), ' r', ' R'), ' s', ' S'), ' t', ' T'), ' u', ' U'), ' v', ' V'), ' w', ' W'), ' x', ' X'), ' y', ' Y'), ' z', ' Z'), ' š', ' Š'), ' œ', ' Œ'), ' ž', ' Ž'), ' à', ' À'), ' á', ' Á'), ' â', ' Â'), ' ã', ' Ã'), ' ä', ' Ä'), ' å', ' Å'), ' æ', ' Æ'), ' ç', ' Ç'), ' è', ' È'), ' é', ' É'), ' ê', ' Ê'), ' ë', ' Ë'), ' ì', ' Ì'), ' í', ' Í'), ' î', ' Î'), ' ï', ' Ï'), ' ð', ' Ð'), ' ñ', ' Ñ'), ' ò', ' Ò'), ' ó', ' Ó'), ' ô', ' Ô'), ' õ', ' Õ'), ' ö', ' Ö'), ' ø', ' Ø'), ' ù', ' Ù'), ' ú', ' Ú'), ' û', ' Û'), ' ü', ' Ü'), ' ý', ' Ý'), ' þ', ' Þ'), ' ÿ', ' Ÿ')
    RETURN RIGHT(@str, LEN(@str) - 1)
END
GO
Cervo
Yeah. Your alphabet has 26 characters only. Mine has more. How about Greek? Or Turkish?
Tomalak
I would argue that the other solutions do the same. But I have included the code generator. You can just add additional sections for your alphabet. You would have to have everything as NVARCHAR and use NCHAR and UNICODE to convert.
Cervo
I think that the more characters you have the less efficient it will get. But heck in some character sets each symbol represents an entire word so the whole proper case problem doesn't apply....
Cervo
Also I doubt anyone uses every unicode character at once. So probably just code generate the replace for the 2 or 3 alphabets you use.
Cervo
That doesn't mean you should use half-baked English-only solutions just because you expect the data "to be in your alphabet, probably". As far as the Latin writing system goes, there is lower/upper case defined for (almost) any character. Conversion functions are more about correctness than speed.
Tomalak
I will admit Tomalak your solution is better in that it lowercases all the characters except the first one in the word which is more portable. The other solution voted as the accepted answer seems to be english centric as well. In fact I need to vote yours up.
Cervo
But for the 99% of people who use only their own alphabet, it is better just to use a set based approach, especially if they have millions of records.
Cervo
IMHO, *that* is a dangerous misconception. Most of the time, one's own "view of the world" is incomplete. You end up creating solutions that are as limited as your experience/imagination, and that blow up when you encounter a case you have not thought of.
Tomalak
For example, I would never ever come up with my own implementation of UPPER() or LOWER(), because I know it will be wrong or incomplete. Even if my implementation was three orders of magnitude faster for ASCII characters than that of SQL Server.
Tomalak
My new generator depends entirely on SQL Server's Implementation of UPPER and LOWER along with the fact that there are 255 ASCII characters. For UNICODE you need to use unicode functions and the unicode upper limit (maybe not practical)...but here the ascii chars shouldn't change mostly.
Cervo
If they do change by applying a SQL Server patch, just rerun the code generator and re-create the function.
Cervo
How would you know? :-)
Tomalak
BTW: Your second approach accounts for all the 255 characters *in one code page*. How about the other code pages?
Tomalak
I'm going to have to bow out now (no time). But my guess is that you could generate one expression per code page and do an if statement based on it. It is complicated but it is probably just another loop. Assuming there is a way to check the code page a variable uses. The approach is the same
Cervo
Just with more options and if statements so as not to do all that replacing in one super replace statements... It is getting more complicated but the main approach is to trade storage (code storage) for speed as opposed to doing it all dynamically..
Cervo
You're still not going to make MacDonald or DuPlantis or O'Keefe happy.
DOK
Then again, nobody will, because once broken ("DUPLANTIS") *this* is impossible to fix.
Tomalak
+1  A: 

Tony Rogerson has a great post on this: link

And Galwegian looks to have posted a tribute to Jeff's function posted at SQLTeam: link

Nathan Skerl
A: 

Is it too late to go back and get the un-uppercased data?

The von Neumann's, McCain's, DeGuzman's, and the Johnson-Smith's of your client base may not like the result of your processing...

Also, I'm guessing that this is intended to be a one-time upgrade of the data? It might be easier to export, filter/modify, and re-import the corrected names into the db, and then you can use non-SQL approaches to name fixing...

Toybuilder
The whole naming issue came up as possible downsides where discussed - there is no indication that the question author refers to data that contains names.
Tomalak
+1  A: 

The link I posted above is a great option that addresses the main issue: that we can never programmatically account for all cases (Smith-Jones, von Haussen, John Smith M.D.), at least not in an elegant manner. Tony introduces the concept of an exception / break character to deal with these cases. Anyways, building on Cervo's idea (upper all lower chars preceded by space), the replace statements could be wrapped up in a single table based replace instead. Really, any low/up character combination could be inserted into @alpha and the statement would not change:

declare @str    nvarchar(8000)
declare @alpha  table (low nchar(1), up nchar(1))


set @str = 'ALL UPPER CASE and    SOME lower ÄÄ ÖÖ ÜÜ ÉÉ ØØ ĈĈ ÆÆ'

-- stage the alpha (needs number table)
insert into @alpha
    -- A-Z / a-z
    select      nchar(n+32),
                nchar(n)
    from        dbo.Number
    where       n between 65 and 90 or
       n between 192 and 223

-- append space at start of str
set @str = lower(' ' + @str)

-- upper all lower case chars preceded by space
select  @str = replace(@str, ' ' + low, ' ' + up) 
from    @Alpha

select @str
Nathan Skerl
Yet another US-ASCII only solution.
Tomalak
Do you still view this as US-centric solution as it references unicode chars. I know the original posting used A-Z ascii resultset, but the point of the solution is that its a table driven pairing of upper:lower chars. The replace statement just refers to the table.
Nathan Skerl
edit: I updated example to use your sample input. Thanks for any feedback
Nathan Skerl
Is there an "un-elegant" way of handling most known variations for names that are not hyphenated? Like 'Mc', O'C, and what not?
Merritt
+1  A: 

If you can enable the CLR in SQL Server (requires 2005 or later) then you could create a CLR function that uses the TextInfo.ToTitleCase built-in function which would allow you to create a culture-aware way of doing this in only a few lines of code.

Greg Beech
I have to vote here too. IT is internationally safe and uses someone else's library which is probably full of all sorts of checks. YOu can't go wrong here :)
Cervo
A: 

Here is another variation I found on the SQLTeam.com Forums @ http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=47718

create FUNCTION PROPERCASE
(
--The string to be converted to proper case
@input varchar(8000)
)
--This function returns the proper case string of varchar type
RETURNS varchar(8000)
AS
BEGIN
IF @input IS NULL
BEGIN
--Just return NULL if input string is NULL
RETURN NULL
END

--Character variable declarations
DECLARE @output varchar(8000)
--Integer variable declarations
DECLARE @ctr int, @len int, @found_at int
--Constant declarations
DECLARE @LOWER_CASE_a int, @LOWER_CASE_z int, @Delimiter char(3), @UPPER_CASE_A int, @UPPER_CASE_Z int

--Variable/Constant initializations
SET @ctr = 1
SET @len = LEN(@input)
SET @output = ''
SET @LOWER_CASE_a = 97
SET @LOWER_CASE_z = 122
SET @Delimiter = ' ,-'
SET @UPPER_CASE_A = 65
SET @UPPER_CASE_Z = 90

WHILE @ctr <= @len
BEGIN
--This loop will take care of reccuring white spaces
WHILE CHARINDEX(SUBSTRING(@input,@ctr,1), @Delimiter) > 0
BEGIN
SET @output = @output + SUBSTRING(@input,@ctr,1)
SET @ctr = @ctr + 1
END

IF ASCII(SUBSTRING(@input,@ctr,1)) BETWEEN @LOWER_CASE_a AND @LOWER_CASE_z
BEGIN
--Converting the first character to upper case
SET @output = @output + UPPER(SUBSTRING(@input,@ctr,1))
END
ELSE
BEGIN
SET @output = @output + SUBSTRING(@input,@ctr,1)
END

SET @ctr = @ctr + 1

WHILE CHARINDEX(SUBSTRING(@input,@ctr,1), @Delimiter) = 0 AND (@ctr <= @len)
BEGIN
IF ASCII(SUBSTRING(@input,@ctr,1)) BETWEEN @UPPER_CASE_A AND @UPPER_CASE_Z
BEGIN
SET @output = @output + LOWER(SUBSTRING(@input,@ctr,1))
END
ELSE
BEGIN
SET @output = @output + SUBSTRING(@input,@ctr,1)
END
SET @ctr = @ctr + 1
END

END
RETURN @output
END



GO
SET QUOTED_IDENTIFIER OFF
GO
SET ANSI_NULLS ON
GO
Merritt
+1  A: 

Here is a version that uses a sequence or numbers table rather than a loop. You can modify the WHERE clause to suite your personal rules for when to convert a character to upper case. I have just included a simple set that will upper case any letter that is proceeded by a non-letter with the exception of apostrophes. This does how ever mean that 123apple would have a match on the "a" because "3" is not a letter. If you want just white-space (space, tab, carriage-return, line-feed), you can replace the pattern '[^a-z]' with '[' + Char(32) + Char(9) + Char(13) + Char(10) + ']'.


CREATE FUNCTION String.InitCap( @string nvarchar(4000) ) RETURNS nvarchar(4000) AS
BEGIN

-- 1. Convert all letters to lower case
    DECLARE @InitCap nvarchar(4000); SET @InitCap = Lower(@string);

-- 2. Using a Sequence, replace the letters that should be upper case with their upper case version
    SELECT @InitCap = Stuff( @InitCap, n, 1, Upper( SubString( @InitCap, n, 1 ) ) )
    FROM (
     SELECT (1 + n1.n + n10.n + n100.n + n1000.n) AS n
     FROM       (SELECT 0 AS n UNION SELECT    1 UNION SELECT    2 UNION SELECT    3 UNION SELECT    4 UNION SELECT    5 UNION SELECT    6 UNION SELECT    7 UNION SELECT    8 UNION SELECT    9) AS    n1
     CROSS JOIN (SELECT 0 AS n UNION SELECT   10 UNION SELECT   20 UNION SELECT   30 UNION SELECT   40 UNION SELECT   50 UNION SELECT   60 UNION SELECT   70 UNION SELECT   80 UNION SELECT   90) AS   n10
     CROSS JOIN (SELECT 0 AS n UNION SELECT  100 UNION SELECT  200 UNION SELECT  300 UNION SELECT  400 UNION SELECT  500 UNION SELECT  600 UNION SELECT  700 UNION SELECT  800 UNION SELECT  900) AS  n100
     CROSS JOIN (SELECT 0 AS n UNION SELECT 1000 UNION SELECT 2000 UNION SELECT 3000)                                                                                                             AS n1000
     ) AS Sequence
    WHERE 
     n BETWEEN 1 AND Len( @InitCap )
    AND SubString( @InitCap, n, 1 ) LIKE '[a-z]'                 /* this character is a letter */
    AND (
     n = 1                                                    /* this character is the first `character` */
     OR SubString( @InitCap, n-1, 1 ) LIKE '[^a-z]'           /* the previous character is NOT a letter */
     )
    AND (
     n < 3                                                    /* only test the 3rd or greater characters for this exception */
     OR SubString( @InitCap, n-2, 3 ) NOT LIKE '[a-z]''[a-z]' /* exception: The pattern <letter>'<letter> should not capatolize the letter following the apostrophy */
     )

-- 3. Return the modified version of the input
    RETURN @InitCap

END
Dennis Allen
A: 

the title case sql server function may help u out...

yogesh
A: 

It whould be awesome if any of this functions would take into account roman numerals.

Victor Bello