I need to convert a value such as "Convert value into a URL friendly format - Unicode decomposition ähhh" into "convert-value-into-a-url-friendly-format-unicode-decomposition-ahhh". Is this possible in SQL Server? All Unicode characters should be handled.

I use SQL Server 2005; SQL Server 2008 is an option.

EDIT

Bogdan had a solution that worked for me.

The query depends on the characters you need to handle, but in most cases it should be OK. You really need to pass in a collation that does not contain the characters you want to change; Cyrillic works nicely for that. This is kind of hacky...

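-- The N prefix keeps the literal as Unicode; without it the string is
-- converted to the database code page before it reaches @input.
declare @input nvarchar(4000)
set @input = N'áâãäåæçèéêëìíîïðñòóôõöùúûüýÿāăąćĉċčĕėęěĝğġģĥħĩīĭįĵķļľńňŋōŏőŕřśŝşšťũūŭůűŵŷźžǻǽǿ'

-- varchar instead of char avoids padding the result with trailing spaces
SELECT CAST(@input as varchar(4000)) COLLATE Cyrillic_General_CI_AS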
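
To get from that query to the hyphenated, lowercase form asked for at the top, the cast can be wrapped in a scalar function. This is only a minimal sketch: the name dbo.slugify is hypothetical, and the accent stripping relies on the collation cast behaving as in the query above. It sticks to syntax available in SQL Server 2005.

```sql
-- Hypothetical helper (the name dbo.slugify is an assumption, not from the thread).
create function dbo.slugify(@input nvarchar(4000))
returns varchar(4000)
as
begin
    declare @s varchar(4000)
    declare @out varchar(4000)
    declare @i int
    declare @code int

    -- Strip accents with the collation trick, then lowercase.
    set @s = lower(cast(@input as varchar(4000)) collate Cyrillic_General_CI_AS)
    set @out = ''
    set @i = 1

    -- len() skips trailing spaces, which is fine here:
    -- they would only become a trailing hyphen that gets trimmed anyway.
    while @i <= len(@s)
    begin
        set @code = ascii(substring(@s, @i, 1))
        if @code between 48 and 57 or @code between 97 and 122
            -- keep 0-9 and a-z as they are
            set @out = @out + char(@code)
        else if len(@out) > 0 and right(@out, 1) <> '-'
            -- any other run of characters collapses into a single hyphen
            set @out = @out + '-'
        set @i = @i + 1
    end

    -- Drop a trailing hyphen left by trailing punctuation or unmapped characters.
    if right(@out, 1) = '-'
        set @out = left(@out, len(@out) - 1)

    return @out
end
go

select dbo.slugify(N'Unicode decomposition ähhh')
```

Characters that the cast cannot map (they come back as '?') simply turn into hyphens here, so anything like æ or ŋ still needs an explicit replacement before the cast.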

A: 

Yes, it is possible. The answer is a scalar-valued user-defined function (UDF).

I see two options here:

  1. Create a UDF in T-SQL - requires quite a bit of effort and a lot of work with character codes (I guess), and will be 'not that fast'.
  2. Create a CLR UDF - a whole lot faster and simpler if you're familiar with .NET.

The second option requires you to enable CLR integration in SQL Server, in addition to creating an assembly with the function and deploying it to the server(s):

exec sp_configure 'clr enabled', 1
RECONFIGURE

AlexS
A: 

Here is a simple URL encoding function (it takes a varchar parameter) that I found a long time ago on some forum:

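create function urlencode(@str as varchar(4000))
returns varchar(4000)
as
begin
  -- Percent-encodes every character except 0-9, A-Z and a-z.
  -- Note: the result is cut off if it grows beyond 4000 characters.
  declare @hex char(16)
  declare @c char(1)
  declare @code int
  declare @ostr varchar(4000)
  declare @l int
  declare @len int
  set @hex = '0123456789ABCDEF'
  set @ostr = ''
  set @l = 1
  -- len() ignores trailing spaces, so measure with a sentinel appended
  set @len = len(@str + '.') - 1
  while @l <= @len
  begin
    set @c = substring(@str, @l, 1)
    set @code = ascii(@c)
    -- Compare character codes rather than characters: under most collations
    -- "between 'a' and 'z'" also matches accented letters such as é.
    if @code between 48 and 57     -- 0-9
    or @code between 65 and 90     -- A-Z
    or @code between 97 and 122    -- a-z
      set @ostr = @ostr + @c
    else
      set @ostr = @ostr + '%' +
        substring(@hex, (@code / 16) + 1, 1) +
        substring(@hex, (@code & 15) + 1, 1)
    set @l = @l + 1
  end
  return @ostr
end
go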

How do you handle Unicode? It's quite straightforward if you don't care about Hindi or Arabic characters but do care about Central European languages: all you need is CAST(@nvarchar AS varchar).

Let's check how this works with some Central European symbols. Run the following example:

declare @t1 nvarchar(256)
select @t1 = N'áâãäåæçèéêëìíîïðñòóôõöùúûüýÿāăąćĉċčĕėęěĝğġģĥħĩīĭįĵķļľńňŋōŏőŕřśŝşšťũūŭůűŵŷźžǻǽǿ'
select @t1
declare @t2 varchar(512)
select @t2 = cast(@t1 as varchar(512))
select @t2

And see what output we get (this was run on a database whose collation is Cyrillic_General_CI_AS):

áâãäåæçèéêëìíîïðñòóôõöùúûüýÿāăąćĉċčĕėęěĝğġģĥħĩīĭįĵķļľńňŋōŏőŕřśŝşšťũūŭůűŵŷźžǻǽǿ
aaaaa?ceeeeiiii?nooooouuuuyyaaacccceeeegggghhiiiijkllnn?ooorrsssstuuuuuwyzz???

So most symbols are converted perfectly, while a few become question marks. If you care about those symbols (such as æ, ð, ŋ), you need to write an additional function that replaces them before the conversion with whatever you find most appropriate (sometimes two symbols instead of one, for example æ => ae).

To replace them you can use the REPLACE() function, but be aware that performance will suffer if you call it too many times. So if you have a lot of character replacements, you can use the following algorithm:
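
For just a handful of such characters, a couple of chained REPLACE() calls before the cast may be all that is needed. A minimal sketch, with a sample value chosen purely for illustration and only the two mappings mentioned in this answer (æ => ae, ŋ => n):

```sql
declare @t nvarchar(256)
set @t = N'æ and ŋ'   -- sample value, an assumption for illustration

-- Replace the characters that the cast would otherwise turn into '?'.
-- The binary collation makes REPLACE match exact code points only,
-- so uppercase forms such as Æ would need their own REPLACE call.
set @t = replace(replace(@t collate Latin1_General_BIN, N'æ', N'ae'), N'ŋ', N'n')

select cast(@t as varchar(256))
```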

1) Create a temporary table (or table variable) with 3 columns: position int identity(0,1) primary key clustered, original nchar(1) not null, converted varchar(2) null.

2) Using a loop and the SUBSTRING() function, split the string into characters and insert each character into the original column of this temporary table.

3) Using one query with many WHEN ... THEN clauses, convert all symbols:

update @temp_table
set converted = CASE original
     WHEN N'æ' THEN 'ae'
     WHEN N'ŋ' THEN 'n'
     -- ... and so on ...
     ELSE CAST(original AS VARCHAR(2))
     END

4) Using a loop, concatenate the results from the converted column into one varchar() variable.
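
Put together, the four steps could look roughly like the batch below. It is a sketch rather than a tested implementation: the sample input and the variable names are assumptions, only the two mappings from the CASE above are included, and the ELSE branch still depends on the database collation. The binary collation on the CASE input is an addition so that each WHEN matches one exact character.

```sql
declare @input nvarchar(4000)
declare @result varchar(8000)
declare @i int
declare @n int
set @input = N'æ and ŋ'   -- sample value, an assumption for illustration

-- Step 1: one row per character, in string order.
declare @chars table (
    position int identity(0,1) primary key clustered,
    original nchar(1) not null,
    converted varchar(2) null
)

-- Step 2: split the string into characters.
set @i = 1
while @i <= len(@input)
begin
    insert into @chars (original) values (substring(@input, @i, 1))
    set @i = @i + 1
end

-- Step 3: convert all characters in a single statement.
update @chars
set converted = case original collate Latin1_General_BIN
        when N'æ' then 'ae'
        when N'ŋ' then 'n'
        else cast(original as varchar(2))
    end

-- Step 4: concatenate the converted pieces by position.
select @n = count(*) from @chars
set @result = ''
set @i = 0
while @i < @n
begin
    select @result = @result + converted from @chars where position = @i
    set @i = @i + 1
end

select @result
```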

Once you have converted nvarchar() to varchar(), call the urlencode() function listed above.

I understand that this approach requires a lot of WHEN/THEN clauses, but how many depends on which languages you need to support. As you can see, for most European symbols CAST to varchar gives a perfect result.

If you go with a CLR function implementation (in C#), you will have to write a lot of switch/case statements too. So both approaches require about the same development effort, but the CLR solution requires additional administrative actions. For small strings the CLR solution will be slow (because SQL Server needs some time to interop with the CLR environment to make the call and get the results back), while for big strings with lots of replacements C# might be faster (I never checked this!) because SQL is not the best language for string manipulation.

Bogdan_Ch
This looks pretty promising. But my result is slightly different: á stays á, but ā becomes a. Which version of SQL Server do you use?
Malcolm Frexner
I think it has to do with the collation of the server. Can you tell me which collation your server uses?
Malcolm Frexner
The database where I tested it uses Cyrillic_General_CI_AS
Bogdan_Ch
Well, you are right, collations are important, because if you select a collation in which a character exists, it will be converted from Unicode without any changes... So you have to use a collation in which the extended Latin symbols do not exist, and that is why they are converted to standard Latin. Cyrillic collation is a good choice :)
Bogdan_Ch