I need to make the following modifications to a varchar(20) field:

  1. replace accented letters with their unaccented equivalents (e.g. è becomes e)
  2. after (1), remove all characters not in a..z

for example

'aèàç=.32s df' 

must become

'aeacsdf'

Are there built-in or stored functions that achieve this easily?

UPDATE: please provide a T-SQL solution, not a CLR one. This is the workaround I wrote for now because it suits my immediate needs; a more elegant approach would still be better.

CREATE FUNCTION sf_RemoveExtraChars (@NAME nvarchar(50))
RETURNS nvarchar(50)
AS
BEGIN
  declare @TempString nvarchar(100)
  set @TempString = @NAME 
  set @TempString = LOWER(@TempString)
  set @TempString =  replace(@TempString,' ', '')
  set @TempString =  replace(@TempString,'à', 'a')
  set @TempString =  replace(@TempString,'è', 'e')
  set @TempString =  replace(@TempString,'é', 'e')
  set @TempString =  replace(@TempString,'ì', 'i')
  set @TempString =  replace(@TempString,'ò', 'o')
  set @TempString =  replace(@TempString,'ù', 'u')
  set @TempString =  replace(@TempString,'ç', 'c')
  set @TempString =  replace(@TempString,'''', '')
  set @TempString =  replace(@TempString,'`', '')
  set @TempString =  replace(@TempString,'-', '')
  return @TempString
END
GO
A: 

you may use select replace(myfield, 'aèàç=.32s df', 'aeacsdf') from mytable

heximal
Are you going to write such a query for all the possible cases?
StuffHappens
Agreed. In this case it's better to use a trigger, but at the moment I'm not ready to give a complete solution.
heximal
What I need is a function, like Lower(UserName), that can then be used inside a stored procedure, a trigger, an update statement, and so on.
A: 

AFAIK, there isn't a direct mapping for Unicode characters that "look similar". Unless someone has something much cooler, I'd suggest pursuing a brute-force approach so you can get your work done until then.

It sounds like you need two passes: the first replaces letters that look similar, and the second removes all remaining non-English characters.
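
A minimal T-SQL sketch of those two passes, assuming a database whose default code page contains these letters (for example Windows-1252). The variable names, the PATINDEX/STUFF loop and the binary collation are illustrative choices, not something the question prescribes, and only the three accented letters from the sample input are folded here.

```sql
-- Sample input taken from the question; @s and @pos are hypothetical names.
DECLARE @s   varchar(100)
DECLARE @pos int

SET @s = LOWER('aèàç=.32s df')

-- Pass 1: fold look-alike letters (only the three needed for this input).
SET @s = REPLACE(REPLACE(REPLACE(@s, 'è', 'e'), 'à', 'a'), 'ç', 'c')

-- Pass 2: delete every character outside a..z, one at a time.
-- The binary collation turns [a-z] into a plain code-point range, so
-- accented letters that sort between a and z cannot slip through.
SET @pos = PATINDEX('%[^a-z]%', @s COLLATE Latin1_General_BIN)
WHILE @pos > 0
BEGIN
    SET @s   = STUFF(@s, @pos, 1, '')
    SET @pos = PATINDEX('%[^a-z]%', @s COLLATE Latin1_General_BIN)
END

SELECT @s AS Cleaned
```

Under those assumptions the final SELECT should return aeacsdf. The first pass still needs one REPLACE per accented letter you expect to meet, so it is the brute-force part.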

This article can help you create a user defined function so that you can use regular expressions instead of dozens of REPLACE calls: http://msdn.microsoft.com/en-us/magazine/cc163473.aspx

Here is a dictionary that I've been using for this case:

    public static Dictionary<char, string> NonEnglishLetterMapping = new Dictionary<char, string>
    {
          {'a', "áàâãäåāăą"}
        //, {'b', ""}
        , {'c', "ćĉċč"}
        , {'d', "ďđ"}
        , {'e', "éëêèēĕėę"}
        //, {'f', ""}
        , {'g', "ĝğġģ"}
        , {'h', "ĥħ"}
        , {'i', "ìíîïĩīĭįı"}
        , {'j', "ĵ"}
        , {'k', "ķĸ"}
        , {'l', "ĺļľŀł"}
        //, {'m', ""}
        , {'n', "ñńņňŉŋ"}
        , {'o', "òóôõöōŏőơ"}
        //, {'p', ""}
        //, {'q', ""}
        , {'r', "ŕŗř"}
        , {'s', "śŝşšș"}
        , {'t', "ţťŧț"}
        , {'u', "ùúûüũūŭůűųư"}
        //, {'v', ""}
        , {'w', "ŵ"}
        //, {'x', ""}
        , {'y', "ŷ"}
        , {'z', "źżž"}
    };
soslo
Yes, thanks for the idea, anyway I am trying to obtain the desired result in T-SQL, without using CLR functions
+2  A: 

Well, this isn't a whole lot better, but it's at least a T-SQL solution that does everything in a single statement:

declare @TempString varchar(100)

set @TempString='textàè containing éìòaccentsç''''` and things-'

select @TempString=
    replace(
        replace(
            replace(
                replace(
                    replace(
                        replace(
                            replace(
                                replace(
                                    replace(
                                        replace(
                                            replace(@TempString,' ', '') 
                                        ,'à', 'a')
                                    ,'è', 'e') 
                                ,'é', 'e')
                            ,'ì', 'i')
                        ,'ò', 'o') 
                    ,'ù', 'u') 
                ,'ç', 'c') 
            ,'''', '') 
        ,'`', '')
    ,'-', '') 



select @TempString
DForck42
+2  A: 

Let me clarify something first: in a varchar field, the accented characters you show are not stored as Unicode (as one answer implies); they are 8-bit characters from an extended-ASCII code page. One thing to keep in mind: you see characters like è and à only because that is how your code page (the code page used by your OS and/or SQL Server [I'm not sure which one]) displays them. In a different code page (e.g. a Cyrillic or Turkish one), the same byte values would be displayed as entirely different symbols.

Anyway, say you want to replace these 8-bit chars with the closest US/Latin character equivalent for your default code page [I assume these are characters from some variation of a Latin character set]. This is how I approached a similar problem (disclaimer: this is not a very elegant solution, but I could not think of anything better at the time):

Create a UDF to translate an 8-bit ASCII character to a 7-bit printable ASCII equivalent, such as:

CREATE FUNCTION dbo.fnCharToAscii
(
  @Char AS VARCHAR
)
RETURNS
  VARCHAR   
AS
BEGIN
IF (@Char IS NULL)
  RETURN ''

-- Process control and DEL chars.
IF (ASCII(@Char) < 32) OR (ASCII(@Char) = 127)
    RETURN ''

-- Return printable 7-bit ASCII chars as is.
-- UPDATE TO DELETE NON-ALPHA CHARS.
IF (ASCII(@Char) >= 32) AND (ASCII(@Char) < 127)
    RETURN @Char

-- Process 8-bit ASCII chars.
RETURN
  CASE ASCII(@Char)
    WHEN 128 THEN 'E'
    WHEN 129 THEN '?'
    WHEN 130 THEN ','
    WHEN 131 THEN 'f'
    WHEN 132 THEN ','
    WHEN 133 THEN '.'
    WHEN 134 THEN '+'
    WHEN 135 THEN '+'
    WHEN 136 THEN '^'
    WHEN 137 THEN '%'
    WHEN 138 THEN 'S'
    WHEN 139 THEN '<'
    WHEN 140 THEN 'O'
    WHEN 141 THEN '?'
    WHEN 142 THEN 'Z'
    WHEN 143 THEN '?'
    WHEN 144 THEN '?'
    WHEN 145 THEN ''''
    WHEN 146 THEN ''''
    WHEN 147 THEN '"'
    WHEN 148 THEN '"'
    WHEN 149 THEN '-'
    WHEN 150 THEN '-'
    WHEN 151 THEN '-'
    WHEN 152 THEN '~'
    WHEN 153 THEN '?'
    WHEN 154 THEN 's'
    WHEN 155 THEN '>'
    WHEN 156 THEN 'o'
    WHEN 157 THEN '?'
    WHEN 158 THEN 'z'
    WHEN 159 THEN 'Y'
    WHEN 160 THEN ' '
    WHEN 161 THEN 'i'
    WHEN 162 THEN 'c'
    WHEN 163 THEN 'L'
    WHEN 164 THEN '?'
    WHEN 165 THEN 'Y'
    WHEN 166 THEN '|'
    WHEN 167 THEN '$'
    WHEN 168 THEN '^'
    WHEN 169 THEN 'c'
    WHEN 170 THEN 'a'
    WHEN 171 THEN '<'
    WHEN 172 THEN '-'
    WHEN 173 THEN '-'
    WHEN 174 THEN 'R'
    WHEN 175 THEN '-'
    WHEN 176 THEN 'o'
    WHEN 177 THEN '+'
    WHEN 178 THEN '2'
    WHEN 179 THEN '3'
    WHEN 180 THEN ''''
    WHEN 181 THEN 'm'
    WHEN 182 THEN 'P'
    WHEN 183 THEN '-'
    WHEN 184 THEN ','
    WHEN 185 THEN '1'
    WHEN 186 THEN '0'
    WHEN 187 THEN '>'
    WHEN 188 THEN '?'
    WHEN 189 THEN '?'
    WHEN 190 THEN '?'
    WHEN 191 THEN '?'
    WHEN 192 THEN 'A'
    WHEN 193 THEN 'A'
    WHEN 194 THEN 'A'
    WHEN 195 THEN 'A'
    WHEN 196 THEN 'A'
    WHEN 197 THEN 'A'
    WHEN 198 THEN 'A'
    WHEN 199 THEN 'C'
    WHEN 200 THEN 'E'
    WHEN 201 THEN 'E'
    WHEN 202 THEN 'E'
    WHEN 203 THEN 'E'
    WHEN 204 THEN 'I'
    WHEN 205 THEN 'I'
    WHEN 206 THEN 'I'
    WHEN 207 THEN 'I'
    WHEN 208 THEN 'D'
    WHEN 209 THEN 'N'
    WHEN 210 THEN 'O'
    WHEN 211 THEN 'O'
    WHEN 212 THEN 'O'
    WHEN 213 THEN 'O'
    WHEN 214 THEN 'O'
    WHEN 215 THEN 'x'
    WHEN 216 THEN 'O'
    WHEN 217 THEN 'U'
    WHEN 218 THEN 'U'
    WHEN 219 THEN 'U'
    WHEN 220 THEN 'U'
    WHEN 221 THEN 'Y'
    WHEN 222 THEN 'b'
    WHEN 223 THEN 'B'
    WHEN 224 THEN 'a'
    WHEN 225 THEN 'a'
    WHEN 226 THEN 'a'
    WHEN 227 THEN 'a'
    WHEN 228 THEN 'a'
    WHEN 229 THEN 'a'
    WHEN 230 THEN 'a'
    WHEN 231 THEN 'c'
    WHEN 232 THEN 'e'
    WHEN 233 THEN 'e'
    WHEN 234 THEN 'e'
    WHEN 235 THEN 'e'
    WHEN 236 THEN 'i'
    WHEN 237 THEN 'i'
    WHEN 238 THEN 'i'
    WHEN 239 THEN 'i'
    WHEN 240 THEN 'o'
    WHEN 241 THEN 'n'
    WHEN 242 THEN 'o'
    WHEN 243 THEN 'o'
    WHEN 244 THEN 'o'
    WHEN 245 THEN 'o'
    WHEN 246 THEN 'o'
    WHEN 247 THEN '-'
    WHEN 248 THEN 'o'
    WHEN 249 THEN 'u'
    WHEN 250 THEN 'u'
    WHEN 251 THEN 'u'
    WHEN 252 THEN 'u'
    WHEN 253 THEN 'y'
    WHEN 254 THEN 'b'
    WHEN 255 THEN 'y'
  END
RETURN ''
END

The code above is general-purpose, so you can adjust the character mappings to remove all non-alphabetic characters. For example, you can use code like this in the branch that handles printable 7-bit ASCII characters (this assumes a case-insensitive collation):

IF @Char NOT LIKE '[a-z]' RETURN ''
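
Why the collation matters for that test can be checked with a small sketch. The two collation names below are only examples picked for illustration; your database may use different ones.

```sql
-- Compare how the same range test treats an uppercase letter
-- under a case-insensitive collation and under a binary one.
SELECT
    CASE WHEN 'E' COLLATE Latin1_General_CI_AS LIKE '[a-z]'
         THEN 'match' ELSE 'no match' END AS CaseInsensitive,
    CASE WHEN 'E' COLLATE Latin1_General_BIN LIKE '[a-z]'
         THEN 'match' ELSE 'no match' END AS BinaryCollation
```

The first column should read match and the second no match. If your collation is case-sensitive or binary, lowercase the character before the test (or widen the pattern), otherwise the one-line filter would also discard uppercase letters.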

To see if your character mapping for 8-bit ASCII symbols works correctly, run the following code:

DECLARE @I   INT
DECLARE @Msg VARCHAR(32)

SET @I = 128

WHILE @I < 256
BEGIN
    SELECT @Msg = CAST(@I AS VARCHAR) + 
    ': ' + 
    CHAR(@I) + 
    '=' + 
    dbo.fnCharToAscii(CHAR(@I))
    PRINT @Msg
    SET @I = @I + 1 
END

Now you can create a UDF to process a string:

CREATE FUNCTION dbo.fnStringToAscii
(
  @Value AS VARCHAR(8000)
)
RETURNS
  VARCHAR(8000) 
AS
BEGIN
IF (@Value IS NULL OR DATALENGTH(@Value) = 0)
  RETURN ''

DECLARE @Index  INT
DECLARE @Result VARCHAR(8000)

SET @Result = ''
SET @Index  = 1

WHILE (@Index <= DATALENGTH(@Value))
BEGIN
  SET @Result = @Result + dbo.fnCharToAscii(SUBSTRING(@Value, @Index, 1))
  SET @Index = @Index + 1   
END

RETURN @Result
END
GO
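
A short usage sketch, assuming both functions above have already been created in the dbo schema. The table variable @Users and its UserName column are hypothetical names used only to keep the example self-contained.

```sql
-- Hypothetical sample data standing in for the real varchar(20) field.
DECLARE @Users TABLE (UserName varchar(20))

INSERT INTO @Users (UserName) VALUES ('aèàç=.32s df')

-- LOWER is applied after the conversion because the mapping returns
-- uppercase letters for uppercase accented input.
SELECT UserName,
       LOWER(dbo.fnStringToAscii(UserName)) AS Cleaned
FROM   @Users
```

With the mapping exactly as posted, digits, spaces and punctuation pass through unchanged; they are only dropped once the printable-ASCII branch is updated to reject non-alphabetic characters.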
Alek Davis
+2  A: 

What you are looking for is a way to remove diacritics from individual characters. I'm afraid the solution you have is going to be almost as good as you can get, at least with pure SQL. .NET/CLR does provide a simple method for doing this, though. Sorry, I know you want to avoid another CLR solution, but Microsoft SQL Server doesn't provide a T-SQL equivalent for this.

If you're lucky, the collation in your database is set to "SQL_Latin1_General_CP1_CI_AS" or another variant starting with "SQL_Latin1_General_CP1". These collations use code page 1252 (Windows-1252), which is very well documented. You'll be able to "translate" each character to an English equivalent by reviewing the code page and mapping each character with a SQL CASE expression or REPLACE calls like the ones you have been using.

I do have one quick correction for your code, though. You are going to want to use varchar, not nvarchar, in your variables and parameters. Using nvarchar against a varchar field creates additional overhead from converting data types back and forth, and it can introduce characters that exist only in Unicode into the mix. Plus, a security-related reason specific to your situation can be found on Bruce Schneier's blog.

Update: Some great information on diacritics and Windows internationalization can be found on Michael S Kaplan's blog.

John L Veazey