views:

3544

answers:

6

I have a problem where I am storing a UTF-8 string in SQL Server as UCS-2. When I pull it out to display on a page with the content type set to UTF-8, it works fine. But I have a third-party JavaScript component which, when I pass it the string from the database, renders it as UCS-2, or at least not as UTF-8.

Is there a way in ASP to convert this string to UTF-8 after reading it from the database, so that I can pass it to the third-party (obfuscated) component?

Hope this makes sense.

A: 

Encoding.UTF8 and Encoding.Unicode will provide enough functionality. For more information see Wikipedia

crauscher
-1 This is classic ASP, not ASP.NET.
Andrew Hare
+8  A: 

My suspicion is you are falling foul of the classic form post character encoding mismatch problem.

It goes like this:-

  • You have a form which is presented to the client using the UTF-8 encoding.
  • As a result the browser posts text values entered into the form using UTF-8 encoding.
  • The action page receiving the post has its Response.CodePage set to a typical Windows (ANSI) codepage such as 1252.
  • Each byte of the posted UTF-8 string is treated by the server as an individual character, rather than decoding each set of UTF-8 encoded bytes to the correct Unicode character.
  • The string is stored in the DB with the now corrupted characters.
  • A page wishes to present to the client the content of a DB field containing the corrupted characters.
  • The page sets its CharSet to UTF-8 but its Response.CodePage remains at the Windows codepage such as 1252.
  • Response.Write is used to send the field content to the client; the Unicode characters are transformed back to the same byte-for-byte sequence that was received in the earlier post.
  • The client thinks it is getting UTF-8, hence it decodes the bytes received from the server as UTF-8, just as they were originally, and they appear on screen correctly.
  • Everything proceeds as if all is OK whilst these characters are simply being bounced back and forth through ASP. A bug in one page has a matching bug in the other (could be the same page) which makes everything look fine.
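The sequence above can be reproduced in a few lines. This is only a sketch, written in Python rather than ASP purely for illustration; the sample text "joão" and the variable names are assumptions, and "cp1252" stands in for whatever codepage the server is really using.

```python
# Hypothetical sample text; any non-ASCII string behaves the same way.
original = "joão"

# 1. The browser posts the form value as UTF-8 bytes.
posted_bytes = original.encode("utf-8")

# 2. The server decodes those bytes with codepage 1252, one byte per character.
#    This is the corrupted string that ends up in the database.
stored = posted_bytes.decode("cp1252")

# 3. On output the server encodes the stored string with codepage 1252 again,
#    which reproduces exactly the bytes that were posted.
sent_bytes = stored.encode("cp1252")

# 4. The browser was told the page is UTF-8, so it decodes the bytes as UTF-8.
displayed = sent_bytes.decode("utf-8")

print(repr(stored))     # what the database (and any other consumer) sees
print(repr(displayed))  # what the browser shows
```

The two bugs cancel out for the browser, but anything that reads `stored` directly, such as a third-party component, gets the corrupted form.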

If you examine the field contents directly with SQL Server tools you will likely see the corrupted strings there. You only discover the bug now because you want to use the string with another component, which expects a straightforward Unicode string.

The solution is to ensure that every page not only sends CharSet = "UTF-8" in the response but also sets Response.CodePage = 65001 before using Response.Write and before attempting to read any Request.Form values. Alternatively, use the CodePage directive in the <%@ %> page header, for example <%@ Language="VBScript" CodePage=65001 %>.

Now you are left with repairing the corrupt strings already in your DB.

Use an ADODB.Stream:-

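Function ConvertFromUTF8(sIn)

 ' Repairs a string whose UTF-8 bytes were each decoded as a Windows-1252 character.
 Dim oIn: Set oIn = CreateObject("ADODB.Stream")

 oIn.Open
 oIn.CharSet = "Windows-1252"   ' Write the string back out as its original bytes
 oIn.WriteText sIn
 oIn.Position = 0               ' Rewind; CharSet can only be changed at position 0
 oIn.CharSet = "UTF-8"          ' Re-read those same bytes as UTF-8
 ConvertFromUTF8 = oIn.ReadText
 oIn.Close

End Function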

This function (which BTW is the answer to your actual question) takes a corrupted string (one that holds the byte-for-byte representation) and converts it to the string it should have been. You need to apply this transform to every field in the DB that has fallen victim to the bug.
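When applying the transform across a whole table, some rows may already hold correct text, and converting those would damage them. One possible guard, sketched here in Python for illustration only (the name `repair_mojibake` is hypothetical, and the same idea would have to be ported to VBScript), is to keep the converted value only when the round trip succeeds:

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-Windows-1252 mix-up.

    Returns the text unchanged when it cannot be such a mix-up, i.e. when it
    does not fit in Windows-1252 or its bytes are not valid UTF-8.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return text


print(repr(repair_mojibake("jo\u00c3\u00a3o")))  # corrupted input
print(repr(repair_mojibake("jo\u00e3o")))        # already correct input
```

Plain ASCII text passes through untouched either way. Note that Python's "cp1252" codec leaves a handful of byte values undefined, whereas Windows maps them to control characters, so a few corrupted strings may be left unrepaired by this sketch; treat it as a starting point and check the results on a copy of the data first.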

AnthonyWJones
A: 

I just want to thank Anthony for the above function. It helped me solve a problem that had me stuck for 5 days. I'm mostly a PHP guy, and I had a painful time figuring out how to convert charsets in freakin' VBScript.

Thanks man :)

Popara
A: 

I got the same problem.

I used your function and it works.

Thank you so much.

Kawinl
A: 

Anthony, many thanks for your post, which gave me the clue I needed to solve my issue. I'm connecting (using C#) to a phpBB MSSQL database to retrieve the latest posts, and I had problems with characters. Strangely, everything works fine on the PHP side and the DB collation is the right one, but PHP is storing everything wrong; I don't know why.

// This is the corrupted string:
string s_unicode = "joÃ£o";

// Convert the string to Windows-1252 bytes, which recovers the original UTF-8 byte sequence.
byte[] s_bytes = System.Text.Encoding.GetEncoding("Windows-1252").GetBytes(s_unicode);

// Decode those UTF-8 bytes to a string.
string s_unicode2 = System.Text.Encoding.GetEncoding("UTF-8").GetString(s_bytes);

The result is joão, as expected.

Rui Marques