So I've got an interesting problem that I need help with faster than I can get my skills with SQL Server up to par.

We have a table that contains a bunch of text, all of it in different languages. Most of this data appears correctly in the browser; however, anything in Chinese or Japanese gets completely mangled.

This is an ASP.old app that we're using to display the data which comes from a server running MS SQL Server 2005.

Before, we had this same problem and we solved it by changing the encoding in the ASP pages. These files have not changed since we did that, but the problem has resurfaced. Thus I must conclude that the problem rests with the database since that's the only thing that has been updated since we last fixed it.

So far I've been trying to look into collation, but I'm nowhere near an expert on SQL so it's been difficult.

I can provide more info if needed, anything that will help someone get me to the answer, short of URLs (confidentiality and all).

If anyone has any ideas, I would appreciate it very much.

ADDITIONAL INFO:

- column type is 'ntext'

+3  A: 

Collation only affects sort order, not encoding. You need to determine what the encoding of your Chinese and Japanese content is (see this). If it is not UCS-2, you have a problem (since you cannot support multiple page encodings concurrently). If it is UCS-2, you need to make sure that the encoding of your ASP page is also set to UTF-8 (and that the browser recognizes that by correctly setting the encoding to UTF-8 - see View/Encoding).

Or in simpler terms: if the application that created the content did not use Unicode characters, you will have to switch page encoding if you switch between Chinese, Japanese, and European characters.

If you have correctly encoded Unicode content in your database, and you use UTF-8 encoding on your pages, you should not have a problem with displaying any special characters (as long as you use a Unicode font on the page):

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

I realize that despite several edits I am not being very clear, so let me add some basics.

A character set is a standardized representation of a set of characters (e.g. ASCII, UNICODE, ...).

Character encoding is the binary representation used to store characters of a given character set. ASCII has its own encoding. Unicode, which is a very large character set designed to support all characters in existence, has several encodings (UTF-8, UTF-16, UCS-2, ...).

Only Unicode gives you the ability to support Western and Far Eastern content at the same time with the same database and application settings. There are, however, older character sets for the Chinese and Japanese languages that are not Unicode. If your content is not Unicode (Big5, for instance), you cannot display it on a UTF-8 encoded web page.

This can become tricky if the application that created the content used one encoding (e.g. Big5) and the database stored it as Unicode data. If this happens, information could have been lost.

You even have to install the corresponding language packs in Windows in order to see the characters correctly. Unfortunately, encoding issues are not simple to diagnose.

cdonner
Is there a way to determine the encoding that a field is using?
dhulk
When you query the database in Management Studio, what do you see? The same garbage that you get on the web page, or correct data? If you don't see correct data, your data is bad. You can find out what is stored by using the asc() function, then looking up the codes. Gotta run, more later.
cdonner
Yeah it's the same nonsense in the DB as on the page. We're gonna try and get some of it from a backup and see if that's correct. Thanks for all the help though, it's appreciated.
dhulk
When you browse to http://www.yahoo.co.jp, do you see Japanese or little boxes? You do need the language support on the client.
cdonner
Yeah Japanese when displayed properly works fine on my machine. In fact, it worked fine on the site I'm trying to fix like 3 months ago.
dhulk
I wish I could help you any further, but I don't have access to SQL Server where I am right now. I can't try out my recommendations. Since it looks like your data encoding is incorrect, I would look at the byte codes next and try to determine what is actually stored in the table.
cdonner
+3  A: 

There could be a couple of issues here, but since you say that you resolved this before, it may simply be a browser display issue. You should make sure that you have the encoding set correctly and language packs installed. You could check this on a couple of different computers and browsers to determine if it's a problem with a specific machine, browser, or a general issue.

Otherwise, are you using nvarchar or ntext fields in all of your database tables? If not, then you're losing the Chinese and Japanese characters at that level. Also, if you're using any stored procedures, functions, etc. you need to make sure that variables are also nvarchar or ntext.

Finally, recheck that your ASP pages are preserving the encoding in all places. I'm not very familiar with ASP classic, so I will let someone else help with that.
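
A quick way to double-check the column types is to query INFORMATION_SCHEMA.COLUMNS. Here is a rough classic ASP/ADO sketch; the connection string and the table name MyTextTable are placeholders for your own values:

<%
' Diagnostic sketch: list the declared type and collation of each column
' in the table that holds the multilingual text.
' Replace the connection string and 'MyTextTable' with your own values.
Dim conn, rs
Set conn = Server.CreateObject("ADODB.Connection")
conn.Open "Provider=SQLOLEDB;Data Source=YOUR_SERVER;Initial Catalog=YOUR_DB;Integrated Security=SSPI;"

Set rs = conn.Execute( _
    "SELECT COLUMN_NAME, DATA_TYPE, COLLATION_NAME " & _
    "FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'MyTextTable'")

Do While Not rs.EOF
    ' Anything that is not nchar, nvarchar or ntext cannot hold the full Unicode range.
    Response.Write rs("COLUMN_NAME") & ": " & rs("DATA_TYPE") & _
        " (" & rs("COLLATION_NAME") & ")<br />"
    rs.MoveNext
Loop

rs.Close
conn.Close
%>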

Justin Gallagher
I tried changing the character encoding in FF to a bunch of different ones. The closest it got to looking proper was using "Japanese (SHIFT_JIS)". Being relatively ignorant of different encoding types, I have no idea what that means. The data is stored in a column of type ntext.
dhulk
I have mine set to Unicode (UTF-8) and I am able to view Chinese characters. Try that, and if you still have an issue then it's probably something to do with the ASP page. Is this the same page that had the issue previously?
Justin Gallagher
It's actually a bunch of pages, but I'm just experimenting with fixing it on one page. The old fix worked fine (changing it to utf-8), but now it's all not working again and I don't personally have access to the old data, I have to wait on that.
dhulk
+1  A: 

Do you have the following in your ASP files?

<%@codepage=65001%>
Session.CodePage = 65001

Michael Pryor
A: 

ntext has been deprecated in SQL 2005 (http://geekswithblogs.net/johnsPerfBlog/archive/2008/04/16/ntext-vs-nvarcharmax-in-sql-2005.aspx). Not sure if it helps, but you can try converting ntext to nvarchar(max).
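
If you do try the conversion, a rough sketch is below. The table name MyTextTable and column name Body are placeholders, the same SQL can be pasted straight into Management Studio instead, and you should back up the table first:

<%
' One-off conversion sketch: ntext -> nvarchar(max), the SQL 2005 replacement,
' which is still Unicode (UTF-16) internally.
' 'MyTextTable' and 'Body' are placeholder names.
Dim conn
Set conn = Server.CreateObject("ADODB.Connection")
conn.Open "Provider=SQLOLEDB;Data Source=YOUR_SERVER;Initial Catalog=YOUR_DB;Integrated Security=SSPI;"

conn.Execute "ALTER TABLE MyTextTable ALTER COLUMN Body nvarchar(max) NULL"

conn.Close
%>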

David
That won't solve the problem, although it is interesting information.
JasonTrue
A: 

You said you cannot even read it in Management Studio. It is very important to check whether any data has already been lost.

In order to know how to restore it, you have to know how it became corrupted.

  1. How was this text written to the database? Was any transcoding (including transcoding done implicitly by ASP) performed before it was written to the DB?

  2. What is actually stored in the database already? You can get the first two or three bytes of the "broken" words and compare their byte ranges against common character sets (a sketch of this check follows below).

If the data came from a browser, you should check the encoding of the page that contains the form. Browsers use the page's encoding when they encode and submit form data. If that charset/encoding does not match what the receiver (e.g. your ASP page) expects, the words may be decoded incorrectly.
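
Here is a rough classic ASP sketch of the check in point 2 (MyTextTable, Body and the Id value are placeholders). It prints the UTF-16 code points of the first few characters so you can compare them against known ranges: CJK ideographs mostly fall between U+4E00 and U+9FFF, while single-byte text misread as Unicode tends to stay below U+0100.

<%
' Dump the code points of the first few characters of one "broken" row.
' 'MyTextTable', 'Body' and the Id value are placeholders.
Dim conn, rs, sample, i, code
Set conn = Server.CreateObject("ADODB.Connection")
conn.Open "Provider=SQLOLEDB;Data Source=YOUR_SERVER;Initial Catalog=YOUR_DB;Integrated Security=SSPI;"

' CAST keeps the ntext column readable as a plain string through ADO.
Set rs = conn.Execute("SELECT TOP 1 CAST(Body AS nvarchar(4000)) AS Body FROM MyTextTable WHERE Id = 42")
sample = rs("Body").Value
rs.Close
conn.Close

For i = 1 To 10
    If i > Len(sample) Then Exit For
    code = AscW(Mid(sample, i, 1))
    If code < 0 Then code = code + 65536   ' AscW can return a signed 16-bit value
    Response.Write "U+" & Right("0000" & Hex(code), 4) & " "
Next
%>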

Dennis Cheung
A: 

You cannot use UTF-8 to display Japanese or Chinese as these are double byte characters and require UTF-16. You will need this tag in the head element of the web page;

<meta http-equiv="Content-Type" content="text/html; charset=utf-16"/>

See the W3C Character Sets Tutorial for more details about handling character encoding.

If you are storing the raw bytes in a BLOB field in the database you should be ok but you will need to check that the data is not altered before it gets rendered on the webpage.

Dave Anderson
Sorry, but this is simply not true. You most certainly can use UTF-8 to represent Japanese, Chinese, and Korean characters. UTF-8 is a variable-width encoding.
blackkettle
@blackkettle Thanks, never knew about variable-width encoding until now. So ideograms may use three bytes in UTF-8, but only two in UTF-16, whereas ASCII is one byte in UTF-8 or two in UTF-16. If storage space is an issue you need to be careful when you choose your encoding.
Dave Anderson
Not a good idea to use UTF-16 in a web page, as it's not really MIME compatible. OK to use for storage, of course. The information about UTF-8 is wrong, as BlackKettle pointed out. UTF-8 can be as long as 4 octets (4 bytes). "double-byte" only makes sense when referring to legacy encodings.
JasonTrue
A: 

If you modified the database, then the most likely culprit is in the storage of the fields. You may pass the fields via a variable that is not ntext, but rather just text or varchar. That will kill the data going in, and then it will look wrong coming back on the web page.

What do you use to insert the data into the database?

Yishai
A: 

Are you absolutely certain that your schema and database have the default character set configured correctly?

blackkettle
A: 

I suspect you have several problems.

There are actually several common ways to represent Japanese and Chinese text, using legacy encodings (Shift_JIS, EUC-JP, and JIS-variants for Japanese, and several others for Chinese) or Unicode (UTF-8 or UTF-16). For a multilingual application, the preferred solution is to transmit page content in UTF-8; Windows itself prefers to store content in UTF-16 (which is what NTEXT and NVARCHAR use in MS SQL Server).

In order to get Japanese content to display correctly, you need to make sure the proper conversions are happening at every stage in your data pipeline. Let's assume that you're going to use Unicode for the sake of sanity, but the answer would be similar if you intentionally elected to use Shift-JIS, big5, gb2312 or something, just more complicated.

If your data is primarily coming from web forms, you need to make sure that your codepage is set to 65001, usually using the <%@codepage=65001%> directive at the top of each ASP file.

Additionally, you need to provide a hint to your user-agents (the web browsers) that you are using UTF-8. There are two techniques: one uses an HTTP header; the other fakes that header with a meta tag.

The meta tag solution:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The HTTP header solution, using my rusty ASP skills (assuming JScript, but you're probably using VBScript, which would require you to drop the semicolons):

Response.ContentType = "text/html";
Response.Charset = "utf-8";
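
Putting those pieces together, a minimal sketch of a classic ASP page (VBScript here) that uses UTF-8 consistently might look like this:

<%@ Language=VBScript CodePage=65001 %>
<%
' 65001 = UTF-8. Session.CodePage makes ASP treat incoming form/querystring
' data as UTF-8; Response.Charset advertises UTF-8 in the Content-Type header.
Session.CodePage = 65001
Response.ContentType = "text/html"
Response.Charset = "utf-8"
%>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<!-- write the database text out here -->
</body>
</html>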

If you are taking data into MSSQL in feeds, rather than web forms, you'll also need to make sure the data is converted properly. Depending on your import mechanism, the method for specifying the source encoding is different, so I'll have to leave that as an "exercise for the reader."

Next, when submitting your data to SQL Server, you need to make sure you're using the correct SQL input mechanism. If you're not parameterizing your queries (and you should be), you need to remember to use the N'MyText' form rather than 'MyText' when putting text literals in your query. If you are parameterizing, use adVarWChar where you would otherwise use adVarChar. (There are corresponding "W" types for each ADO character data type.)
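
For example, here is a rough sketch of a parameterized insert through ADO using the wide types; the table, column and form-field names and the sizes are placeholders:

<%
' Rough sketch of a parameterized insert with the Unicode ("W") ADO types.
' Constants are declared inline so the snippet works without adovbs.inc.
Const adParamInput = 1
Const adVarWChar = 202       ' Unicode counterpart of adVarChar (200)
Const adLongVarWChar = 203   ' suitable for ntext / nvarchar(max) columns

Dim conn, cmd, title, body
title = CStr(Request.Form("title"))
body = CStr(Request.Form("body"))

Set conn = Server.CreateObject("ADODB.Connection")
conn.Open "Provider=SQLOLEDB;Data Source=YOUR_SERVER;Initial Catalog=YOUR_DB;Integrated Security=SSPI;"

Set cmd = Server.CreateObject("ADODB.Command")
Set cmd.ActiveConnection = conn
cmd.CommandText = "INSERT INTO MyTextTable (Title, Body) VALUES (?, ?)"
cmd.Parameters.Append cmd.CreateParameter("@Title", adVarWChar, adParamInput, 200, title)
cmd.Parameters.Append cmd.CreateParameter("@Body", adLongVarWChar, adParamInput, Len(body) + 1, body)
cmd.Execute

conn.Close
%>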

Additionally, some browsers use the HTML LANG attribute as a hint for displaying text in a suitable font for the content language. If you happen to know which language your content is in, you can add LANG="ja-jp" to any HTML element (including BODY). Then a reasonable default font for that language should be used by the browser (but you can explicitly specify one if you like). Most browsers made in the last 5 years do some font-linking magic even if you choose an inappropriate default font for a particular language, but you'll get more reliable results and slightly better rendering performance if you use an appropriate font.

As an additional note, If you're getting almost-correct results when manually forcing the encoding as shift-jis on the browser, that means that you are probably using windows-1252 as your charset <%@codepage=1252%> and that you're getting lucky that the content hasn't been messed up entirely. There are a couple of hacks that can restore hosed Shift-Jis-in-1252 or iso-8859-1, but they're not 100% reliable.

As for collation on SQL server, this has two impacts. On NVARCHAR and NTEXT fields, it only affects sorting and querying (including case, accent and kana-sensitivity). On varchar and text fields, it also affects the encoding, but it's not the most sensible solution to your problem.

JasonTrue