Does a browser-based application that intends to display and capture data only in English need to have a UTF-8 database?

Is there any problem if the site is accessed on a Japanese-language operating system? If the user types only in English, do we need to take any extra care? If the user types in Japanese, how can the system detect that and throw an error?

The website will be developed in .NET 3.5.

EDIT:

I don't want to capture Japanese or any other language. The site will be entirely in English, and users should enter information in English as well. Displaying English characters on a Japanese OS is not a problem either. The problem is this: if a user on a Japanese OS types Japanese characters into a textbox, how can I detect that and show the user an error? Secondly, would that user still be able to type English characters into the textbox?

+1  A: 

Well, you could check for non-English characters easily enough (with a regular expression, I suppose), though I don't see why you would. But you could do that.
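The regex check suggested above could look something like the following. This is a minimal sketch in Python for illustration (the site itself is .NET, where `System.Text.RegularExpressions.Regex` with the same pattern would apply); the function name and the choice of "printable ASCII" as the allowed range are assumptions:

```python
import re

# Assumed definition of "English" input: printable ASCII (U+0020 to U+007E).
# Anything outside this range (Japanese text, full-width Latin, etc.) is rejected.
NON_ASCII = re.compile(r"[^\x20-\x7E]")

def contains_non_english(text):
    """Return True if text contains any character outside printable ASCII."""
    return bool(NON_ASCII.search(text))
```

Note that this would also flag full-width Latin letters typed via a Japanese IME, which may or may not be what you want (see the last answer below).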

I also don't really see a good reason not to use NVARCHAR for user-supplied text fields. Requirements often change.

Noon Silk
+1  A: 

It's always easier to build multibyte character-set support into an application from the beginning than to retrofit it later.

In addition to having to revisit all the code, you'll run into errors converting your existing database to Unicode, and you may find that there's no good way to determine which character set a given piece of data was originally encoded in.

Richard Pistole
+2  A: 

I don't think there are any strong reasons not to use UTF-8. You never know where strange characters may leak in.

Any incoming data should be processed and re-encoded. With HTML forms you can supply the following tag:

<input type="hidden" name="_charset_" value="" />

All browsers should populate this field with the charset used for the form submission; you can then use it to decode and re-encode the input.
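Server-side, the reported charset can then be used to decode the raw bytes before storing everything uniformly as UTF-8. A minimal sketch in Python (the function name and the fallback-to-UTF-8 behavior are assumptions; the framework-specific form handling is omitted):

```python
def normalize_to_utf8(raw_bytes, reported_charset):
    """Decode form bytes using the charset the browser reported in the
    hidden _charset_ field, then re-encode uniformly as UTF-8."""
    text = raw_bytes.decode(reported_charset or "utf-8", errors="replace")
    return text.encode("utf-8")
```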

Also, if you haven't read it, read Joel's post on Unicode: http://www.joelonsoftware.com/articles/Unicode.html

monkut
+3  A: 

Japanese fonts and input methods have two versions of the 'English' characters in Unicode: the normal-width ones and the 'wide'/full-width ones (which are useful when text is printed top-to-bottom rather than left-to-right). Be careful how you attempt to 'filter out' non-English characters: if you raise an error for example 2 below, your users will be very confused!

1) correctly encoded

2) ｃｏｒｒｅｃｔｌｙ　ｅｎｃｏｄｅｄ

The second line is NOT a different font or 'encoding': those are the additional fixed-width copies of our alphabet (the full-width forms) that align nicely within blocks of hiragana/katakana/kanji (Japanese writing).
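One way to avoid rejecting full-width Latin input outright is to fold it back to ordinary ASCII before validating. Unicode NFKC normalization does exactly this mapping; a sketch in Python for illustration (in .NET, `string.Normalize(NormalizationForm.FormKC)` is the equivalent):

```python
import unicodedata

def normalize_width(text):
    """NFKC normalization folds full-width Latin letters and the
    ideographic space back to their ordinary ASCII equivalents."""
    return unicodedata.normalize("NFKC", text)
```

Normalizing first, then validating, means a user who typed English words through a Japanese IME in full-width mode is corrected rather than shown an error.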

I would definitely consider UTF-8 encoding and NCHAR/NVARCHAR in the database.

CraigD