views: 92

answers: 2

We are building a (Java) web project with Eclipse. By default Eclipse uses Cp1252 encoding on Windows machines (which we use).

As we also have developers in China (in addition to Europe), I started to wonder if that is really the encoding to use.

My initial thought was to convert to UTF-8, because "it supports all the character sets". However, is this really wise? Should we pick some other encoding instead? I see a couple of issues:

1) How do web browsers interpret the files by default? Does it depend on which language version one is using? What I am after here is whether we should explicitly declare the encodings used:

  • XHTML files can declare the encoding explicitly with the <?xml version='1.0' encoding='UTF-8' ?> declaration.
  • CSS files can set it with @charset "UTF-8"; (the at-rule must be lowercase and appear as the very first thing in the file).
  • JavaScript files have no in-file declaration, but one can define <meta http-equiv="Content-Script-Type" content="text/javascript; charset=utf-8"> globally or <script type="text/javascript" charset="utf-8"> for specific scripts.

What if we leave a CSS file without the @charset "UTF-8"; declaration? How does the browser decide how it is encoded?

2) Is it wise to use UTF-8 just because it is so flexible? By locking our code into Cp1252 (or maybe ISO-8859-1) I can ensure that foreign developers don't introduce special characters into files. This effectively prevents them from inserting Chinese comments, for example (we should use 100% English). Also, allowing UTF-8 can let developers accidentally introduce strange characters that are difficult or impossible to spot with the human eye. This happens when people, for example, copy-paste text or hit some odd keyboard combination by accident.

It would seem that allowing UTF-8 in the project just brings problems...

3) For internationalization, I initially considered UTF-8 a good thing ("how can you add translations if the file encoding doesn't support the characters one needs?"). However, as it turned out, Java resource bundles (.properties files) must be encoded in ISO-8859-1, because otherwise they might break. Instead, the international characters are converted into \uXXXX notation, for example \u0009, and the files are encoded in ISO-8859-1. So... we are not even able to use UTF-8 for this.
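To make that concrete, here is a small standalone sketch (made-up class name, not our project code) of how Properties escapes such characters on store() and decodes them again on load():

```java
// Standalone illustration of the .properties escaping behaviour described above.
// Class name is made up; this is not part of our project.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Properties;

public class PropertiesEscapeDemo {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Chinese "hello", written with Java unicode escapes so this source file
        // itself stays pure ASCII
        props.setProperty("greeting", "\u4F60\u597D");

        // store() always writes ISO-8859-1 and escapes the value: greeting=\u4F60\u597D
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        props.store(out, null);
        System.out.println(out.toString("ISO-8859-1"));

        // load() on an InputStream assumes ISO-8859-1 and decodes the escapes back
        Properties loaded = new Properties();
        loaded.load(new ByteArrayInputStream(out.toByteArray()));
        // prints the two Chinese characters (console encoding permitting)
        System.out.println(loaded.getProperty("greeting"));
    }
}
```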

For binary files... well, the encoding scheme doesn't really matter (I suppose one can say it doesn't even exist).

How should we approach these issues?

+5  A: 

I'd definitely recommend UTF-8 over all other encoding schemes.

Make sure that your DBMS is fully UTF-8 compliant if you're storing multilingual data in a database.

Also, ensure that all files, including CSS, JavaScript, and application template files, are themselves encoded in UTF-8 with a BOM. If not, the charset directives may not be interpreted correctly by the browser.

We have over 30 languages in a big database-backed CMS and it's working like a charm. The client has human editors for all languages who do the data entry.

You may run into collation issues with some languages (the example of the dreaded Turkish dotless i - ı - in case-insensitive databases springs to mind). There's always an answer to that, but it'll be very database-specific.

I am not familiar with the specifics of Java Resource Bundles. We do use some Java libraries like markdownj that process UTF-8 encoded text in and out of the database without problems.


Edited to answer the OP's comments:

I think the main reason for mainstreaming UTF-8 is that you never know in what direction your systems will evolve. You may assume that you'll only be handling one language today, but that's not true even in perfectly monolingual environments, as you may have to store names or references containing non-US-ASCII octet values.

Also, a UTF-8 encoded character stream will not alter US-ASCII octet values, and this provides full compatibility with non UTF-8 enabled file systems or other software.
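As a quick throwaway check in Java (illustrative only, not project code), you can see this yourself:

```java
// Throwaway check: UTF-8 encodes pure US-ASCII text to exactly the same bytes.
// Class name is arbitrary.
import java.util.Arrays;

public class AsciiUtf8Check {

    public static void main(String[] args) throws Exception {
        String ascii = "Plain ASCII text: 0-9, A-Z, a-z";
        byte[] asUsAscii = ascii.getBytes("US-ASCII");
        byte[] asUtf8 = ascii.getBytes("UTF-8");
        System.out.println(Arrays.equals(asUsAscii, asUtf8)); // prints true
    }
}
```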

Today's browsers will all interpret UTF-8 correctly, provided the file itself was actually encoded as UTF-8 and you include <meta charset="utf-8"> on any page that's served to a browser.

Do check whether your middleware (PHP, JSP, etc.) supports UTF-8 throughout, and do so in conjunction with your database.

I fail to see what the problem is with developers potentially dealing with data they don't understand. Isn't that also potentially the case when we deal with data in our own native languages? At least with a fully Unicode system they'll be able to recognize whether the glyphs they see in the browser or in the database match the language they're supposed to be dealing with, instead of getting streams of ???? ?????? ??? ????.

I do believe that using UTF-8 as your character encoding for everything is a safe bet. It should work for pretty much every situation, and you're all set for the day your boss comes around and insists you must go multilingual.

Vincent Buck
+1 For utf-8 DBMS
HeDinges
Thanks for your suggestions; however, I don't see this really answering my specific questions. Why should we actually use UTF-8 if we don't need it (do we need it)? How do browsers identify the encoding scheme? What is their default? Do you consider supporting UTF-8 and enabling the insertion of bad characters into files an issue, etc.? Your point about UTF-8 in the database is very relevant.
Tuukka Mustonen
Thanks for expanding your answer. My fear of odd characters comes from personal history: I was accidentally pressing CTRL+Space, which introduced a strange character on a Linux machine (it looked like a normal space). However, it resulted in build failures, as my editor saved the file as UTF-8 but the compiler did not like that character. It took me a while to figure this out. What if I had put this character into a String and the compiler hadn't nagged about it? I might have gotten these glyphs in pages, or some String comparison would have failed with no idea why.
Tuukka Mustonen
And yeah, String comparison is nasty, but sometimes you need it :)
Tuukka Mustonen
+2  A: 

My initial thought was to convert to UTF-8, because "it supports all the character sets". However, is this really wise?

Go for it. You want world domination.

1) How do web browser interpret the files by default? Does it depend on what language version one is using?

It uses the Content-Type response header for this (note, the real response header, not the HTML meta tag). I see/know that you're a Java developer, so here are JSP/Servlet-targeted answers: setting <%@page pageEncoding="UTF-8" %> at the top of the JSP page will implicitly do this right, and setting response.setCharacterEncoding("UTF-8") in a Servlet/Filter does the same. If this header is absent, then it is entirely up to the browser to decide/determine the encoding. MSIE will plainly use the platform default encoding. Firefox is a bit smarter and will guess the encoding based on the page content.
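As a rough sketch (class name and mapping are up to you), such a Filter could look like this:

```java
// A rough sketch of a Filter that forces UTF-8 on every request and response.
// Class name is illustrative; map it in web.xml to the URL patterns you need.
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class CharacterEncodingFilter implements Filter {

    public void init(FilterConfig config) throws ServletException {
        // nothing to configure in this sketch
    }

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // interpret request parameters as UTF-8
        request.setCharacterEncoding("UTF-8");
        // advertise UTF-8 in the real Content-Type response header
        response.setCharacterEncoding("UTF-8");
        chain.doFilter(request, response);
    }

    public void destroy() {
        // nothing to clean up
    }
}
```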

2) Is it wise to use UTF-8, because it is so flexible. By locking our code into Cp1252 (or maybe ISO-8859-1) I can ensure that foreign developers don't introduce special characters into files.

I would just write up a document describing the team's coding conventions and spread it among the developers. Every self-respecting developer knows that s/he risks getting fired for not adhering to it.

3) For internationalization, I initially considered UTF-8 a good thing ("how can you add translations if the file encoding doesn't support the characters one needs?"). However, as it turned out, Java resource bundles (.properties files) must be encoded in ISO-8859-1, because otherwise they might break.

This has been solved since Java 1.6 with the new Properties#load() method that takes a Reader, and the new ResourceBundle.Control class with which you can control the loading of the bundle file. In JSP/Servlet terms, a ResourceBundle is usually used. Just set the message bundle name to the fully qualified class name of the custom ResourceBundle implementation and it will be used.
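For illustration, a bare-bones Control along these lines (class name is arbitrary, and a real implementation would also honor the format and reload arguments) reads the .properties file as UTF-8:

```java
// A bare-bones sketch of a ResourceBundle.Control that reads .properties files
// as UTF-8 via the Reader-based Properties loading available since Java 1.6.
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class Utf8Control extends ResourceBundle.Control {

    public ResourceBundle newBundle(String baseName, Locale locale, String format,
            ClassLoader loader, boolean reload) throws IOException {
        // resolve e.g. "messages" + fi_FI to "messages_fi_FI.properties"
        String bundleName = toBundleName(baseName, locale);
        String resourceName = toResourceName(bundleName, "properties");
        InputStream stream = loader.getResourceAsStream(resourceName);
        if (stream == null) {
            return null; // let ResourceBundle try its other candidates
        }
        try {
            // read the file as UTF-8 instead of the default ISO-8859-1
            return new PropertyResourceBundle(new InputStreamReader(stream, "UTF-8"));
        } finally {
            stream.close();
        }
    }
}
```

You would then load bundles with ResourceBundle.getBundle("messages", locale, new Utf8Control()), where "messages" is just an example bundle name.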

For binary files... well, the encoding scheme doesn't really matter (I suppose one can say it doesn't even exist).

The encoding is indeed only interesting whenever one wants to convert computer-readable binary data to human-readable character data. For "real" binary content it indeed doesn't make any sense, since the binary format doesn't represent any sensible character data.


BalusC
1) If my JSF XHTML contains a UTF-8 declaration at the top, does it get UTF-8 in the `Content-Type` HTTP header as well? Also, does it then use that same header for CSS and JS as well? I suppose this depends on the web framework (JSF in this case)? 2) That is true; however, there are mistakes and less dedicated people. Simple documentation might be viable, but it is a (little) risk anyway. 3+others) Thanks for the pointers, I will check those out.
Tuukka Mustonen
The XML header is only interesting for the XML tool you're using to process the XML tree (in this particular case, Facelets). Even then, the XML header is fully optional. Facelets also by default uses UTF-8 to process the XML tree. As to the CSS and JS: you need to set the response encoding (which in turn implicitly sets the correct charset in the header). A Filter is a good place for the job.
BalusC
Ok, sounds like UTF-8 is really the way to go then. I feel you've answered my questions in both specific and broad scope, so I'm picking this as the accepted answer. Off to world domination!
Tuukka Mustonen
Yeah, World Domination! Good luck :)
BalusC