views:

350

answers:

5

How important is it to save your source code in UTF-8 format?

Eclipse on Windows uses the Cp1252 character encoding by default. With Cp1252, bytes that are not valid UTF-8 can end up saved in source files, and I have seen this happen when a comment is copied and pasted from a Word document.

The reason I ask is that, out of habit, I set up the Maven encoding to be UTF-8, and recently it has caught a few "unmappable character" errors.
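For reference, the Maven setting in question is the standard `project.build.sourceEncoding` property, which the compiler and resources plugins pick up instead of the platform default:

```xml
<!-- pom.xml: make the build independent of the platform's default encoding -->
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
```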

(update) Please add any reasons for doing so and why; are there some common gotchas that should be known?

(update) What is my goal? To find the best practice, so that when asked "why should we use UTF-8?" I have a good answer; right now I don't.

+2  A: 

Yes; unless your compiler/interpreter cannot work with UTF-8 files, it is definitely the way to go.

poke
...which in javac can be controlled with `-encoding` argument by the way. Good point though, +1.
BalusC
"it is definitely the way to go" because ...
JamesC
+3  A: 

What is important, at least, is that you are consistent with the encoding used, to avoid red herrings. Thus not X here, Y there and Z elsewhere. Save source code in encoding X. Set code input to encoding X. Set code output to encoding X. Set character-based FTP transfer to encoding X. Et cetera.

Nowadays UTF-8 is a good choice, as it covers every character the human world is aware of and is supported pretty much everywhere. So, yes, I would set the workspace encoding to it as well; I use it myself.
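A minimal sketch of "set code input/output to encoding X" in Java: pass the charset explicitly to readers and writers instead of relying on the platform default (the file name here is an arbitrary example):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf8Io {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".txt"); // arbitrary example file
        // Write with an explicit charset instead of the platform default
        try (Writer out = new OutputStreamWriter(new FileOutputStream(f), StandardCharsets.UTF_8)) {
            out.write("caf\u00e9"); // "café"
        }
        // Read it back with the same charset; the round trip is lossless
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
        f.delete();
    }
}
```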

BalusC
What herrings? If source is built on Windows and executed on *nix, would that be a good reason to define your encoding?
JamesC
I assume these are rare but very possible.
JamesC
For example, yes. The default encoding differs between those platforms. This does not affect the technical functionality of the Java code in any way (Java literals/keywords are already part of ASCII, which is basically the base of almost all other encodings, EBCDIC excepted, but that's a different story), but it *may* result in erroneous input/output.
BalusC
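The erroneous input/output described above is easy to reproduce: bytes written as UTF-8 but decoded with a different charset garble every non-ASCII character. A small sketch:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String s = "\u00e9"; // "é"
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8); // two bytes: 0xC3 0xA9
        // Decoding those UTF-8 bytes as ISO-8859-1 turns one character into two wrong ones
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled); // "Ã©"
    }
}
```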
No, Java identifiers are not necessarily restricted to ASCII characters. This is a valid int declaration (at least javac and Eclipse accept it): int é\u1212;
penpen
@penpen: I was talking about **literals/keywords** like `public`, `class`, `null`, etc, not about identifiers.
BalusC
Sorry, I should have taken my time before commenting.
penpen
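To illustrate the distinction drawn in the comments above: keywords are plain ASCII, but identifiers may use any character the runtime classifies as a Java letter, and that classification is exposed via `java.lang.Character`. A quick sketch (the variable name is just an example; Unicode escapes are legal anywhere in Java source):

```java
public class UnicodeIdentifierDemo {
    public static void main(String[] args) {
        int \u00e9 = 42; // declares a variable literally named "é"
        System.out.println(\u00e9);
        // The identifier rules are queryable at runtime:
        System.out.println(Character.isJavaIdentifierStart('\u00e9')); // true
        System.out.println(Character.isJavaIdentifierStart(';'));      // false
    }
}
```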
+8  A: 

What is your goal? Balance your needs against the pros and cons of this choice.

UTF-8 Pros

  • allows use of all character literals without \uHHHH escaping

UTF-8 Cons

  • using non-ASCII character literals without \uHHHH escapes increases the risk of character corruption
    • font and keyboard issues can arise
    • you need to document and enforce the use of UTF-8 in all tools (editors, compilers, build scripts, diff tools)
  • beware the byte order mark
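On the byte order mark: it is U+FEFF, which UTF-8 encodes as the three bytes EF BB BF at the start of a file. A quick sketch of what to look for (note, as an aside, that javac does not skip a leading BOM and reports it as an illegal character):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomDemo {
    public static void main(String[] args) {
        // U+FEFF encoded as UTF-8 yields the three-byte BOM signature
        byte[] bom = "\uFEFF".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(bom)); // [-17, -69, -65] == EF BB BF
    }
}
```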

ASCII Pros

  • character/byte mappings are shared by a wide range of encodings
    • makes source files very portable
    • often obviates the need for specifying encoding meta-data (since the files would be identical if they were re-encoded as UTF-8, Windows-1252, ISO 8859-1 and most things short of UTF-16 and/or EBCDIC)
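That portability claim can be checked directly: a pure-ASCII string encodes to identical bytes under US-ASCII, UTF-8 and ISO 8859-1 (and, where the charset is available, Windows-1252), but not under UTF-16. A small sketch:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiPortability {
    public static void main(String[] args) {
        String ascii = "public class Foo {}";
        byte[] base = ascii.getBytes(StandardCharsets.US_ASCII);
        // Identical bytes: ASCII is a shared subset of these encodings
        System.out.println(Arrays.equals(base, ascii.getBytes(StandardCharsets.UTF_8)));      // true
        System.out.println(Arrays.equals(base, ascii.getBytes(StandardCharsets.ISO_8859_1))); // true
        // Not identical: UTF-16 uses at least two bytes per character (plus a BOM)
        System.out.println(Arrays.equals(base, ascii.getBytes(StandardCharsets.UTF_16)));     // false
    }
}
```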

ASCII Cons

  • limited character set
  • this isn't the 1960s

Note: ASCII is 7-bit, not "extended" and not to be confused with Windows-1252, ISO 8859-1, or anything else.

McDowell
+1 Sums it up nicely :) Just weigh it up for yourself.
BalusC
What is my goal? To find the best practice, so that when asked "why should we use UTF-8?" I have a good answer - thanks for the post.
JamesC
+1 for "beware the byte order mark"
finnw
There is only one good reason to store sources as UTF-8: if you comment in a language that needs non-ASCII characters. For UI/messages the strings should be stored in some kind of resource files/message catalogs. Good internationalization practice.
Mihai Nita
+2  A: 

I don't think there's really a straight yes or no answer to this question. I would say that the following guidelines should be used to pick an encoding format, in order of priority listed (highest to lowest):

1) Pick an encoding your tool chain supports. This is a lot easier than it used to be. Even in recent memory, a lot of compilers and languages basically supported only ASCII, which more or less forced developers into coding in Western European languages. These days, many of the newer languages support other encodings, and almost all decent editors and IDEs support a tremendously long list of encodings. Still... there are just enough holdouts that you need to double-check before you settle on an encoding.

2) Pick an encoding that supports as many of the alphabets you wish to use as possible. I place this as a secondary priority because frankly, if your tools don't support it it doesn't really matter whether you like the encoding better or not.

UTF-8 is an excellent choice in many circumstances in today's world. It's an ugly, inelegant format, but it solves a whole host of problems (namely, compatibility with legacy code) that break other encodings, and it seems to be becoming more and more the de facto standard among character encodings. It supports every major alphabet, darn near every editor on the planet supports it now, and a whole host of languages/compilers support it, too. But as I mentioned above, there are just enough legacy holdouts that you need to double-check your tool chain from end to end before you settle on it definitively.

Russell Newquist
Thanks Russell.
JamesC
Strongly disagree with the "ugly, inelegant format" part. UTF-8 is pretty much a masterpiece as far as I'm concerned: backwards-compatible, more space-efficient than most people think (yes, even for Asian languages), can be picked up mid-stream, easily identifiable in most cases, doesn't require a BOM, binary-sortable...
Cowan
Don't misunderstand me - given the constraints under which they were working, I'm quite impressed with the format. But the honest reality is that if we were starting from scratch today, we'd just be using a straight 32 or 64-bit character set, end of story. Pure elegance in its simplest form.
Russell Newquist
+2  A: 

Eclipse's default setting of using the platform default encoding is a poor decision IMHO. I found it necessary to change the default to UTF-8 shortly after installing it, because some of my existing source files already used it (probably due to snippets copied/pasted from web pages).

The Java Language and API specs require UTF-8 support so you're definitely okay as far as the standard tools go, and it's a long time since I've seen a decent editor that did not support UTF-8.

Even in projects that use JNI, your C sources will normally be in US-ASCII which is a subset of UTF-8 so having both open in the same IDE will not be a problem.

finnw