Hi,

I just got a project in Borland JBuilder 2006 that I cannot even build. I have two resource files, one with Simplified Chinese text and the other in Traditional Chinese. When I try to build the project the text is misinterpreted and it sees an "illegal escape character".

Now, if I set the encoding in Project -> Project Properties -> General -> Encoding to GB2312, the Simplified Chinese text shows up correctly. However, the Traditional Chinese resource is still garbled.

I think that for Traditional Chinese this setting should be Big5, but even that does not work. And when I do set it to Big5, the Simplified Chinese text gets corrupted.

The previous developer who was working on this left without getting a chance to show me how to build the project.

Any ideas?

Thanks,

kreb

A:

They're called "Res_SChinese.java" and "Res_TChinese.java"

I assume that these must be Java source files, though I am surprised that they are in different encodings.

Having source files in multiple encodings is highly undesirable. If you don't know what character set a source file has, you can use the ICU project libraries to help you guess:

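  import java.io.BufferedInputStream;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStream;

  import com.ibm.icu.text.CharsetDetector;
  import com.ibm.icu.text.CharsetMatch;

  // The class name is illustrative; ICU4J must be on the classpath.
  public class DetectCharset {
    // Prints ICU's guesses for the encoding of the file named in args[0].
    public static void main(String[] args) throws IOException {
      // The detector needs a stream that supports mark/reset, hence the buffering.
      InputStream file = new BufferedInputStream(new FileInputStream(args[0]));
      try {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(file);
        String tableTemplate = "%10s %10s %8s%n";
        System.out.format(tableTemplate, "CONFIDENCE",
            "CHARSET", "LANGUAGE");
        // One row per candidate charset
        for (CharsetMatch match : detector.detectAll()) {
          System.out.format(tableTemplate, match.getConfidence(),
              match.getName(), match.getLanguage());
        }
      } finally {
        file.close();
      }
    }
  }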

Note that the number of Chinese character encodings it can detect is limited (ISO-2022-CN, GB18030 and Big5), but it might at least help you find out whether everything is simply encoded in a Unicode transformation format such as UTF-8 or UTF-16.
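If you already have a candidate encoding in mind (from the detector, or a guess such as GB2312 or Big5), a strict decode can at least rule out encodings in which the bytes are not valid. The following is only a sketch using the standard java.nio.charset classes; the class name StrictDecodeCheck and the method decodesCleanly are made up for this example. A clean decode does not prove the guess is right, because the Chinese double-byte encodings use overlapping byte ranges, so treat a "true" as "not ruled out".

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of ICU or the JDK.
public class StrictDecodeCheck {

  // Returns true if the bytes decode in the named charset without any
  // malformed or unmappable sequences; false otherwise.
  public static boolean decodesCleanly(byte[] bytes, String charsetName) {
    try {
      Charset.forName(charsetName).newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
    // The candidate list is an assumption; adjust it to your own guesses.
    for (String name : new String[] {"UTF-8", "GB18030", "Big5"}) {
      if (Charset.isSupported(name)) {
        System.out.println(name + ": " + decodesCleanly(bytes, name));
      }
    }
  }
}
```

Running it against Res_SChinese.java and Res_TChinese.java in turn should show which candidates can be excluded for each file.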


Eclipse (JBuilder is Eclipse-based now, isn't it?) can set encodings for individual files. You can set the encoding Eclipse uses for a file by right-clicking it and selecting Properties. The encoding is under the Resource properties. However, this is difficult to manage and won't apply to any external tools you use (like an Ant build script).

It is possible to compile files in different encodings using external tools. For example:

javac -encoding GB18030 Foo.java

But if these classes have interdependencies, that is going to get painful fast.


Faced with multiple encodings, I would translate all the files to a single encoding. There are a couple of options here.

Use a Latin-1 subset

Java supports Unicode escape sequences in source files. So, the Unicode character U+6874 桴 can be written as the literal \u6874. The JDK tool native2ascii can be used to convert your Java files so that they contain only Latin-1 characters and Unicode escapes.

native2ascii -encoding GB2312 FooIn.java FooOut.java

The resultant files will probably compile anywhere without problem, but might be a nightmare for anyone reading/editing the files.
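To make the escaping concrete, here is a minimal sketch of the kind of transformation involved. It is not native2ascii itself: the class name UnicodeEscaper and the method escape are invented for this example, and the choice to escape everything outside 7-bit ASCII (rather than outside Latin-1) is an assumption made to keep the output safe in any encoding. It works on UTF-16 code units, so characters outside the Basic Multilingual Plane come out as two escapes, which is also how Java source expects them.

```java
// Hypothetical helper illustrating Unicode escapes; not the JDK tool.
public class UnicodeEscaper {

  // Copies ASCII characters unchanged and replaces every other UTF-16
  // code unit with a backslash, the letter u and four hex digits.
  public static String escape(String text) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c < 0x80) {
        sb.append(c);
      } else {
        sb.append(String.format("\\u%04x", (int) c));
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Escapes each command-line argument and prints the result.
    for (String arg : args) {
      System.out.println(escape(arg));
    }
  }
}
```

The text would first have to be read with the correct source encoding (GB2312 or Big5 here) for the escapes to come out right, which is the job of the -encoding flag in the native2ascii command above.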

Use GB18030

GB18030 is a huge character set, so if this is your native encoding, it might be an idea to use that (otherwise, if I was going this route, I'd use UTF-8).

You can use code like this to perform the transformation:

  import java.io.*;
  import java.nio.charset.Charset;

  // The class name is illustrative.
  public class ChangeEncoding {
    public static void main(String[] args) throws IOException {
      changeEncoding("in_cn.txt", Charset.forName("GBK"),
          "out_cn.txt", Charset.forName("GB18030"));
    }

    // Decodes inFile using inCharset and writes it to outFile using outCharset.
    private static void changeEncoding(String inFile,
        Charset inCharset, String outFile, Charset outCharset)
        throws IOException {
      Reader reader = new InputStreamReader(
          new FileInputStream(inFile), inCharset);
      try {
        Writer writer = new OutputStreamWriter(
            new FileOutputStream(outFile), outCharset);
        try {
          copy(reader, writer);
        } finally {
          writer.close(); // also flushes buffered output
        }
      } finally {
        reader.close();
      }
    }

    // Copies all characters from reader to writer in 1024-char chunks.
    private static void copy(Reader reader, Writer writer)
        throws IOException {
      char[] cbuf = new char[1024];
      while (true) {
        int r = reader.read(cbuf);
        if (r < 0) { break; }
        writer.write(cbuf, 0, r);
      }
    }
  }
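Before settling on a target encoding for a conversion like the one above, it may be worth checking that the target can actually represent every character in the text, since an encoder will otherwise silently substitute replacement characters. The sketch below uses the standard CharsetEncoder.canEncode method; the class name EncodabilityCheck and the sample string are made up for this example.

```java
import java.nio.charset.Charset;

// Hypothetical helper, shown only to illustrate the check.
public class EncodabilityCheck {

  // Returns true if every character of text can be encoded in the named charset.
  public static boolean canEncode(String text, String charsetName) {
    return Charset.forName(charsetName).newEncoder().canEncode(text);
  }

  public static void main(String[] args) {
    // U+6C49 is a Simplified form and U+6F22 the Traditional form of the
    // same character, written as escapes so this file is encoding-neutral.
    String sample = "\u6c49\u6f22";
    // The candidate list is an assumption; some JVMs may lack some charsets.
    for (String name : new String[] {"GB2312", "Big5", "GB18030", "UTF-8"}) {
      if (Charset.isSupported(name)) {
        System.out.println(name + ": " + canEncode(sample, name));
      } else {
        System.out.println(name + ": not available in this JVM");
      }
    }
  }
}
```

In practice you would pass the decoded contents of both resource files instead of the sample string; a charset that reports true for both is a workable single encoding for the project.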


If I open them in Notepad, i can view them both properly even with just the locale set to Chinese (PRC)

Notepad uses a heuristic character encoding detection mechanism. It doesn't always work.

McDowell
Damn, great answer man! +100000000 :D Thanks.. And yes, the files seem to be in multiple encodings :S I did find a quick solution though: it turns out that setting the locale for non-Unicode programs to Chinese (PRC) isn't enough, I had to change the Formats setting to Chinese (PRC) as well. That enabled me to compile the project fine (and view the files OK). However, your post has been quite helpful and I might just use it later on to convert them all to UTF8.. :) Thanks :)
krebstar