views: 4183
answers: 6

I use the following code to save Chinese characters into a .txt file, but when I open it with WordPad, I can't read it.

StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;

FileOutputStream fos = new FileOutputStream(FileName, Append);
for (int i = 0; i < Shanghai_StrBuf.length(); i++) {
    fos.write(Shanghai_StrBuf.charAt(i));
}
fos.close();

What can I do? I know that if I cut and paste Chinese characters into WordPad, I can save them into a .txt file. How can I do that with Java?

+2  A: 

If you can rely on the default character encoding being UTF-8 (or some other Unicode encoding), you may use the following:

    Writer w = new FileWriter("test.txt");
    w.append("上海");
    w.close();

The safest way is to always explicitly specify the encoding:

    Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
    w.append("上海");
    w.close();
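To check that the explicit-encoding version really produced UTF-8 bytes on disk, you can look at the file size: "上海" is two characters that UTF-8 encodes in three bytes each. A quick self-contained sketch (the file name is arbitrary):

```java
import java.io.*;

public class Utf8Check {
    public static void main(String[] args) throws IOException {
        Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
        w.append("\u4E0A\u6D77"); // 上海
        w.close();

        // UTF-8 encodes each of these two characters in three bytes,
        // so the file should be exactly 6 bytes long
        File f = new File("test.txt");
        System.out.println(f.length()); // 6
    }
}
```

If the file were written with a single-byte default encoding instead, the characters could not be represented and you would typically see two `?` bytes.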

P.S. You may use any Unicode characters in Java source code, even in method and variable names, if the -encoding parameter for javac is configured correctly. That makes the source code more readable than the escaped \uXXXX form.

Esko Luontola
I'd like to, but since I use NetBeans, after I cut and paste Chinese into a .java file and save it, it won't show up (I only see ???) when I re-open the file in NetBeans.
Frank
Maybe NetBeans is configured to use some non-Unicode encoding, or the editor's font does not have all the Unicode characters. I don't use NetBeans, but from its help file I see that you can set the encoding at Project Properties | Sources | Encoding.
Esko Luontola
Are you sure which encoding the file was saved in, if you saved it using some other editor?
Esko Luontola
A: 

Here's one way among many. Basically, we're just specifying that the conversion be done to UTF-8 before outputting bytes to the FileOutputStream:

String FileName = "output.txt";

StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;

Writer writer = new OutputStreamWriter(new FileOutputStream(FileName, Append), "UTF-8");
writer.write(Shanghai_StrBuf.toString(), 0, Shanghai_StrBuf.length());
writer.close();

I manually verified this against the images at http://www.fileformat.info/info/unicode/char/ . In the future, please follow Java coding standards, including lower-case variable names. It improves readability.

Matthew Flaschen
A: 

Try this,

StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;

Writer out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(FileName, Append), "UTF-8"));
for (int i = 0; i < Shanghai_StrBuf.length(); i++) {
    out.write(Shanghai_StrBuf.charAt(i));
}
out.close();
+1  A: 

Be very careful with the approaches proposed. Even specifying the encoding for the file as follows:

Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");

will not work reliably if you're running under an operating system like Windows. Even setting the file.encoding system property to UTF-8 does not fix the issue. This is because Java does not write a byte order mark (BOM) for the file. Even if you specify the encoding when writing out to a file, opening the same file in an application like WordPad will display the text as garbage, because WordPad doesn't detect the BOM it expects. I tried running the examples here on Windows (with a platform/container encoding of CP1252).

The following bug exists to describe the issue in Java:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058

The solution for the time being is to write the byte order mark yourself to ensure the file opens correctly in other applications. See this for more details on the BOM:

http://mindprod.com/jgloss/bom.html

and for a more correct solution see the following link:

http://tripoverit.blogspot.com/2007/04/javas-utf-8-and-unicode-writing-is.html
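The workaround described above boils down to writing the BOM character U+FEFF before the text; the UTF-8 encoder turns it into the three bytes EF BB BF that WordPad and Notepad look for. A minimal sketch (file name is arbitrary):

```java
import java.io.*;

public class BomWriter {
    public static void main(String[] args) throws IOException {
        Writer w = new OutputStreamWriter(new FileOutputStream("bom-test.txt"), "UTF-8");
        w.write('\uFEFF');       // the BOM; the UTF-8 encoder emits EF BB BF
        w.write("\u4E0A\u6D77"); // the actual text: 上海
        w.close();
    }
}
```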

Jon
I expected to get a shrimp; instead I found a shark and a shark killer! Thanks. In the "correct solution" you posted, why are the "init();" lines commented out in Close() and read()? Should I uncomment them to run correctly?
Frank
Not entirely sure, but it shouldn't matter for writing purposes, only for reading. If you're reading a UTF-8 file back you need to skip the BOM, as it confuses the heck out of Java - that's what the init method does. Might be worth contacting the blog author to find out the rationale behind it. Sorry I can't be of more help.
Jon
You could possibly scrap the code reading part. Looks like Apache have had a go at creating their own BOMExclusionReader, see: https://issues.apache.org/jira/browse/IO-178
Jon
Java does not automatically write a UTF-8 BOM because in many cases it would be an error to do so. http://unicode.org/faq/utf_bom.html#BOM
McDowell
+4  A: 

There are several factors at work here:

  • Text files have no intrinsic metadata for describing their encoding (for all the talk of angle-bracket taxes, there are reasons XML is popular)
  • The default encoding for Windows is still an 8-bit (or double-byte) "ANSI" character set with a limited range of values - text files written in this format are not portable
  • To tell a Unicode file from an ANSI file, Windows apps rely on the presence of a byte order mark at the start of the file (not strictly true - Raymond Chen explains). In theory, the BOM is there to tell you the endianess (byte order) of the data. For UTF-8, even though there is only one byte order, Windows apps rely on the marker bytes to automatically figure out that it is Unicode (though you'll note that Notepad has an encoding option on its open/save dialogs).
  • It is wrong to say that Java is broken because it does not write a UTF-8 BOM automatically. On Unix systems, it would be an error to write a BOM to a script file, for example, and many Unix systems use UTF-8 as their default encoding. There are times when you don't want it on Windows, either, like when you're appending data to an existing file: fos = new FileOutputStream(FileName,Append);

Here is a method of reliably appending UTF-8 data to a file:

  private static void writeUtf8ToFile(File file, boolean append, String data)
      throws IOException {
    boolean skipBOM = file.isFile() && (file.length() > 0);
    Closer res = new Closer();
    try {
      OutputStream out = res.using(new FileOutputStream(file, append));
      Writer writer = res.using(new OutputStreamWriter(out, Charset
          .forName("UTF-8")));
      if (!skipBOM) {
        writer.write('\uFEFF');
      }
      writer.write(data);
    } finally {
      res.close();
    }
  }

Usage:

  public static void main(String[] args) throws IOException {
    String chinese = "\u4E0A\u6D77";
    boolean append = true;
    writeUtf8ToFile(new File("chinese.txt"), append, chinese);
  }

Note: if the file already exists and you choose to append, and the existing data isn't UTF-8 encoded, the only thing that code will create is a mess.

Here is the Closer type used in this code:

public class Closer implements Closeable {
  private Closeable closeable;

  public <T extends Closeable> T using(T t) {
    closeable = t;
    return t;
  }

  @Override public void close() throws IOException {
    if (closeable != null) {
      closeable.close();
    }
  }
}
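As an aside not in the original answer: on Java 7 and later, try-with-resources gives the same guaranteed-close behavior without a helper class. A sketch of the append method rewritten that way (class and file names are illustrative):

```java
import java.io.*;
import java.nio.charset.Charset;

public class Utf8Append {
    static void writeUtf8ToFile(File file, boolean append, String data)
            throws IOException {
        boolean skipBOM = file.isFile() && file.length() > 0;
        // try-with-resources closes the writer (and the stream beneath it)
        // automatically, even if write() throws
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file, append), Charset.forName("UTF-8"))) {
            if (!skipBOM) {
                writer.write('\uFEFF'); // BOM only at the start of a new file
            }
            writer.write(data);
        }
    }
}
```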

This code makes a Windows-style best guess about how to read the file based on byte order marks:

  private static final Charset[] UTF_ENCODINGS = { Charset.forName("UTF-8"),
      Charset.forName("UTF-16LE"), Charset.forName("UTF-16BE") };

  private static Charset getEncoding(InputStream in) throws IOException {
    charsetLoop: for (Charset encoding : UTF_ENCODINGS) {
      byte[] bom = "\uFEFF".getBytes(encoding);
      in.mark(bom.length);
      for (byte b : bom) {
        if ((0xFF & b) != in.read()) {
          in.reset();
          continue charsetLoop;
        }
      }
      return encoding;
    }
    return Charset.defaultCharset();
  }

  private static String readText(File file) throws IOException {
    Closer res = new Closer();
    try {
      InputStream in = res.using(new FileInputStream(file));
      InputStream bin = res.using(new BufferedInputStream(in));
      Reader reader = res.using(new InputStreamReader(bin, getEncoding(bin)));
      StringBuilder out = new StringBuilder();
      for (int ch = reader.read(); ch != -1; ch = reader.read())
        out.append((char) ch);
      return out.toString();
    } finally {
      res.close();
    }
  }

Usage:

  public static void main(String[] args) throws IOException {
    System.out.println(readText(new File("chinese.txt")));
  }

(System.out uses the default encoding, so whether it prints anything sensible depends on your platform and configuration.)
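One option for that last point (assuming the terminal itself understands UTF-8, which is not guaranteed on Windows consoles) is to wrap System.out in a PrintStream with an explicit encoding; a sketch:

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Bypass the platform default encoding for console output
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("\u4E0A\u6D77");
    }
}
```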

McDowell
I've added code that does a best-guess at reading arbitrary text files.
McDowell
Great! That's exactly what I'm looking for! I wish this were part of Sun's Java package, not something we need to worry about. Thanks!
Frank