views:

76

answers:

2

I have a little Java project where I've set the source files' encoding to UTF-8 (I use a lot of foreign characters not found in the default CP1252).

The goal is to create a text file (on Windows) containing a list of items. When running the class files from Eclipse itself (hitting Ctrl+F11), it creates the file flawlessly, and opening it in another editor (I'm using Notepad++) I can see the characters as I wanted:

┌──────────────────────────────────────────────────┐
│                          Universidade2010 (18/18)│
│                                         hidden: 0│
├──────────────────────────────────────────────────┤

But when I export the project (using Eclipse) as a runnable JAR and run it with `javaw -jar project.jar`, the new file it creates is a mess of question marks:

????????????????????????????????????????????????????
?                          Universidade2010 (19/19)?
?                                         hidden: 0?
????????????????????????????????????????????????????

I've followed some tips on how to use UTF-8 (which seems to be broken by default on Java) to try to correct this, so now I'm using

Writer w = new OutputStreamWriter(fos, "UTF-8");

and writing the BOM header to the file, as in this already-answered question, but still without luck when exporting to a JAR.

Am I missing some property or command-line option so Java knows I want to create UTF-8 files by default?

+2  A: 

I've followed some tips on how to use UTF-8 (which seems to be broken by default on Java)

For historical reasons, Java's encoding defaults to the system encoding (something that made more sense back on Windows 95). This behaviour isn't likely to change. To my knowledge, there isn't anything broken about Java's encoder implementation. Here is a demonstration that writes UTF-8 correctly regardless of the platform encoding:

import java.io.Closeable;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class BomDemo {
  private static final String BOM = "\ufeff";

  public static void main(String[] args) throws IOException {
    String data = "\u250c\u2500\u2500\u2510\r\n\u251c\u2500\u2500\u2524";
    OutputStream out = new FileOutputStream("data.txt");
    Closeable resource = out;
    try {
      // Explicit charset - the platform default is never consulted.
      Writer writer = new OutputStreamWriter(out, Charset.forName("UTF-8"));
      resource = writer;
      writer.write(BOM);
      writer.write(data);
    } finally {
      resource.close();
    }
  }
}

The above code will emit the following text prefixed with a byte order mark:

┌──┐
├──┤

Windows apps like Notepad can infer the encoding from the BOM and decode the file correctly.

Without code, it isn't possible to spot any errors.

Am I missing some property or command-line command so Java knows I want to create UTF-8 files by default?

No - there is no such setting. Some might suggest setting file.encoding on the command line, but this is a bad idea.
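To see why the default matters, here is a minimal sketch (the class name is mine) comparing the platform-default encoding with an explicit UTF-8 encoding for one of the box-drawing characters from the question:

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class EncodingDemo {
  public static void main(String[] args) {
    String s = "\u250c"; // '┌' - not representable in CP1252
    // String.getBytes() with no argument uses the platform default charset.
    byte[] platformBytes = s.getBytes();
    byte[] utf8Bytes = s.getBytes(Charset.forName("UTF-8"));
    System.out.println("default charset: " + Charset.defaultCharset());
    System.out.println("platform bytes:  " + Arrays.toString(platformBytes));
    System.out.println("UTF-8 bytes:     " + Arrays.toString(utf8Bytes)); // [-30, -108, -116]
    // On a CP1252 system the platform bytes are [63] - a literal '?',
    // which is exactly the corruption seen when running the JAR with javaw.
  }
}
```

Eclipse's Run configurations pass `-Dfile.encoding` matching the workspace settings, which is why the same code behaves differently inside and outside the IDE.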


I wrote a more comprehensive blog post on the subject here.


This is a reworking of your code:

import java.io.Closeable;
import java.io.IOException;
import java.io.PrintWriter;

public class Printer implements Closeable {
  private PrintWriter pw;
  private boolean error;

  public Printer(String name) {
    try {
      // This constructor encodes everything printed as UTF-8.
      pw = new PrintWriter(name, "UTF-8");
      pw.print('\uFEFF'); // BOM
      error = false;
    } catch (IOException e) {
      error = true;
    }
  }

  public void print(String s) {
    if (pw == null) return;
    pw.print(s);
    pw.flush();
  }

  public boolean checkError() { return error || (pw != null && pw.checkError()); }

  @Override public void close() { if (pw != null) pw.close(); }
}

Most of the functionality you want already exists in PrintWriter. Note that you should provide some mechanism to check for underlying errors and close the stream (or you risk leaking file handles).
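As a rough, self-contained sketch of that approach (file name and class name are mine), writing the BOM and some box-drawing characters through a UTF-8 PrintWriter and then inspecting the first three bytes on disk:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintWriter;

public class PrintWriterDemo {
  public static void main(String[] args) throws IOException {
    File f = new File("demo.txt");
    // PrintWriter can encode to UTF-8 directly; no manual byte handling needed.
    PrintWriter pw = new PrintWriter(f, "UTF-8");
    try {
      pw.print('\uFEFF');              // BOM, so Notepad can detect the encoding
      pw.print("\u250c\u2500\u2510");  // ┌─┐
    } finally {
      pw.close();
    }
    // The UTF-8 encoding of U+FEFF is EF BB BF - the first bytes of the file.
    FileInputStream in = new FileInputStream(f);
    try {
      System.out.printf("%02X %02X %02X%n", in.read(), in.read(), in.read()); // prints EF BB BF
    } finally {
      in.close();
    }
  }
}
```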

McDowell
+1 for using Unicode escapes (`\u250c` etc) for writing down those special characters in the Java source file. This eliminates one possible source of problems: Different text editors might save the Java source file in different encodings.
cygri
Unfortunately the problem still remains, I've added the partial code in a new answer
RuntimeError
A: 

Hi again. The problem is not in creating the file itself, because while developing, the file is output correctly (with the Unicode characters).

The class that creates the file now looks like this (following the suggestion of using the Charset class):

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class Printer {

    File f;
    FileOutputStream fos;
    Writer w;
    final byte[] utf8_bom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    public Printer(String filename){
        f = new File(filename);
        try {
            fos = new FileOutputStream(f);
            w = new OutputStreamWriter(fos, Charset.forName("UTF-8"));
            fos.write(utf8_bom);
        } catch (FileNotFoundException e) {
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void print(String s) {
        if(fos != null){
            try {
                fos.write(s.getBytes());
                fos.flush();
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
    }

}

And all characters being used are defined as such:

private final char pipe = '\u2502';         /* │ */
private final char line = '\u2500';         /* ─ */
private final char pipeleft = '\u251c';     /* ├ */
private final char piperight = '\u2524';    /* ┤ */
private final char cupleft = '\u250c';      /* ┌ */
private final char cupright = '\u2510';     /* ┐ */
private final char cdownleft = '\u2514';    /* └ */
private final char cdownright = '\u2518';   /* ┘ */

The problem remains: when outputting to a file simply by running the project in Eclipse, the file comes out perfect, but after deploying the project to a JAR and running it, the output file's formatting is destroyed (I've found out that the characters are replaced by '?').

I've come to think this is not a problem with the code but with deploying it into a JAR file. I think Eclipse is compiling the source files to CP1252 or something, but even replacing all Unicode characters with their code constants didn't help.

RuntimeError
@RuntimeError - the bug is here: `fos.write(s.getBytes());` This call converts the string to bytes using the default character set (ANSI on most Windows installs) and writes them to the byte stream. You should be using your `Writer` to encode the bytes. I'll update my answer with an implementation. _FYI: this is not an answer to the question - it is generally better to edit your question with more detail._
McDowell
Thank you, the problem is solved. It seems Eclipse runs the Java project in a UTF-8 environment by default (if set like that), as opposed to javaw, which uses the default ANSI encoding; that was why I was getting different outputs from the same code. Also, thanks for the heads-up; I'm fairly new here and didn't know that.
RuntimeError