Hi, I have a file that contains some non-UTF-8 characters (it is encoded in something like "ISO-8859-1"), and I want to convert that file to UTF-8 (or at least read it correctly). How can I do it?

The code looks like this:

File file = new File("some_file_with_non_utf8_characters.txt");

/* some code to convert the file to an utf8 file */

...

Edit: added an example encoding.

A: 

Do you only want to read it as UTF-8? What I did recently for a similar problem was to start the JVM with -Dfile.encoding=UTF-8 and then read and print as normal. I don't know whether that is applicable in your case.

With that option:

System.out.println("á é í ó ú");

prints the characters correctly. Otherwise it prints a ? symbol.
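
If the goal is only to print UTF-8, here is a minimal sketch that sets the encoding explicitly instead of relying on the JVM default. The class name `Utf8Console` and the helper `utf8` are made up for illustration, and it assumes the terminal itself interprets UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Console {
    // Hypothetical helper: wraps any stream in a PrintStream that encodes text as UTF-8,
    // regardless of the platform default charset.
    static PrintStream utf8(OutputStream target) throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        PrintStream out = utf8(System.out);
        // Unicode escapes keep the example independent of the source file's own encoding.
        out.println("\u00e1 \u00e9 \u00ed \u00f3 \u00fa");
    }
}
```

This only affects what is written through the wrapped stream; it does not change how files are read.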

Ismael
http://bugs.sun.com/view_bug.do?bug_id=4163515
McDowell
@McD: I was going to post the same comment. This is a misinterpretation of what `-Dfile.encoding` is for.
BalusC
I see, it really is a mess.
Ismael
+3  A: 

You need to know the encoding of the input file. For example, if the file is in Latin-1, you would do something like this:

        FileInputStream fis = new FileInputStream("test.in");
        InputStreamReader isr = new InputStreamReader(fis, "ISO-8859-1");
        Reader in = new BufferedReader(isr);
        FileOutputStream fos = new FileOutputStream("test.out");
        OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
        Writer out = new BufferedWriter(osw);

        int ch;
        while ((ch = in.read()) > -1) {
            out.write(ch);
        }

        out.close();
        in.close();
ZZ Coder
Summarized: **read** it in the file's own encoding and then **write** it in the new encoding.
BalusC
+2  A: 
  String charset = "ISO-8859-1"; // or whatever encoding the file actually uses
  BufferedReader in = new BufferedReader( 
      new InputStreamReader (new FileInputStream(file), charset));
  String line;
  while( (line = in.readLine()) != null) { 
    ....
  }

There you have the text decoded. You can then write it out with the symmetric Writer/OutputStream classes, in whatever encoding you prefer (e.g. UTF-8).
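
A rough sketch of that symmetric write side, assuming Java 7+ for try-with-resources. The class name `LineByLineConverter` and the choice of "\n" as the output line terminator are my own assumptions, not something the question specifies.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

class LineByLineConverter {
    // Reads 'source' line by line in srcCharset and writes each line to 'target' in tgtCharset.
    static void convert(File source, String srcCharset, File target, String tgtCharset)
            throws IOException {
        try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(new FileInputStream(source), srcCharset));
             BufferedWriter out = new BufferedWriter(
                     new OutputStreamWriter(new FileOutputStream(target), tgtCharset))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.write(line);  // readLine() has already stripped the line terminator
                out.write('\n');  // assumption: always emit "\n", including after the last line
            }
        }
    }
}
```

Because readLine() drops the original terminators, this version turns "\r\n" into "\n" and adds a terminator after the last line even if the input had none.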

leonbloy
It is not necessary to read line by line.
OscarRyz
Of course not, it's just one possible way.
leonbloy
@leonbloy - the potential problem with reading line-by-line is that you can alter line endings / separations. For example, if the last line has no end-of-line, you will add one.
Stephen C
That's totally true. It's also true that frequently that effect is actually desirable (more a "polishing" than an "alteration"). But, yes, one must be aware of that.
leonbloy
+3  A: 

The following code converts a file from srcEncoding to tgtEncoding:

public static void transform(File source, String srcEncoding, File target, String tgtEncoding) throws IOException {
    BufferedReader br = null;
    BufferedWriter bw = null;
    try{
        br = new BufferedReader(new InputStreamReader(new FileInputStream(source),srcEncoding));
        bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(target), tgtEncoding));
        char[] buffer = new char[16384];
        int read;
        while ((read = br.read(buffer)) != -1)
            bw.write(buffer, 0, read);
    } finally {
        try {
            if (br != null)
                br.close();
        } finally {
            if (bw != null)
                bw.close();
        }
    }
}
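
On Java 7 or later, the nested finally blocks can probably be replaced by try-with-resources. The following is a sketch of that variant with the same signature; the wrapper class name `EncodingTransformer` is made up here.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

class EncodingTransformer {
    public static void transform(File source, String srcEncoding, File target, String tgtEncoding)
            throws IOException {
        // Resources are closed automatically in reverse order: bw first, then br.
        try (BufferedReader br = new BufferedReader(
                     new InputStreamReader(new FileInputStream(source), srcEncoding));
             BufferedWriter bw = new BufferedWriter(
                     new OutputStreamWriter(new FileOutputStream(target), tgtEncoding))) {
            char[] buffer = new char[16384];
            int read;
            // Copy in chunks; read() returns -1 at end of stream.
            while ((read = br.read(buffer)) != -1) {
                bw.write(buffer, 0, read);
            }
        }
    }
}
```

If the copy loop throws and a close() call then fails as well, the close failure is attached to the original exception as a suppressed exception instead of replacing it.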
Eyal Schneider
Ignore my comment, you are right. Btw, haven't seen this style of closing in finally before. Clever.
BalusC