I'm trying to read some text from an HTML file, modify it in a specific way, and write the result to a new HTML file. The problem is that the text is not in English, and as a result some characters are replaced with black-and-white "?" marks. My HTML file contains <meta http-equiv="Content-Type" content="text/html; charset=utf-8">. What am I doing wrong? Maybe not the right Readers and Writers?

StringBuilder sb = new StringBuilder();
BufferedReader br = new BufferedReader(new FileReader("inputFile.html"));
String line;
while ( (line = br.readLine()) != null) {
     sb.append(line);
}
String result = doSomeChanges(sb);
BufferedWriter out = new BufferedWriter(new FileWriter("outputFile.html")); 
out.write(result); 
out.close(); 
+1  A: 

FileReader and FileWriter use the platform default encoding, which isn't what you want here. (I've always viewed this as a fatal flaw in these APIs.)

Instead, use FileInputStream and FileOutputStream, wrapped in an InputStreamReader and OutputStreamWriter respectively. This allows you to explicitly set the encoding - which in this case should be UTF-8.
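As a rough sketch of what that wrapping could look like end to end (the class name Utf8RoundTrip and the helpers readUtf8 and writeUtf8 are made up for illustration, and it assumes Java 7 or later for StandardCharsets and try-with-resources):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question.
public class Utf8RoundTrip {

    // Reads the whole file, decoding its bytes explicitly as UTF-8.
    static String readUtf8(String path) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                // readLine() strips the line terminator, so add one back
                // (the code in the question drops it).
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Writes the text, encoding it explicitly as UTF-8.
    static void writeUtf8(String path, String text) throws IOException {
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(path), StandardCharsets.UTF_8))) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Round-trip some non-ASCII text through a temporary file.
        File tmp = File.createTempFile("utf8-demo", ".html");
        tmp.deleteOnExit();
        String text = "<p>\u041f\u0440\u0438\u0432\u0435\u0442</p>\n";
        writeUtf8(tmp.getPath(), text);
        System.out.println(text.equals(readUtf8(tmp.getPath())));
    }
}
```

In the question's code, readUtf8("inputFile.html") would stand in for the FileReader loop and writeUtf8("outputFile.html", result) for the FileWriter part. Passing a Charset constant rather than the string "UTF-8" avoids having to handle UnsupportedEncodingException.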

Jon Skeet
+3  A: 

Maybe not the right Readers and Writers?

Exactly. FileReader and FileWriter are garbage; forget that they exist. They implicitly use the platform default encoding and (before Java 11, which added constructors that take a Charset) do not allow you to override this default.

Instead, use this:

BufferedReader br = new BufferedReader(
    new InputStreamReader(new FileInputStream("inputFile.html"), "UTF-8"));

BufferedWriter out = new BufferedWriter(
    new OutputStreamWriter(new FileOutputStream("outputFile.html"), "UTF-8"));
Michael Borgwardt
It works. Thank you very much!
brain_damage
+1  A: 

You use BufferedReader, which ignores the HTML structure of the file. That's why <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> has no effect.

Try this one:

BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("zzz"), "UTF-8"));
falagar
No, it's got nothing to do with BufferedReader itself. It's the use of FileReader which is the problem.
Jon Skeet
A: 

To make life easier, you can also use FileUtils from the Apache Commons IO project. It has read and write methods for files and strings that take the encoding as a parameter.

K. Claszen