ansaurus

Question

Converting HTML character encoding in Java

Answer 1

+2 A:

If you know the page is encoded as UTF-8, tell the String constructor to use UTF-8 when it interprets the bytes.

However, I am not sure this is the extent of your problem. You already have "text" before you try to "convert" it. That means something has already interpreted the bytes of the page as a String, according to some encoding. If that was the wrong encoding, nothing you do later is guaranteed to fix it, because the wrong decoding may already have discarded information.

Instead, you need to fix this upstream, at the point where the raw bytes of the page are first decoded:

byte[] bytesOfThePage = ...;
String text = new String(bytesOfThePage, "UTF-8");
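To illustrate why a wrong decoding upstream cannot always be repaired, here is a minimal sketch. The class and method names (DecodeDemo, decodeUtf8, decodeWrongly) and the sample text are assumptions made for the example; US-ASCII stands in for whatever wrong encoding the upstream code happened to use.

```java
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Decode the raw bytes with the encoding the page actually uses.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Simulates an upstream component that guessed the wrong encoding.
    // Bytes that are not valid US-ASCII become U+FFFD (the replacement
    // character), so the original byte values are lost.
    static String decodeWrongly(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "café" written with a Unicode escape so the source file's own
        // encoding does not matter.
        byte[] bytesOfThePage = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String good = decodeUtf8(bytesOfThePage);
        String bad = decodeWrongly(bytesOfThePage);

        System.out.println(good.length()); // prints 4
        System.out.println(bad.length());  // prints 5: "caf" plus two U+FFFD
    }
}
```

Once the two bytes of the accented character have been turned into U+FFFD, re-encoding the String cannot bring them back. Some wrong encodings (ISO-8859-1, for instance) happen to be reversible, but that is not something to rely on.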

Sean Owen 2010-01-26 17:09:24

Answer 2

A:

The problem most likely lies at the point where you read, write and/or display those characters.

If you're reading the characters with a Reader, then you need to construct an InputStreamReader first, using the 2-argument constructor so that you can pass the correct encoding (here, UTF-8) as the 2nd argument. E.g.

reader = new InputStreamReader(url.openStream(), "UTF-8");
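A slightly fuller sketch of the reading side follows. It is an illustration only: ReadDemo and readAll are hypothetical names, and an in-memory ByteArrayInputStream stands in for url.openStream() so the example runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReadDemo {
    // Reads the whole stream into a String, decoding with the given charset
    // instead of the platform default.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes a UTF-8 encoded page would deliver.
        byte[] pageBytes = "na\u00efve".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(pageBytes), StandardCharsets.UTF_8);
        System.out.println(text.equals("na\u00efve")); // prints true
    }
}
```

Passing a Charset object rather than the "UTF-8" name avoids the checked UnsupportedEncodingException; both constructor forms exist on InputStreamReader.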

If you're writing those characters to a file, for example, then you need to construct an OutputStreamWriter using the 2-argument constructor so that you can pass the correct encoding (here, UTF-8) as the 2nd argument. E.g.

writer = new OutputStreamWriter(new FileOutputStream("/page.html"), "UTF-8");
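The writing side can be sketched the same way. WriteDemo, writeUtf8 and writeToTempFile are hypothetical names, and a temporary file is used instead of "/page.html" so the example can run anywhere; it reads the bytes back to show what actually landed on disk.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class WriteDemo {
    // Writes text to the file as UTF-8, regardless of the platform default.
    static void writeUtf8(Path file, String text) {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes text to a temporary file and returns the raw bytes on disk.
    static byte[] writeToTempFile(String text) {
        try {
            Path tmp = Files.createTempFile("page", ".html");
            try {
                writeUtf8(tmp, text);
                return Files.readAllBytes(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] onDisk = writeToTempFile("caf\u00e9");
        // 'c', 'a', 'f' take one byte each; the accented e takes two in UTF-8.
        System.out.println(onDisk.length); // prints 5
    }
}
```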

If you're writing it all plain vanilla to the stdout (e.g. System.out.println(line) and so on), then you need to ensure that the stdout itself is using the correct encoding (here, UTF-8). In an IDE such as Eclipse you can configure it by Window > Preferences > General > Workspace > Text file encoding.
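On the Java side, one way to control the bytes sent to stdout is to wrap System.out in a PrintStream with an explicit encoding. This is a sketch under assumptions: StdoutDemo and utf8PrintStream are hypothetical names, and the console or IDE must still be set to interpret the output as UTF-8 for the characters to display correctly.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class StdoutDemo {
    // Wraps a stream in a PrintStream that encodes characters as UTF-8
    // instead of using the platform default encoding.
    static PrintStream utf8PrintStream(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream utf8Out = utf8PrintStream(System.out);
        // Whether this displays correctly depends on the console's encoding.
        utf8Out.println("caf\u00e9");
    }
}
```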

BalusC 2010-01-26 17:40:36
