We are trying to download the source of web pages, but some specific characters (such as ü, ö, ş, ç) do not display properly because of character encoding. We tried the following code to convert the encoding of the string (the "text" variable):

byte[] xyz = text.getBytes();
text = new String(xyz,"windows-1254"); 

We observed that if the encoding is UTF-8, we still cannot see the pages correctly. What should we do?

+2  A: 

Tell the String constructor to use the UTF-8 encoding to interpret the bytes, if you know the page encodes its contents as UTF-8.

However, I am not sure this is the extent of your problem. You already have "text" before trying to "convert" it. This means something has already interpreted the bytes of the page as a String, according to some encoding. If that was the wrong encoding, some of the original bytes may already be lost, and nothing you do later is guaranteed to fix it.
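To illustrate why a wrong decode may not be recoverable, here is a minimal sketch. It assumes a page whose bytes are really ISO-8859-1 (chosen only because every JVM ships it; the class and method names are made up for the example) and decodes them as UTF-8 by mistake. The invalid byte sequences become the replacement character U+FFFD, so re-encoding cannot restore the original bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyDecodeDemo {
    // Decodes with the wrong charset (UTF-8), then re-encodes with the real one,
    // to check whether the original bytes survive the round trip.
    static byte[] roundTripThroughWrongCharset(byte[] original) {
        String garbled = new String(original, StandardCharsets.UTF_8);
        return garbled.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "üöç" written with Unicode escapes so the source file encoding does not matter.
        byte[] original = "\u00fc\u00f6\u00e7".getBytes(StandardCharsets.ISO_8859_1);
        byte[] recovered = roundTripThroughWrongCharset(original);
        System.out.println("bytes survived: " + Arrays.equals(original, recovered));
    }
}
```

Running this should report that the bytes did not survive, which is why patching the String afterwards is unreliable.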

Instead you need to fix this upstream.

byte[] bytesOfThePage = ...;
String text = new String(bytesOfThePage, "UTF-8");
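One possible way to obtain the raw bytes before any decoding happens is sketched below. The helper name readAndDecode is hypothetical, and the sketch assumes you already know the charset the page uses; it only reads the stream into a byte buffer and decodes once at the end.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

public class RawPageDecoder {
    // Reads every byte from the stream first, then decodes them in one step
    // with the charset supplied by the caller.
    static String readAndDecode(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }
}
```

It could be called as readAndDecode(url.openStream(), StandardCharsets.UTF_8), where url is your own java.net.URL instance.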
Sean Owen
A: 

The problem most likely lies at the point where you're reading, writing and/or displaying those characters.

If you're reading those characters using a Reader, then you need to construct an InputStreamReader first, using the 2-argument constructor, which lets you pass the correct encoding (here, UTF-8) as the 2nd argument. E.g.

reader = new InputStreamReader(url.openStream(), "UTF-8");
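A slightly fuller sketch of the reading side follows. The class and method names are illustrative only, and the charset is passed in by the caller rather than detected from the page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class PageReader {
    // Reads the whole stream line by line, decoding bytes with the given charset.
    // Line terminators are normalized to '\n'.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }
}
```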

If you're, for example, writing those characters to a file, then you need to construct an OutputStreamWriter using the 2-argument constructor, which lets you pass the correct encoding (here, UTF-8) as the 2nd argument. E.g.

writer = new OutputStreamWriter(new FileOutputStream("/page.html"), "UTF-8");
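The writing side might look like the sketch below. PageWriter and save are hypothetical names; the target path is whatever file you choose.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

public class PageWriter {
    // Writes the text to the target file, encoding characters explicitly as UTF-8
    // instead of relying on the platform default charset.
    static void save(String text, Path target) throws IOException {
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(target.toFile()), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }
}
```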

If you're, for example, writing it all plain vanilla to stdout (e.g. System.out.println(line) and so on), then you need to ensure that stdout itself is using the correct encoding (here, UTF-8). In an IDE such as Eclipse you can configure it via Window > Preferences > General > Workspace > Encoding.
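Outside an IDE, one option is to wrap stdout in a PrintStream with an explicit encoding, as in this sketch. Utf8Console and utf8PrintStream are made-up names, and the terminal itself still has to interpret the bytes as UTF-8 for the characters to display correctly.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    // Wraps any output stream in a PrintStream that encodes characters as UTF-8
    // and flushes automatically on println.
    static PrintStream utf8PrintStream(OutputStream out) throws UnsupportedEncodingException {
        return new PrintStream(out, true, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        PrintStream out = utf8PrintStream(System.out);
        // "üöşç" written with Unicode escapes so the source file encoding does not matter.
        out.println("\u00fc\u00f6\u015f\u00e7");
    }
}
```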

BalusC