Hi! I have web page content encoded in ISO-8859-2. How can I convert a stream in this charset to Java's UTF-8? I'm trying the code below, but it doesn't work: it messes up some characters. Is there another way to do this?

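    // 'in' is the page's InputStream; 'stranica' is presumably a StringBuilder
    BufferedInputStream inp = new BufferedInputStream(in);
    byte[] buffer = new byte[8192];
    int len1 = 0;
    try {
        while ((len1 = inp.read(buffer)) != -1) {
            // ISO-8859-2 is one byte per character, so decoding chunk by chunk is safe
            String buff = new String(buffer, 0, len1, "ISO-8859-2");
            stranica.append(buff);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }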
A: 

Try it with an InputStreamReader and Charset:

InputStreamReader inp = new InputStreamReader(in, Charset.forName("ISO-8859-2"));
BufferedReader rd = new BufferedReader(inp);
String l;
while ((l = rd.readLine()) != null) {
   ...
}

If you get an UnsupportedCharsetException, you know what your problem is. You can also call inp.getEncoding() to check which encoding is actually being used.
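For reference, the reader-based approach can be sketched as a complete program. This is only an illustration under assumptions: the class name `Latin2Reader`, the helper `readAll`, and the two sample bytes are made up for the example and are not from the question. It reads into a `char[]` rather than line by line so that line breaks are preserved.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class Latin2Reader {
    // Hypothetical helper: decodes the whole stream with the named charset.
    // Charset.forName throws UnsupportedCharsetException for unknown names.
    static String readAll(InputStream in, String charsetName) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader rd = new BufferedReader(
                new InputStreamReader(in, Charset.forName(charsetName)))) {
            char[] buf = new char[8192];
            int n;
            while ((n = rd.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample input (an assumption): 0xE8 is 'č' and 0xB9 is 'š' in ISO-8859-2.
        byte[] sample = {(byte) 0xE8, (byte) 0xB9};
        String text = readAll(new ByteArrayInputStream(sample), "ISO-8859-2");
        // Compare against the expected Unicode code points U+010D and U+0161.
        System.out.println(text.equals("\u010D\u0161"));
    }
}
```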

king_nak
Thanks, I'll try this later today.
Levara
It seems the problem was that the encoding parameter should be "ISO8859-2" and not "ISO-8859-2".
Levara
I doubt that. `ISO-8859-2` and `ISO8859-2` are both valid names for that encoding, and Java recognizes both of them.
Alan Moore
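One way to settle which names a given runtime accepts is to ask it directly. A minimal sketch, assuming a standard JVM; the class name `CharsetNames` and the helper `canonicalName` are only illustrative, and the results may differ on Android or other runtimes.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

public class CharsetNames {
    // Returns the canonical name the runtime maps 'name' to,
    // or null if the name is unsupported or syntactically illegal.
    static String canonicalName(String name) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name).name() : null;
        } catch (IllegalCharsetNameException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The two spellings discussed in the comments above.
        for (String name : new String[] {"ISO-8859-2", "ISO8859-2"}) {
            System.out.println(name + " -> " + canonicalName(name));
        }
        // List every alias this runtime knows for the charset.
        System.out.println(Charset.forName("ISO-8859-2").aliases());
    }
}
```

If a spelling prints `null` on the target device, that runtime does not recognize it, which would explain a difference between two otherwise equivalent names.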
A: 

How to convert a stream encoded in this charset to java's UTF-8

Wrong assumption: Java uses UTF-16 internally, not UTF-8.
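To make the distinction concrete: decoding turns bytes into a (UTF-16) String, and UTF-8 only comes into play if you encode that String back to bytes. A small sketch, with the class name `Transcode` and the helper `latin2ToUtf8` invented for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {
    // Hypothetical helper: ISO-8859-2 bytes in, UTF-8 bytes out.
    static byte[] latin2ToUtf8(byte[] latin2Bytes) {
        // Step 1: decode the bytes into a String (UTF-16 in memory).
        String decoded = new String(latin2Bytes, Charset.forName("ISO-8859-2"));
        // Step 2: encode the String as UTF-8 only when bytes are needed again.
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xE8 is 'č' (U+010D) in ISO-8859-2; in UTF-8 it takes two bytes.
        byte[] utf8 = latin2ToUtf8(new byte[] {(byte) 0xE8});
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

If the text is only going to be displayed, step 2 is unnecessary: the decoded String is already what the UI needs.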

But your code actually looks correct and should work. Are you absolutely sure the webpage is in fact encoded in ISO-8859-2? Maybe its encoding is declared incorrectly.

Or perhaps the real problem is not with the reading code that you've shown, but with whatever code you use to work with the result. How and where do these "messed up characters" manifest?
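One way to narrow this down is to dump the code points of the decoded String right after reading it. If the dump shows the expected characters, decoding worked and the damage happens later; if it shows U+003F or U+FFFD, the decoding step itself is at fault. A rough sketch, where `CodePointDump` and `dump` are hypothetical names:

```java
public class CodePointDump {
    // Formats each code point of 's' as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Sample text (an assumption): a correctly decoded 'č' followed by a literal '?'.
        System.out.println(dump("\u010D?"));
    }
}
```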

Michael Borgwardt
I know that about UTF-16, but when a web page declares UTF-8 in its head (or whatever it's called), everything works perfectly. When ISO-8859-2 is declared, certain Croatian characters like (Č,ć,š,ć,đ,ž) end up being displayed as ?.
Levara
@Levara: Do those webpages look correct when you open them in a browser? If that displays '?' too, then it looks as though the webpage contents were corrupted by whatever program produced them. Nothing you do at this point can fix that.
Michael Borgwardt
Yes, they are displayed correctly in the browser. That's why I'm sure it's possible, I just don't know how to do it. :)
Levara
@Levara: Then, as I wrote, the problem is with whatever you do with the data after you have read it. *Where* are the characters displayed as '?'?
Michael Borgwardt
I'm displaying it in a TextView on Android. It works now; it seems the problem was that the encoding parameter should be "ISO8859-2" and not "ISO-8859-2". Thanks anyway.
Levara