ansaurus

Question

Answer 1

+3 A:

You cannot seamlessly "convert" from encoding X to encoding Y without knowing encoding X beforehand. If you're obtaining those HTML pages over HTTP, check the Content-Type response header to see which encoding the page uses, and then use that encoding in your HTML parser tool.
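As a rough sketch of the header check, the charset parameter can be pulled out of a Content-Type value such as "text/html; charset=ISO-8859-1". The class name ContentTypeCharset and the method charsetOf are made up for illustration, and the fallback charset is whatever you decide to assume when the server does not declare one.

```java
import java.nio.charset.Charset;
import java.util.Locale;

class ContentTypeCharset {
    /**
     * Extracts the charset parameter from a Content-Type header value.
     * Returns the fallback when the header is missing, carries no charset
     * parameter, or names a charset this JVM does not know.
     */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters are separated by ';', e.g. "text/html; charset=utf-8"
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

With java.net.URLConnection, the header value would come from getContentType(), which can return null, so the fallback matters in practice.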

BalusC 2010-02-16 13:07:05

Answer 2

+1 A:

Can I change from any character encoding to UTF-8?

Yes, you can express any Unicode character in UTF-8 encoding.

There might be a problem when changing the encoding of HTML pages: if the page contains a "charset" meta tag, for example,

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

you have to update this tag so it corresponds to the actual encoding.
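For what it's worth, once the source encoding is known, the conversion itself is short in Java: decode the bytes with the source charset, then encode the resulting string as UTF-8. This is only a minimal sketch; the Transcoder class and toUtf8 method are hypothetical names, and it does not touch any meta tag inside the page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcoder {
    /** Re-encodes bytes from a known source charset into UTF-8. */
    static byte[] toUtf8(byte[] input, Charset sourceCharset) {
        // Decode using the encoding the bytes were actually written in
        String text = new String(input, sourceCharset);
        // Encode the same characters as UTF-8
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

If the wrong source charset is passed in, the decode step silently produces the wrong characters, which is why the source encoding has to be known first.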

mfx 2010-02-16 15:23:44

He's parsing an HTML page, not generating an HTML page. Besides, this line actually **instructs** the client side which encoding to use to parse the given HTML page (and that is *exactly* the information the OP doesn't know beforehand and thus needs to find out from the response headers!).

BalusC 2010-02-16 15:34:41

Answer 3

+1 A:

Where do you get the HTML page from? If you get it from the servlet request, you can use getReader() on it and pass that to clean(). This will use the right encoding. If you get it from an upload, pass the input stream to clean(). If you get it via an HTTP client, you need to check the Content-Type response header using getResponseCharSet().
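When all you have is a raw input stream plus a charset you determined yourself (from a header, say), one option is to decode it explicitly instead of relying on the platform default. A minimal sketch, assuming the charset is already known; HtmlStreams and readAll are hypothetical names, not part of any parser library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class HtmlStreams {
    /** Reads the whole stream into a String, decoding with the given charset. */
    static String readAll(InputStream in, Charset charset) {
        // The explicit charset avoids falling back to the platform default
        try (Reader reader = new InputStreamReader(in, charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same approach should apply to a stream opened from a file or a URL, as long as the encoding is known from somewhere.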

Arne Burmeister 2010-02-16 15:41:48

But what if I read it from a URL, or from a file? What should I do?

ehab refaat 2010-02-20 07:11:50


java utf-8 encoding problem
