I am using an HTML parser called HtmlCleaner to parse HTML pages. The problem is that each page has a different encoding from the others. My question:

Can I change from any character encoding to UTF-8?

+3  A: 

You cannot seamlessly "convert" from encoding X to encoding Y without knowing encoding X beforehand. If you're obtaining those HTML pages over HTTP, check the Content-Type response header to see which encoding the page uses, and then pass that encoding to your HTML parser.
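As a rough illustration of reading the encoding from the header, here is a minimal sketch that pulls the charset parameter out of a Content-Type value. The class name CharsetFromHeader, the method charsetOf, and the fallback parameter are made up for this example; how you obtain the header string depends on your HTTP client.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class CharsetFromHeader {
    // Matches charset=xyz or charset="xyz", case-insensitively.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header value, or the fallback
    // when the header is missing, has no charset, or names an unknown one.
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }
}
```

With java.net.URLConnection, for instance, the value to pass in would come from getContentType(). Which fallback to use when the header names no charset is a judgment call that this sketch leaves to the caller.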

BalusC
+1  A: 

Can I change from any character encoding to UTF-8?

Yes, you can express any Unicode character in UTF-8 encoding.
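To make that concrete, here is a small sketch of re-encoding bytes as UTF-8. It assumes the source charset is already known; the class name ToUtf8 and the method transcode are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class ToUtf8 {
    // Decodes the bytes with the (known) source charset,
    // then encodes the resulting text as UTF-8.
    public static byte[] transcode(byte[] input, Charset source) {
        String text = new String(input, source);
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, the single ISO-8859-1 byte 0xE9 (é) becomes the two UTF-8 bytes 0xC3 0xA9. If the source charset is wrong, the decoding step silently produces the wrong characters, and the UTF-8 output will be wrong as well.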

There might be a problem when changing the encoding of HTML pages: if the page contains a "charset" meta tag, for example,

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

you have to update this tag so it corresponds to the actual encoding.

mfx
He's parsing an HTML page, not generating one. Besides, this line actually **instructs** the client which encoding to use to parse the given HTML page (and that is *exactly* the information the OP doesn't know beforehand and thus needs to find in the response headers!).
BalusC
+1  A: 

Where do you get the HTML page from? If you get it from the servlet request, you can call getReader() on it and pass the result to clean(). This will use the right encoding. If you get it from an upload, pass the input stream to clean(). If you get it via an HTTP client, you need to check the Content-Type response header using getResponseCharSet().
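When there is no response header at all (a local file, say), one fallback is to look for a charset declaration in the page itself. The following is only a simplified sketch of that idea, not what HtmlCleaner or any browser does: it decodes the first 1024 bytes as ISO-8859-1, which maps every byte to a character, and searches for a meta charset. The names MetaCharsetSniffer, sniff and decode are hypothetical, and the approach will not work for UTF-16 pages.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class MetaCharsetSniffer {
    // Finds charset=... inside a <meta ...> tag, quoted or not.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Looks at the first 1024 bytes only; returns the fallback when no
    // usable declaration is found.
    public static Charset sniff(byte[] html, Charset fallback) {
        int n = Math.min(html.length, 1024);
        String head = new String(html, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name in the page.
            return fallback;
        }
    }

    // Decodes the whole page with the sniffed (or fallback) charset.
    public static String decode(byte[] html, Charset fallback) {
        return new String(html, sniff(html, fallback));
    }
}
```

The decoded String, wrapped in a java.io.StringReader, could then be handed to a parser that accepts a Reader. The declaration in the page can itself be wrong, so treat the result as a best guess.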

Arne Burmeister
But what if I read it from a URL, or from a file? What should I do?
ehab refaat