I am using an HTML parser called HtmlCleaner to parse HTML pages. The problem is that each page may use a different character encoding. My question:
Can I change from any character encoding to UTF-8?
You cannot reliably convert from encoding X to encoding Y without knowing encoding X beforehand. If you are obtaining those HTML pages over HTTP, check which charset the Content-Type response header declares, and then use that encoding in your HTML parser tool.
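As a rough illustration of reading the encoding from the response header, here is a minimal sketch that pulls the charset parameter out of a Content-Type value. The class name ContentTypeCharset, the method fromContentType, and the fallback parameter are hypothetical names chosen for this example; they are not part of HtmlCleaner or of any HTTP client.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlCleaner.
class ContentTypeCharset {
    // Matches the charset parameter, quoted or unquoted, ignoring case.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header value, or the fallback when the
    // header is missing, has no charset parameter, or names an unknown charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(fromContentType("text/html", StandardCharsets.UTF_8));
    }
}
```

The fallback is a guess by definition: when the server does not declare a charset, the caller still has to decide what to assume.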
Can I change from any character encoding to UTF-8?
Yes, you can express any Unicode character in UTF-8 encoding.
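To make that concrete, here is a small sketch of the conversion itself, assuming the source encoding is already known. The class name Transcoder and the method toUtf8 are made up for this example. It decodes the bytes to a String with the source charset and then encodes that String as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class Transcoder {
    // Decodes the input with the given source charset, then re-encodes it as UTF-8.
    // Bytes that are invalid in the source charset are replaced with U+FFFD by
    // the String constructor rather than raising an error.
    static byte[] toUtf8(byte[] input, Charset source) {
        String text = new String(input, source);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" in ISO-8859-1: the last character is the single byte 0xE9.
        byte[] latin1 = {(byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9};
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length); // prints 5, because é takes two bytes in UTF-8
    }
}
```

If the wrong source charset is passed in, the call still succeeds but the text comes out garbled, which is why the source encoding has to be determined first.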
There might be a problem when changing the encoding of HTML pages: if the page contains a "charset" meta tag, for example,
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
you have to update this tag so it corresponds to the actual encoding.
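One way to update the tag is a regular expression applied to the HTML text before it is written out as UTF-8. This is only a sketch: the class name MetaCharsetRewriter and the method declareUtf8 are invented for this example, and a regex will miss unusual markup, so editing the tag through the parsed document may be more robust.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class MetaCharsetRewriter {
    // Group 1 captures a <meta ...> tag up to and including "charset=" and an
    // optional opening quote. The charset name that follows is what gets replaced.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(<meta\\b[^>]*?charset\\s*=\\s*[\"']?)[^\\s\"'>;/]+",
            Pattern.CASE_INSENSITIVE);

    // Rewrites the declared charset of every matching meta tag to UTF-8.
    static String declareUtf8(String html) {
        return META_CHARSET.matcher(html).replaceAll("$1UTF-8");
    }

    public static void main(String[] args) {
        String tag = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">";
        System.out.println(declareUtf8(tag));
    }
}
```

The same pattern also handles the shorter HTML5 form, such as a meta tag with only a charset attribute.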
Where do you get the HTML page from? If you get it from the servlet request, you can call getReader() on it and pass the result to clean(). This will use the right encoding. If you get it from an upload, pass the input stream to clean(). If you get it via an HTTP client, you need to check the Content-Type response header using getResponseCharSet().