ansaurus

Question

Answer 1

+1 A:

It's HTML from a website.

Use an HTML parser, and this problem and any similar future problems will disappear.

I can recommend picking Jsoup for the job.

When you call one of the new String(byte[]) constructors that doesn't take a Charset, it uses the platform default encoding. Apparently, the default encoding on your platform is not ISO-8859-1. You should be able to get the charset name from the response headers so you can supply it to the constructor.
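As a rough illustration of supplying the charset explicitly, here is a minimal sketch. The class name, the decode helper, and the sample bytes are made up for the example; it assumes the charset name ("ISO-8859-1" here) has already been extracted from the Content-Type response header.

```java
import java.nio.charset.Charset;

public class ExplicitCharsetDecode {
    // Hypothetical helper: decodes a complete response body using the
    // charset name taken from the response headers. Charset.forName throws
    // an unchecked exception if the name is illegal or unsupported.
    static String decode(byte[] body, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // "caf" followed by byte 0xE9, which is e-acute in ISO-8859-1.
        byte[] body = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        System.out.println(decode(body, "ISO-8859-1"));
    }
}
```

This only makes sense when the whole body is already in one byte array; decoding it chunk by chunk runs into the multi-byte problem.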

But you shouldn't be using a String constructor for this anyway; the proper way is to use an InputStreamReader. If the encoding were one of the multi-byte ones like UTF-8, you could easily corrupt the data because a chunk of bytes happened to end in the middle of a character.
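A sketch of the InputStreamReader approach might look like the following. The class and method names are invented for illustration, and the ByteArrayInputStream merely stands in for whatever stream the HTTP connection provides. The reader keeps any incomplete byte sequence internally between reads, so characters are never split.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StreamDecode {
    // Hypothetical helper: reads the whole stream as text in the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            // read() returns whole chars; partial multi-byte sequences are
            // buffered inside the reader until the remaining bytes arrive.
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample text with a two-byte and a three-byte UTF-8 character.
        byte[] utf8 = "h\u00e9llo \u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8));
    }
}
```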

In any case, never, ever use a new String(byte[]) constructor or a String.getBytes() method that doesn't accept a Charset parameter. Those methods should be deprecated, and should emit ferocious warnings when anyone uses them.
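To make the point concrete, here is a small sketch of the Charset-taking overloads in both directions. The class and helper names are hypothetical, and it assumes Java 7 or later for StandardCharsets; on older versions Charset.forName("ISO-8859-1") would serve the same purpose.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRoundTrip {
    // Encode with an explicit charset instead of the platform default.
    static byte[] toLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Decode with the same explicit charset.
    static String fromLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9";
        // The same string has different byte lengths in different encodings,
        // which is why relying on the platform default is fragile.
        System.out.println(toLatin1(text).length);
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(fromLatin1(toLatin1(text)).equals(text));
    }
}
```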

Alan Moore 2010-08-08 06:46:04

Thank you. That was it!

monoceres 2010-08-08 10:15:30

+1. Nice one, didn't even know about this. So even on platforms that normally use only Unicode, there is still room for serious screwups.

Joey 2010-08-11 11:46:15


Regex and ISO-8859-1 charset in java
