I have a site that uses ISO-8859-1 encoding, and I can't change that. I want to be sure that the content I enter into the web app on the site gets parsed correctly. The parser works character by character. I can't change the parser either; I am just writing files for it to handle. The content I am telling the app to display after parsing contains Unicode characters (or at least I assume so, even if they were produced by Windows Alt codes mapped to CP437). Using entities is not an option because of the parser's character-by-character operation. The only characters the parser escapes on output are markup-sensitive ones: ampersand, less-than, and greater-than. I would just put this through to see what it looks like, but output can only be seen after publishing, which takes a couple of days of approval, and that would be asking too much for a test case.

Long story short: if I tell the app to output ▼ÇÑ¥☺☻ on a page whose meta tag declares ISO-8859-1, will a browser auto-detect the UTF-8 and display it correctly, or will it decode the bytes literally as ISO-8859-1 and show a different set of characters?

UPDATE: I made a temporary test site at http://doorstop.csh.rit.edu/home/testing where I made the test file in Notepad++ using UTF-8 with no BOM but used a meta tag that set the encoding to ISO-8859-1.

+1  A: 

If you send UTF-8 to something told to expect ISO-8859-1, then yes, you'll be getting Mojibake :(

Consider that a UTF-8 sequence is introduced simply with an 8-bit char with the high bit set (i.e. a char value > 127). How is something expecting a simple 8 bit character encoding going to decide that a particular sequence should be interpreted as UTF-8 and not the encoding it was told to use?
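To make that concrete, here is a minimal Python sketch using the string from the question. The variable names are mine, and Python's `latin-1` codec is the strict ISO-8859-1 table; a browser may map a few of these bytes differently, so treat the exact glyphs as illustrative only.

```python
# The sample text from the question; every character is non-ASCII.
text = "▼ÇÑ¥☺☻"

# What ends up in a file saved as UTF-8 without a BOM.
utf8_bytes = text.encode("utf-8")

# Every byte of a multi-byte UTF-8 sequence has the high bit set.
all_high = all(b >= 0x80 for b in utf8_bytes)

# A single-byte decoder accepts any byte value, so nothing signals an error:
# each byte simply becomes one (wrong) character.
misread = utf8_bytes.decode("latin-1")

print(len(text), "characters became", len(misread), "after misdecoding")
print("all bytes >= 0x80:", all_high)
```

The decode step never fails, which is the point: a reader told to expect ISO-8859-1 has no error to react to, only different characters.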

Paul Dixon
Because you can supposedly detect a valid UTF-8 byte sequence, and the probability that such a sequence was actually meant to be ISO-8859-1, as the site declares, is very low. See http://en.wikipedia.org/wiki/UTF-8#Advantages, citations #19 and #20. Since browsers often render HTML based on the context of the content rather than strictly by the HTML spec, I wondered whether a browser would render UTF-8 when the content looks like UTF-8, even though by the spec it should use ISO-8859-1.
grg-n-sox
A: 

The only characters that the parser escapes upon output are markup sensitive ones like ampersand, less than, and greater than symbols.

Anything outside ISO-8859-1 is likely to cause problems. HTML encoded as ISO-8859-1 can display characters like ▼☺☻, but only by escaping them as numeric character references such as &#9660;&#9786;&#9787;. Otherwise, they're simply outside the range of the encoding.

The characters ÇÑ¥ are supported by ISO-8859-1 and should not cause a problem in a correctly implemented system.
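A quick way to check which characters fit is to try encoding them. The sketch below is only an illustration in Python; `fits_latin1` is a hypothetical helper name, not something from the asker's parser. It also shows what the escaped form would look like, though the question says entities are not usable with this particular parser.

```python
def fits_latin1(ch: str) -> bool:
    """Return True if the character has a code point in ISO-8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

text = "▼ÇÑ¥☺☻"

supported = [ch for ch in text if fits_latin1(ch)]
unsupported = [ch for ch in text if not fits_latin1(ch)]

# Encode to ISO-8859-1, replacing anything unrepresentable with a
# decimal numeric character reference.
escaped = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print("supported:  ", supported)
print("unsupported:", unsupported)
print("escaped:    ", escaped)
```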

Whether the parser could be used to parse the file correctly prior to display depends on its implementation and whether it and its web container respect any encoding metadata you might be able to send it.

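Unicode is a character set supported by multiple encodings. For example, U+263A ☺ encoded as UTF-8 becomes the bytes e2 98 ba, which would be decoded as three separate characters if treated as ISO-8859-1: â, the control character 0x98, and º. Browsers generally handle ISO-8859-1 as Windows-1252, where the result shows up as â˜º.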
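The same round trip can be reproduced in a few lines of Python. This is a sketch with my own variable names; it decodes with both the strict `iso-8859-1` codec and `cp1252`, on the assumption that a browser applies Windows-1252 rules to pages labelled ISO-8859-1.

```python
smiley = "\u263a"  # WHITE SMILING FACE

# UTF-8 encodes this code point as three bytes.
raw = smiley.encode("utf-8")
print(raw.hex())  # e298ba

# Strict ISO-8859-1: byte 0x98 maps to an invisible C1 control character.
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252: byte 0x98 maps to U+02DC SMALL TILDE instead.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

Either way, one character turns into three, and none of them is the smiley.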

McDowell