We have a website that uses classic ASP.

Part of our release process substitutes values in a file, and we found a bug in it where it writes the file out as UTF-8.

This then causes our application to start spitting out garbage: apostrophes are returned as sequences of encoded characters.

If we then go and remove the BOM that marks the file as UTF-8, the text that was previously rendered as garbage is displayed correctly.

Is there something that IIS does differently when it encounters a UTF-8 file?

A: 

UTF-8 does not use BOMs; it is an annoying misfeature in some Microsoft software that puts them there. You need to find which step of your release process is putting a UTF-8-encoded BOM in your files and fix it. You should stop that even if you are using UTF-8, which these days really is the best choice.
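As a rough illustration of the kind of fix a release step might apply, here is a minimal Python sketch that removes a leading UTF-8 BOM from a file's bytes. The function names `strip_utf8_bom` and `strip_bom_from_file` are hypothetical, and the sketch assumes the release tooling can run a Python step; the real fix belongs in whatever tool writes the file in the first place.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_from_file(path: str) -> bool:
    """Rewrite the file at path without its BOM. Returns True if it changed."""
    with open(path, "rb") as f:
        original = f.read()
    cleaned = strip_utf8_bom(original)
    if cleaned == original:
        return False
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

Working on raw bytes rather than decoded text means the rest of the file is left untouched, whatever encoding it is actually in.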

But I doubt it's IIS causing the display problem. More likely the browser is guessing the charset of the final displayed page, and when it sees bytes that look like they're UTF-8 encoded it guesses the whole page is UTF-8. You should be able to stop it doing that by stating a definitive charset by using an HTTP header:

Content-Type: text/html;charset=iso-8859-1

and/or a meta element in the HTML

<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" />

Now (assuming ISO-8859-1 is actually the character set your data are in) it should display OK. However, if your file really does have a UTF-8-encoded BOM at the start, you'll now see that as ‘ï»¿’ in your page, which is what those bytes look like in ISO-8859-1. So you still need to get rid of that misBOM.
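To see where those three characters come from, a short Python sketch can decode the UTF-8 BOM bytes as ISO-8859-1. The variable names are only for illustration.

```python
import codecs

# The UTF-8 encoding of U+FEFF is the three bytes EF BB BF.
bom = codecs.BOM_UTF8

# Read as ISO-8859-1, each byte becomes its own character.
as_latin1 = bom.decode("iso-8859-1")

print(bom.hex())   # prints efbbbf
print(as_latin1)   # prints ï»¿
```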

bobince
Right, this makes sense. It was actually a bug in some code that was written specifically to handle this kind of issue. Thanks.
Derek Ekins
I must admit this answer confuses me. "UTF-8 does not use BOMs": could you elaborate? In what way is this a "misfeature"? I've never come across a problem using UTF-8 files that include this zero width space character. What problems have you encountered?
AnthonyWJones
Any bytes-based text tool (such as shells, config file loaders etc.) will immediately fall over when presented with “ï»¿” at the start of a file; it is the explicit aim of UTF-8 to be compatible with tools that know nothing about Unicode, but UTF-8+BOM breaks this. Even some Unicode-aware tools will trip over it because a BOM is only expected to be present and automatically removed by the Unicode decoding process for UTF-16. UTF-8+BOM breaks applications and there is no justification for using it in the Unicode specs; and there isn't even any benefit to it as UTF-8 has no byte order issues.
bobince
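The point in the last comment about byte-based tools can be illustrated with a small Python sketch. The `has_shebang` check is a hypothetical stand-in for any tool that inspects the first bytes of a file; it is not taken from a real shell or loader.

```python
import codecs

# A shell script that was saved with a UTF-8 BOM in front of it.
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"


def has_shebang(data: bytes) -> bool:
    """Byte-level check of the kind a Unicode-unaware tool might perform."""
    return data.startswith(b"#!")


# The BOM hides the "#!" from the byte-level check.
print(has_shebang(script))  # prints False

# Python's plain "utf-8" codec keeps the BOM as a U+FEFF character,
# while "utf-8-sig" removes it during decoding.
kept = script.decode("utf-8")
removed = script.decode("utf-8-sig")
print(kept.startswith("\ufeff"))   # prints True
print(removed.startswith("#!"))    # prints True
```

Python is an example of a Unicode-aware tool where the BOM survives decoding unless the caller explicitly asks for the BOM-stripping codec.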