This input text often has characters which are incorrect for the output encoding, things like "smart quotes", which come from a document in Windows-1252 encoding
“Smart quotes” (bytes 147 and 148 in cp1252) are perfectly valid Unicode characters, U+201C and U+201D. Your application should be capable of handling them seamlessly; if not, you're doing something wrong and most likely all non-ASCII characters will fail.
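For illustration (Python here, purely because it makes the bytes easy to see), the cp1252 bytes decode to ordinary Unicode characters that encode to UTF-8 without any drama:

    # Windows-1252 bytes 147 and 148 are the curly double quotes.
    raw = bytes([147, 148])               # b'\x93\x94', as pasted from a Word document
    text = raw.decode('cp1252')           # the characters U+201C and U+201D
    print([hex(ord(c)) for c in text])    # ['0x201c', '0x201d']
    print(text.encode('utf-8'))           # b'\xe2\x80\x9c\xe2\x80\x9d' - valid UTF-8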
Regardless of whether the characters came from someone typing them or someone pasting them in from Word, the browser should be submitting UTF-8-encoded characters to your application, which should be storing the same UTF-8 bytes to the database.
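As a rough sketch of that pipeline on the server side (assuming Python with PyMySQL; the table, column and credentials are made up for illustration):

    import pymysql

    def store_comment(comment: str) -> None:
        """comment is the already-decoded text of a submitted form field."""
        # charset='utf8mb4' makes the client-server connection UTF-8, so the
        # same UTF-8 bytes the browser submitted end up in the table.
        conn = pymysql.connect(host='localhost', user='app', password='secret',
                               database='app', charset='utf8mb4')
        try:
            with conn.cursor() as cur:
                cur.execute('INSERT INTO comments (body) VALUES (%s)', (comment,))
            conn.commit()
        finally:
            conn.close()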
If the browser is not submitting in UTF-8, chances are you're failing to set the charset of the HTML page containing the form. This can be done using the:
Content-Type: text/html;charset=utf-8
HTTP header and/or the:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
element in <head>.
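If you're wondering where the header actually gets set, here's a minimal sketch as a bare WSGI app (any framework will have its own way of setting response headers; the form fields here are made up):

    FORM_PAGE = (
        '<!DOCTYPE html>'
        '<html><head>'
        '<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />'
        '</head><body>'
        '<form method="post" action="/submit">'
        '<textarea name="comment"></textarea><input type="submit" />'
        '</form></body></html>'
    ).encode('utf-8')

    def application(environ, start_response):
        start_response('200 OK', [
            ('Content-Type', 'text/html;charset=utf-8'),   # the HTTP header above
            ('Content-Length', str(len(FORM_PAGE))),
        ])
        return [FORM_PAGE]

    if __name__ == '__main__':
        from wsgiref.simple_server import make_server
        make_server('', 8000, application).serve_forever()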
Can I just set the accept-charset attribute on the form and have the browser do it for me?
No, accept-charset is basically useless thanks to IE, which misinterprets it to mean “try using this charset if the one on the page can't encode the characters we want”, instead of “always use this charset”. This means if you use accept-charset you can end up with a mixture of encodings submitted at once, with no way to figure out which is which. Nice!
how come my database is accepting these characters, which are reserved/control characters in UTF-8?
In MySQL, UTF-8 is just a character set and collation, used for comparison and ordering. It's still storing the data as bytes and doesn't really care if they're not valid UTF-8 sequences.
It's a good idea to decode and check incoming UTF-8 sequences in your app anyway, because overlong sequences, invalid in modern UTF-8, can hide a ‘<’ character that will still be recognised by older browsers (at least IE6 pre-SP2, Opera 7).
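Decoding strictly is essentially the whole check: a conforming UTF-8 decoder rejects overlong forms outright. A quick Python illustration:

    # b'\xc0\xbc' is an overlong two-byte encoding of '<' (0x3C): a naive
    # byte-level filter sees no literal '<' in it, but it's not valid UTF-8.
    smuggled = b'\xc0\xbcscript\xc0\xbe'
    try:
        smuggled.decode('utf-8')          # strict decoding refuses overlong forms
    except UnicodeDecodeError as e:
        print('rejected:', e)             # reject (or sanitise) the whole submission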
ETA:
So, I entered a string containing byte 146
No, you entered the Unicode character U+2019. The browser deals with Unicode characters, not bytes, right up until the point it has to submit the serialised form to the server. It's then that it decides how to turn the characters into bytes, and if the page is being handled as UTF-8, it will always choose UTF-8.
(If it's not UTF-8, browsers tend to cheat in a non-standards-compliant way: for all characters that can't fit in the encoding, it'll encode them to HTML character references like ‘&#8217;’. This is wrong because you now can't tell the difference between a browser-escaped ‘&’ and a real, user-typed ‘&’, and it's insidiously wrong because if you then echo the reference as unescaped HTML it looks like you're getting it right, when in fact you've just made a big old security hole.)
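To make the difference visible, here's what the same typed character can look like by the time it reaches you (Python, purely illustrative; the xmlcharrefreplace error handler mimics the browser cheat):

    ch = '\u2019'                        # U+2019, the quote the user actually typed

    # Page handled as UTF-8: the browser always submits the character's UTF-8 bytes.
    print(ch.encode('utf-8'))            # b'\xe2\x80\x99'

    # Page handled as cp1252: the raw legacy byte comes through instead.
    print(ch.encode('cp1252'))           # b'\x92'

    # A character the legacy charset can't represent at all gets the cheat:
    # it's submitted as an ASCII numeric character reference.
    print('\u2603'.encode('cp1252', errors='xmlcharrefreplace'))   # b'&#9731;'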
It went into the database as 146
Really, a ‘\x92’ byte, not ‘\xC2\x92’, ‘\xE2\x80\x99’ or ‘&#146;’?
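The reliable way to find out is to look at the stored bytes themselves rather than whatever a terminal or browser happens to render. For example (assuming PyMySQL and the made-up comments table from the sketch above), MySQL's HEX() shows them with no charset conversion on the way out:

    import pymysql

    conn = pymysql.connect(host='localhost', user='app', password='secret',
                           database='app', charset='utf8mb4')
    with conn.cursor() as cur:
        # You want to see 'E28099' here, not '92', 'C292', or
        # '26233134363B' (the ASCII bytes of '&#146;').
        cur.execute('SELECT HEX(body) FROM comments ORDER BY id DESC LIMIT 1')
        print(cur.fetchone()[0])
    conn.close()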
it came out when I produced the (UTF-8-encoded) XML, as 146. No complaints from the browser
Then it did not come out as the single byte 146. A browser will complain when given a bare ‘\x92’ in an XML file. (Not an HTML file, in which invalid UTF-8 sequences come out as a missing-character glyph.)
I suspect it is coming out as a ‘&#146;’ character reference, which is well-formed (though the character U+0092 is part of the C1 control set, so won't render as anything useful). If this is what's happening, your form page is not being picked up as UTF-8 after all, and you're suffering the browser-auto-escaping-submission problem described above.
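You can see both behaviours with a strict XML parser; a quick sketch with Python's ElementTree (which uses expat, a conforming parser):

    import xml.etree.ElementTree as ET

    # A bare 0x92 byte is not a valid UTF-8 sequence: fatal well-formedness error.
    try:
        ET.fromstring(b'<comment>\x92</comment>')
    except ET.ParseError as e:
        print('not well-formed:', e)

    # The character reference parses fine; it just yields the useless
    # C1 control character U+0092.
    elem = ET.fromstring(b'<comment>&#146;</comment>')
    print(repr(elem.text))               # '\x92'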