views:

81

answers:

1

We all know how easy character sets are on the web, yet every time you think you got it right, a foreign charset bites you in the butt. So I'd like to trace the steps of what happens in a fictional scenario I will describe below. I'm going to try and put down my understanding as well as possible but my question is for you folks to correct any mistakes I make and fill in any BLANKs.

When reading this scenario, imagine that this is being done on a Mac by John and on Windows by Jane and add comments if one behaves differently than the other in any particular situation.

Our hero (John/Jane) starts by writing a paragraph in Microsoft Word. Word's charset is BLANK1 (CP1252?).

S/he copies the paragraph, including smart quotes (e.g. “ ”). The act of copying is done by the BLANK2 (Operating system...Windows/Mac?) which BLANK3 (detects what charset the application is using and inherits the charset?). S/he then pastes the paragraph in a text box at StackOverflow.

Let's assume StackOverflow is running on Apache/PHP and that their set up in httpd.conf does not specify AddDefaultCharset utf-8 and their php.ini sets the default_charset to ISO-8859-1.

Yet neither charset above matters, because Stack Overflow's header contains this statement META http-equiv="Content-Type" content="text/html; charset=UTF-8", so even though when you clicked on "Ask Question" you might have seen a *RESPONSE header in firebug of "Content-type text/html;" ... in fact, Firefox/IE/Opera/Other browsers BLANK4 (completely 100% ignore the server header and override it with the Meta Content-type declaration in the header? Although it must read the file before knowing the Content-type, since it doesn't have to do anything with the encoding until it displays the body, this makes no different to the browser?).

Since the Meta Content-type of the page is UTF-8, the input form will convert any characters you type into the box, into UTF-8 characters. BLANK5 (If someone can go into excruciating detail about what the browser does in this step, it would be very helpful...here's my understanding...since the operating system controls the clipboard and display of the character in the form, it inserts the character in whatever charset it was copied from. And displays it in the form as that charset...OVERRIDING the UTF-8 in this example).

Let's assume the form method=GET rather than post so we can play w/ the URL browser input.... Continuing our story, the form is submitted as UTF-8. The smart quotes which represent decimal code 147 & 148, when the browser converts them to UTF-8, it gets transformed into BLANK6 characters.

Let's assume that after submission, Stack Overflow found an error in the form, so rather than displaying the resulting question, it pops back up the input box with your question inside the form. In the php, the form variables are escaped with htmlspecialchars($var) in order for the data to be properly displayed, since this time it's the BLANK7 (browser controlling the display, rather than the operating system...therefore the quotes need to be represented as its UTF-8 equivalent or else you'd get the dreaded funny looking � question mark?)

However, if you take the smart quotes, and insert them directly in the URL bar and hit enter....the htmlspecialchars will do BLANK8, messing up the form display and inserting question marks �� since querying a URL directly will just use the encoding in the url...or even a BLANK9 (mix of encodings?) if you have more than one in there...

When the REQUEST is sent out, the browser lists acceptable charsets to the browser. The list of charsets comes from BLANK10.

Now you might think our story ends there, but it doesn't. Because StackOverflow needs to save this data to a database. Fortunately, the people running this joint are smart. So when their MySQL client connects to the database, it makes sure the client and server are talking to each other UTF-8 by issuing the SET NAMES UTF-8 command as soon as the connection is initiated. Additionally, the default character set for MySQL is set to UTF-8 and each field is set the same way.

Therefore, Stack Overflow has completely secured their website from dB injections, CSRF forgeries and XSS site scripting issues...or at least those borne from charset game playing.

*Note, this is an example, not the actual response by that page.

+1  A: 

I don't know if this "answers" your "question", but I can at least help you with what I think may be a critical misunderstanding.

You say, "Since the Meta Content-type of the page is UTF-8, the input form will convert any characters you type into the box, into UTF-8 characters." There is no such thing as a "UTF-8 character", and it isn't true or even meaningful to think of the form "converting" anything into anything when you paste it. Characters are a completely abstract concept, and there's no way of knowing (without reading the source) how a given program, including your web browser, decides to implement them. Since most important applications these days are Unicode-savvy, they probably have some internal abstraction to represent text as Unicode characters--note, that's Unicode and not UTF-8.

A piece of text, in Unicode (or in any other character set), is represented as a series of code points, integers that are uniquely assigned to characters, which are named entities in a large database, each of which has any number of properties (such as whether it's a combining mark, whether it goes right-to-left, etc.). Here's the part where the rubber meets the road: in order to represent text in a real computer, by saving it to a file, or sending it over the wire to some other computer, it has to be encoded as a series of bytes. UTF-8 is an encoding (or a "transformation format" in Unicode-speak), that represents each integer code point as a unique sequence of bytes. There are several interesting and good properties of UTF-8 in particular, but they're not relevant to understanding, in general, what's going on.

In the scenario you describe, the content-type metadata tells the browser how to interpret the bytes being sent as a sequence of characters (which are, remember, completely abstract entities, having no relationship to bytes or anything). It also tells the browser to please encode the textual values entered by the user into a form as UTF-8 on the way back to the server.

All of these remarks apply all the way up and down the chain. When a computer program is processing "text", it is doing operations on a sequence of "characters", which are abstractions representing the smallest components of written language. But when it wants to save text to a file or transmit it somewhere else, it must turn that text into a sequence of bytes.

We use Unicode because its character set is universal, and because the byte sequences it uses in its encodings (UTF-8, the UTF-16s, and UTF-32) are unambiguous.

P.S. When you see �, there are two possible causes.

1) A program was asked to write some characters using some character set (say, ISO-8859-1) that does not contain a particular character that appears in the text. So if text is represented internally as a sequence of Unicode code points, and the text editor is asked to save as ISO-8859-1, and the text contains some Japanese character, it will have to either refuse to do it, or spit out some arbitrary ISO-8859-1 byte sequence to mean "no puedo".

2) A program received a sequence of bytes that perhaps does represent text in some encoding, but it interprets those bytes using a different encoding. Some byte sequences are meaningless in that encoding, so it can either refuse to do it, or just choose some character (such as �) to represent each unintelligible byte sequence.

P.P.S. These encode/decode dances happen between applications and the clipboard in your OS of choice. Imagine the possibilities.


In answer to your comments:

It's not true that "Word uses CP1252 encoding"; it uses Unicode to represent text internally. You can verify this, trivially, by pasting some Katakana character such as サ into Word. Windows-1252 cannot represent such a character.

When you "copy" something, from any application, it's entirely up to the application to decide what to put on the clipboard. For example, when I do a copy operation in Word, I see 17 different pieces of data, each having a different format, placed into the clipboard. One of them has type CF_UNICODETEXT, which happens to be UTF-16.

Now, as for URLs... Details are found here. Before sending an HTTP request, the browser must turn a URL (which can contain any text at all) into an IRI. You convert a URL to an IRI by first encoding it as UTF-8, then representing UTF-8 bytes outside the ASCII printable range by their percent-escaped forms. So, for example, the correct encoding for http://foo.com/dir1/引き割り.html is http://foo.com/dir1/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html . (Host names follow different rules, but it's all in the linked-to resource).

Now, in my opinion, the browser ought to show plain old text in the location bar, and do all of the encoding behind the scenes. But some browsers make stupid choices, and they show you the IRI form, or some chimera of a URL and an IRI.

Jonathan Feinberg
Thanks Jonathan. Very interesting response. I still think I'm missing something though. Let's start with smart quotes which come from a Word document. Since Word uses CP1252 encoding, the open quote here: http://en.wikipedia.org/wiki/Windows-1252 is Unicode code point 201C, correct? And the cp1252 code point is 147, correct? So when I copy “ into this box, does the OS translate the Unicode code point 201C to it's internal charset to represent the image?
joedevon
Ran out of room for Question 2. Why is it that when I paste that quote “ into an input form that sends out a GET request, it gets converted into a different character on the URL. Whereas if I type that character into the URL box, it does not get translated at all? Therefore these two requests are sent to the server as different byte sequences. Which is what made me think that the form was doing something. Like submitting through a form causes a URL encoding whereas typing the URL skips URL encoding.
joedevon
I've answered your comments at the bottom of the big text, above.
Jonathan Feinberg
P.S. Clipboard Spy is here: http://www.geocities.com/SiliconValley/2060/clipspy.html
Jonathan Feinberg
I plan to revisit this thread again and reread it to make sure I got it all straight...
joedevon