I have an ASCII file that contains an em dash (&mdash; or &#151; in HTML). The hex value is 0x97. When we pass this file through one application it arrives as UTF-8, with the character converted to 0xC297, which is &#151; in HTML. However, when we pass the file through a different application, it converts the character to 0xE28094, or &#8212;.

What would cause these applications to convert these characters differently? Is it perhaps a code page setting?

+2  A: 

According to the HTML4 specification's character entity reference, the em dash is &mdash; (U+2014).

R. Bemrose
+2  A: 

An ASCII file cannot contain the byte 0x97, because the ASCII character set only ranges from 0x00 to 0x7F. Therefore your file is not ASCII but some other single-byte encoding. The windows-1250 encoding, for example, has the em dash at 0x97.

If an application decodes the text file using an encoding other than the one that was used to create it, any character above 0x7F may come out wrong.

In Unicode the em dash has the code point U+2014, or 8212 in decimal.

Unicode Character 'EM DASH' (U+2014)
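
To make the point above concrete, here is a minimal Python sketch. It assumes Python's built-in "ascii" and "cp1250" codecs (the latter being Python's name for windows-1250); the variable names are only illustrative.

```python
raw = b"\x97"  # the byte found in the file

# ASCII only covers 0x00 to 0x7F, so decoding 0x97 as ASCII fails.
try:
    raw.decode("ascii")
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False

# Decoded as windows-1250, the same byte is the em dash, U+2014.
em_dash = raw.decode("cp1250")
code_point = ord(em_dash)

print(is_ascii, hex(code_point), code_point)
```

The byte only becomes a character once an encoding has been chosen, which is why the file cannot be described as plain ASCII.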

In a web page that uses windows-1250 as its encoding, for example, the code &#151; will render as an em dash:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>em-dash</title>
    <meta http-equiv="content-type" content="text/html; charset=windows-1250"/>
</head>
<body>
    <div>&#151;</div>
</body>
</html>
Guffa
+5  A: 

&#151; is wrong. When you use numeric character references, the number refers to the Unicode codepoint. For numbers below 256 that is the same as the codepoint in ISO-8859-1. In 8859-1, character 151 is amongst the “C1 control codes”, and not a dash or any other visible character.

The confusion arises because character 151 is a dash in Windows code page 1252 (Western European). Many people think cp1252 is the same thing as ISO-8859-1, but in reality it's not: the characters in the C1 range (128 to 159) are different.
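
As a rough check of that claim, the following Python sketch decodes every byte value with both codecs and records where they disagree. It assumes Python's "iso-8859-1" and "cp1252" codecs; a few bytes are unassigned in Python's cp1252 table, so those are decoded with errors="replace" rather than raising.

```python
# Compare how ISO-8859-1 and cp1252 decode each possible byte value.
differing = []
for value in range(256):
    raw = bytes([value])
    latin1_char = raw.decode("iso-8859-1")  # defined for every byte
    # Unassigned cp1252 bytes become U+FFFD instead of raising an error.
    cp1252_char = raw.decode("cp1252", errors="replace")
    if latin1_char != cp1252_char:
        differing.append(value)

print(f"{differing[0]:#x}..{differing[-1]:#x}, {len(differing)} bytes differ")
```

Every differing byte falls in the 128 to 159 range, so text without those bytes decodes identically under either encoding and the mix-up can go unnoticed for a long time.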

The first application is reading your “ASCII” file* as ISO-8859-1, but actually it's probably cp1252 and you'll need a way to clue the app in about what encoding it has to expect.

(*: “ASCII” is a misnomer if there are top-bit-set characters in the file. You probably mean “ANSI”, which is really also a misnomer, but one which has stuck in the Windows world to mean “text encoded in the current system-default code page”.)

bobince
+1  A: 
  • &#151; is not an em dash; your text was mistranslated from an em dash to that value.
  • &#8212; is the HTML decimal entity for em dash. Specifically it is referencing the Unicode code point 8212 which represents an em dash.
  • Your file is not ASCII if it contains an em dash. ASCII chars only encode to decimal range 0 - 127, and em dash is not a character that can be represented by ASCII encoding. If you have em dash stored as 0x97 (151 in decimal) you probably have an ANSI text file (aka Windows Codepage 1252 (w-1252)).

Your first app...
The data started as an em dash encoded in w-1252. In w-1252 the em dash maps to the decimal value 151 (0x97 in hex, or 10010111 in binary).

At some point the em dash was handled by code that thought the bytes in your file were iso-8859-1 encoded text. When that code interpreted 0x97 as a string/char it mapped 0x97 to a character according to the iso-8859-1 encoding. In iso-8859-1 0x97 maps to the char "End of guarded area".

Next, the string, which the code thinks is the "End of guarded area" control char, was encoded as utf-8. "End of guarded area" encoded in utf-8 is the two-byte sequence: 0xC2 0x97.
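
The two steps of the first app can be reproduced in a few lines of Python. This is only a sketch of the likely sequence, not the app's actual code; it assumes the standard "iso-8859-1" codec and the unicodedata module.

```python
import unicodedata

raw = b"\x97"  # em dash as stored in the w-1252 file

# Step 1: the byte is wrongly decoded as iso-8859-1, giving U+0097.
misread = raw.decode("iso-8859-1")
category = unicodedata.category(misread)  # "Cc" means control character

# Step 2: that control character is encoded as utf-8.
utf8 = misread.encode("utf-8")

print(hex(ord(misread)), category, utf8.hex())
```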

Your second app...
The text file was correctly interpreted as w-1252, thus the 0x97 is recognized as em dash, which was correctly encoded as the em dash in utf-8: 0xE2 0x80 0x94.
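
For comparison, here is the same kind of sketch for the correct path, again assuming Python's "cp1252" codec stands in for w-1252 and is not the second app's real implementation.

```python
import unicodedata

raw = b"\x97"  # em dash as stored in the w-1252 file

# Decoding with the encoding the file was actually written in.
text = raw.decode("cp1252")
name = unicodedata.name(text)

# Encoding the real em dash (U+2014) as utf-8 takes three bytes.
utf8 = text.encode("utf-8")

print(hex(ord(text)), name, utf8.hex())
```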

What influences this behavior
Not sure if you're dealing with web apps or what, but the concept should be the same whatever it is. We had the same 0x97->0xC297 scenario in a web app where people input data into a form. I found that the charset of the web page was declared as iso-8859-1, and the browser's best way to handle the w-1252 chars was to just send them along as the iso-8859-1 bytes without alerting the user or the server. The server receives the data, thinks it's iso-8859-1, and converts it to utf-8, resulting in 0xC297.

Basically any time an app touches text it needs to be told how the text is encoded, or else it might fall back to a system default. If that happens you risk data corruption.
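
One way to avoid relying on a system default is to name the source encoding explicitly wherever the file is read. The Python sketch below illustrates that idea; the helper name convert_to_utf8 and the file names are hypothetical, and it assumes you already know the file was written as cp1252.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, source_encoding):
    """Re-encode a text file to UTF-8, naming the source encoding explicitly."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demonstration on a throwaway file containing the 0x97 byte.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "wb") as f:
        f.write(b"before \x97 after")
    convert_to_utf8(src, dst, "cp1252")
    with open(dst, "rb") as f:
        result = f.read()

print(result)
```

Passing "iso-8859-1" instead of "cp1252" here would not raise an error; it would silently produce the 0xC2 0x97 bytes, which is why the encoding has to be stated rather than guessed.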