+2  A: 

226,128,147 is E2,80,93 in hex.

> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).  
["0 "," 1"]
Zed
You're right. Sorry for wasting everyone's time.
Justin
A: 

As to your second question, about why a dash takes 3 bytes to encode: the dash in your input isn't an ASCII hyphen (hex 2D) but a Unicode en dash (U+2013, i.e. hex 2013). Your code is receiving it in UTF-8 encoding rather than the more obvious UCS-2 encoding, and code point hex 2013 comes out as the three bytes hex E2 80 93 in UTF-8.
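To check that mapping yourself, here is a small sketch for the Erlang shell. It assumes R13 or later, where the `unicode` module is available; the variable names `A`, `B` and `C` are only illustrative. Each expression is a comparison, so it should simply print `true`.

```erlang
%% The en dash code point, encoded to UTF-8, is the three bytes E2 80 93.
> unicode:characters_to_binary([16#2013]) =:= <<16#E2, 16#80, 16#93>>.
true
%% The same bytes written in decimal, as they appear in the input list.
> <<16#E2, 16#80, 16#93>> =:= <<226, 128, 147>>.
true
%% Decoding the three UTF-8 bytes gives back the single code point.
> unicode:characters_to_list(<<226, 128, 147>>) =:= [16#2013].
true
%% By hand: split the 16 bits of 16#2013 into 4 + 6 + 6 bits and wrap them
%% in the 3-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx.
> begin
      <<A:4, B:6, C:6>> = <<16#2013:16>>,
      <<2#1110:4, A:4, 2#10:2, B:6, 2#10:2, C:6>> =:= <<16#E2, 16#80, 16#93>>
  end.
true
```

The last expression is only meant to show where the three bytes come from; in real code `unicode:characters_to_binary/1` does the same job for any code point.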

If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.

Warren Young
Whether Erlang uses wide characters or not depends on your preference. According to http://www.erlang.org/doc/man/unicode.html (added in R13), "[i]n lists, Unicode data is encoded as integers, each integer representing one character and encoded simply as the Unicode codepoint for the character". Of course, nothing stops you from putting UTF-8 data in lists, if that's appropriate for your program. Binaries can only contain bytes, usually Latin-1 or UTF-8.
legoscia
Thanks. Answer edited appropriately.
Warren Young