ansaurus

Question

Answer 1

+2 A:

The bytes 226,128,147 are E2,80,93 in hex, so you can match them with \x escapes:

> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).  
["0 "," 1"]

Zed 2009-09-24 16:29:23

You're right. Sorry for wasting everyone's time.

Justin 2009-09-24 17:02:20

Answer 2

A:

As to your second question, about why a dash takes 3 bytes to encode: the dash in your input isn't an ASCII hyphen (hex 2D), but a Unicode en dash (U+2013). Your code is receiving it in UTF-8 encoding, rather than the more obvious UCS-2 encoding. U+2013 comes out as the three bytes hex E2 80 93 in UTF-8.
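The arithmetic behind that conversion can be checked by hand. The following is only a sketch: the module name utf8_demo and the helper encode3/1 are made up for illustration, and the helper assumes a code point in the three-byte range (U+0800 to U+FFFF) without rejecting surrogates.

```erlang
-module(utf8_demo).
-export([encode3/1, check/0]).

%% Hypothetical helper: pack a code point in U+0800..U+FFFF into the
%% three-byte UTF-8 layout 1110xxxx 10xxxxxx 10xxxxxx.
encode3(CP) when CP >= 16#0800, CP =< 16#FFFF ->
    <<2#1110:4, (CP bsr 12):4,             %% lead byte: top 4 bits
      2#10:2, ((CP bsr 6) band 16#3F):6,   %% continuation: middle 6 bits
      2#10:2, (CP band 16#3F):6>>.         %% continuation: low 6 bits

%% The en dash U+2013 should come out as E2 80 93, the same bytes that
%% Erlang's built-in utf8 bit-syntax type produces.
check() ->
    <<16#E2, 16#80, 16#93>> = encode3(16#2013),
    <<16#E2, 16#80, 16#93>> = <<16#2013/utf8>>,
    ok.
```

Calling utf8_demo:check() returns ok if both matches succeed. In real code you would use the /utf8 segment type or the unicode module rather than a hand-rolled encoder.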

If your next question is "why UTF-8", it's because it's far easier to retrofit an old system that uses 8-bit characters and null-terminated C-style strings to handle Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.

Warren Young 2009-09-25 16:49:26

Whether Erlang uses wide characters or not depends on your preference. According to http://www.erlang.org/doc/man/unicode.html (added in R13), "[i]n lists, Unicode data is encoded as integers, each integer representing one character and encoded simply as the Unicode codepoint for the character". Of course, nothing stops you from putting UTF-8 data in lists, if that's appropriate for your program. Binaries can only contain bytes, usually Latin-1 or UTF-8.
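To make the two representations concrete, here is a small sketch using the unicode module and the unicode option of re. The module name dash_demo is made up for illustration, and the input "0 <en dash> 1" is assumed to be the string from the question.

```erlang
-module(dash_demo).
-export([check/0]).

check() ->
    %% "0 <en dash> 1" as a list of code points: the dash is one integer.
    CodePoints = [$0, $\s, 16#2013, $\s, $1],
    %% The same text as UTF-8: the dash becomes the bytes 226,128,147.
    Utf8 = unicode:characters_to_binary(CodePoints),
    <<48,32,226,128,147,32,49>> = Utf8,
    %% Decoding the UTF-8 binary gives back the code point list.
    CodePoints = unicode:characters_to_list(Utf8),
    %% With the unicode option, re treats pattern and subject as
    %% Unicode characters, so the dash can be matched as one code point.
    ["0 ", " 1"] = re:split(CodePoints, [16#2013], [unicode, {return, list}]),
    ok.
```

If the input arrives as a list of UTF-8 bytes, as in the question, it can either be matched byte-wise with \x escapes or converted first with unicode:characters_to_list(list_to_binary(Bytes)).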

legoscia 2009-09-25 18:03:52

Thanks. Answer edited appropriately.

Warren Young 2009-09-25 18:35:00

Parsing "–" with Erlang re
