I'm playing with bash and experimenting with UTF-8 encoding. I'm new to Unicode. The output of the following commands surprises me:

$ locale
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL=
$ printf '1\né\n12\n123\n' | egrep '^(.|...)$'
1
é
12
$ touch 1 é 12 123
$ ls | egrep '^(.|...)$'
1
123

OK. Both egrep commands keep lines that contain exactly one or exactly three characters. Their inputs are nearly identical, but the output differs for the character é. Any explanation?

More details on my environment:

$ uname -a
Darwin macbook-pro-de-admin-6.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386
$ egrep -V
egrep (GNU grep) 2.5.1

Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

A: 

Any variable-length encoding can trip up tools that are not aware of the encoding and operate on bytes rather than characters. The problem shows up when you use single-character wildcards, because the tool assumes that one byte equals one character. If you only use literal characters, it doesn't matter for UTF-8, since the structure of UTF-8 prevents a match from starting in the middle of a character (assuming proper encoding).

At least some versions of grep are supposed to be UTF-8 aware. According to http://mailman.uib.no/public/corpora/2006-December/003760.html, that includes GNU grep 2.5.1 and later, as long as an appropriate LANG is set. If you use an older version, however, or something other than GNU grep, that would likely be the cause of your problem, since é is a two-byte character in UTF-8 (0xC3 0xA9).
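If you want to check the two-byte claim yourself, here is a minimal sketch. It assumes only POSIX printf, wc, od and tr; the variable names (e_acute, byte_count, hex) are mine, purely for illustration. The bytes are written as octal escapes so the result does not depend on how your terminal or editor encodes é.

```shell
# é (U+00E9) written as its two UTF-8 bytes, 0xC3 0xA9, using octal escapes.
e_acute=$(printf '\303\251')

# wc -c counts bytes, not characters; tr strips the padding some wc versions add.
byte_count=$(printf '%s' "$e_acute" | wc -c | tr -d ' ')
echo "$byte_count"

# Hex dump of the same bytes, with spaces and newlines removed.
hex=$(printf '%s' "$e_acute" | od -An -tx1 | tr -d ' \n')
echo "$hex"
```

A tool that treats each byte as a character would need `..` rather than `.` to match this é.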

EDIT: Based on your recent comment, your grep is probably Unicode-aware, but it does not perform any sort of Unicode normalization (and I wouldn't really expect it to, to be honest).

0x65 0xCC 0x81 is an e followed by COMBINING ACUTE ACCENT (U+0301). This is effectively two characters (two code points), but it is rendered as one due to the semantics of combining characters. grep therefore counts it as two characters: one for the e and one for the accent.
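To make the difference between the two forms visible, here is a rough sketch that builds both byte sequences by hand and compares them. It again assumes only POSIX printf, wc, od and tr, and the variable names (nfc, nfd and so on) are just labels I picked for the precomposed and decomposed forms.

```shell
# Precomposed é: a single code point, U+00E9, encoded as 0xC3 0xA9.
nfc=$(printf '\303\251')
# Decomposed é: e (0x65) followed by U+0301, encoded as 0xCC 0x81.
nfd=$(printf 'e\314\201')

# Byte counts of each form (tr strips the padding some wc versions add).
nfc_len=$(printf '%s' "$nfc" | wc -c | tr -d ' ')
nfd_len=$(printf '%s' "$nfd" | wc -c | tr -d ' ')

# Hex dumps of each form, with spaces and newlines removed.
nfc_hex=$(printf '%s' "$nfc" | od -An -tx1 | tr -d ' \n')
nfd_hex=$(printf '%s' "$nfd" | od -An -tx1 | tr -d ' \n')

# The shell compares the strings as stored, with no normalization.
if [ "$nfc" = "$nfd" ]; then same=yes; else same=no; fi

echo "precomposed: $nfc_len bytes ($nfc_hex)"
echo "decomposed:  $nfd_len bytes ($nfd_hex)"
echo "equal as strings: $same"
```

Both strings render as é in a terminal, but they are different strings. You can inspect what ls really emits the same way, by piping its output through od.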

It seems likely that decomposed Unicode is how the file name is actually stored in your file system. Otherwise, you could store files that, for all intents and purposes, have the exact same name and differ only in their use of combining characters.

Michael Madsen
Well, that was fast ;o)
denis