ansaurus

Question

Why do Perl string operations on Unicode characters add garbage to the string?

Answer 1

+7 A:

The problem is very likely down to not having

use utf8;

(or its equivalent for whatever coding system you are using) in your program. The weird replacements you have there look like problems with bytewise rather than characterwise regular expression replacement.

#!/usr/local/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8";
my $string = "été";

$string =~ s/[áàâã]/a/gi; # In the question, this line prepended an "a"
$string =~ s/[éèêë]/e/gi;
$string =~ s/[úùûü]/u/gi;

print "$string\n";

prints

ete

If you are reading input from a file or from standard input, make sure you have the stream set to utf8 or whatever is appropriate for the encoding. For STDIN use

binmode STDIN, ":utf8";

If you are reading from a file, use

open my $file, "<:utf8", "file_name" or die "Cannot open file_name: $!";

to get the encoding right. If the file is not in UTF-8, use :encoding(name) instead of :utf8, for example "<:encoding(iso-8859-1)".
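
To see what the input layer changes, here is a minimal sketch. It assumes an in-memory filehandle as a stand-in for a real UTF-8 file, and first_line is a made-up helper, not anything from the question. The same bytes are read once raw and once through a decoding layer.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for a UTF-8 file on disk: the encoded bytes of "été\n".
my $file_bytes = encode('UTF-8', "\x{E9}t\x{E9}\n");

# Hypothetical helper: read the first line using the given open mode.
sub first_line {
    my ($mode) = @_;
    open my $fh, $mode, \$file_bytes
        or die "Cannot open in-memory file: $!";
    my $line = <$fh>;
    close $fh;
    chomp $line;
    return $line;
}

my $raw     = first_line('<');                   # undecoded bytes
my $decoded = first_line('<:encoding(UTF-8)');   # decoded characters

printf "raw length:     %d\n", length $raw;      # counts bytes
printf "decoded length: %d\n", length $decoded;  # counts characters
```

Without the layer, each "é" arrives as two bytes, so a character class in a regex can match half of a letter.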

Kinopiko 2009-10-15 12:49:08

Given that Mike has 'use utf8;' in his source, the Unicode source code will be accepted just fine. That suggests that his input string is not being correctly interpreted. Bear in mind that the utf8 pragma affects only how the program's source code is read, not the program's input.

Nic Gibson 2009-10-15 14:32:39

There's no mention in the post of where the input originates from.

Kinopiko 2009-10-15 15:01:21

Stream comes from an AJAX Request. See Edit 2

Mike 2009-10-16 07:17:35

Answer 2

+2 A:

Something tells me it's because Perl doesn't know how to handle the accented characters. Looking at your regex, everything seems fine.

You might want to check for this: use utf8;

David Brunelle 2009-10-15 12:50:23

Answer 3

+4 A:

This is probably due to the fact that you're using UTF-8 strings, and it's parsing them as if they're not, or similar.

Instead of using something like [áàâã] you should probably use something like [\xE0-\xE3]

and probably use use utf8; in your code too
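
A rough sketch of the escape-based approach, assuming the input arrives as UTF-8 bytes and is decoded before matching. The ranges are the Latin-1 code points (à á â ã are E0 to E3, è é ê ë are E8 to EB, ù ú û ü are F9 to FC), and fold_accents is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper mirroring the substitutions from the question,
# written with hex escapes so the source file encoding does not matter.
sub fold_accents {
    my ($text) = @_;
    $text =~ s/[\xE0-\xE3]/a/g;    # à á â ã
    $text =~ s/[\xE8-\xEB]/e/g;    # è é ê ë
    $text =~ s/[\xF9-\xFC]/u/g;    # ù ú û ü
    return $text;
}

# Assumed input: UTF-8 bytes (for example a request body), decoded to characters.
my $input  = decode('UTF-8', "\xC3\xA9t\xC3\xA9");
my $folded = fold_accents($input);

print "$folded\n";
```

This sketch only covers the lowercase letters; uppercase letters would need their own ranges.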

Mez 2009-10-15 12:51:24

Either one or the other is enough.

Kinopiko 2009-10-15 12:57:40

but there's no harm in using both :D

Mez 2009-10-15 13:18:48

Answer 4

+1 A:

This could also be a problem with Unicode normalisation, as certain systems (I'm looking at you, OS X) represent extended Latin-1 glyphs in a specific normalised form, which can break regular expressions when you refer to a character literally instead of using a Unicode or hex escape.
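
A small sketch of the normalisation mismatch, assuming the core Unicode::Normalize module. The same visible "é" is built once as a single code point and once as "e" plus a combining accent, and a character class containing only the precomposed letter is tried against both.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $composed   = "\x{E9}";      # "é" as a single code point (NFC form)
my $decomposed = "e\x{301}";    # "e" followed by a combining acute accent (NFD form)

# A class written with the precomposed letter only sees the NFC form.
my $matches_composed   = $composed   =~ /[\x{E9}]/ ? 1 : 0;
my $matches_decomposed = $decomposed =~ /[\x{E9}]/ ? 1 : 0;

# Normalising the input to NFC first makes both spellings behave the same.
my $matches_after_nfc  = NFC($decomposed) =~ /[\x{E9}]/ ? 1 : 0;

print "composed matches:   $matches_composed\n";
print "decomposed matches: $matches_decomposed\n";
print "after NFC matches:  $matches_after_nfc\n";
```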

squeeks 2009-10-15 12:53:55

If Mike has "use utf8;" in his program, this problem will be resolved by Perl.

Kinopiko 2009-10-15 12:55:38

Answer 5

+5 A:

But did you really want to use regexes at all? Perhaps something like Text::Unidecode would be better

$ perl -Mutf8 -MText::Unidecode -E 'say unidecode("été")'
ete

oylenshpeegul 2009-10-15 13:50:16

Note the importance of the utf8 pragma there. If you have Unicode in your source, you need to tell Perl that.

brian d foy 2009-10-15 23:42:56

Answer 6

+1 A:

I suspect that what is happening is that the [áàâã] part of your regex is not actually matching characters, but matching bytes. The UTF-8 encoding of those characters would look literally like this in the regex:

[\xC3\xA1\xC3\xA0\xC3\xA2\xC3\xA3]

And so when the regex is fed, for example, 'é' (\xC3\xA9), it looks at it a byte at a time, matches the \xC3, and replaces it with an 'a'. It does this for all of the \xC3 bytes it can find. So, 'été' is turned into 'a\xA9ta\xA9'.

Then the second regex, which looks like this:

[\xC3\xA9\xC3\xA8\xC3\xAA\xC3\xAB]

comes along, and it matches the \xA9 portion, and replaces it with an 'e'. So now, 'a\xA9ta\xA9' is turned into 'aetae'.
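
The byte-level walk-through above can be reproduced with a short sketch. It assumes that encoding the strings to UTF-8 with Encode stands in for what Perl sees when the source is read as bytes, and strip_accents is a hypothetical helper, not code from the question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: apply the question's first two substitutions,
# with the character classes passed in as strings.
sub strip_accents {
    my ($text, $class_a, $class_e) = @_;
    $text =~ s/[$class_a]/a/g;
    $text =~ s/[$class_e]/e/g;
    return $text;
}

# Character strings, written with escapes so the file encoding is irrelevant.
my $word    = "\x{E9}t\x{E9}";              # "été"
my $class_a = "\x{E1}\x{E0}\x{E2}\x{E3}";   # "áàâã"
my $class_e = "\x{E9}\x{E8}\x{EA}\x{EB}";   # "éèêë"

# Byte semantics: every string is its UTF-8 encoding, two bytes per letter.
my $bytes_result = strip_accents(
    encode('UTF-8', $word),
    encode('UTF-8', $class_a),
    encode('UTF-8', $class_e),
);

# Character semantics: one element per letter.
my $chars_result = strip_accents($word, $class_a, $class_e);

print "bytes: $bytes_result\n";
print "chars: $chars_result\n";
```

The byte version yields 'aetae' and the character version yields 'ete', matching the steps described above.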

When you replace the [áàâã] with (á|à|â|ã), then that matches complete characters correctly on the first pass, but then your second regex has the original problem, and the \xC3 bytes are replaced with 'e' instead.

If this is still happening, even with use utf8;, then there may be a bug (or at least a limitation) in the perl regular expression engine. What version of perl are you using for this?

Ian Clelland 2009-10-15 17:16:09

perl -v returns: "This is perl, v5.10.0 built for i586-linux-thread-multi"

Mike 2009-10-16 07:16:31

Answer 7

A:

I'd say you shouldn't really use regular expressions here. The easiest way to achieve this (although this might be undesirable) would be to convert your input string into US ASCII. The appropriate conversion tables should know that e is the closest equivalent to é.

Another option would be to use Unicode and normalize your string into NFD. This will break up all accented letters into base letter + diacritic. Then you can just go through your string and remove all combining diacritical characters.
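
A minimal sketch of the NFD approach, assuming the core Unicode::Normalize module and an input that has already been decoded to characters. The name remove_diacritics is made up for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then drop the combining marks.
sub remove_diacritics {
    my ($text) = @_;
    my $decomposed = NFD($text);    # "é" becomes "e" + U+0301
    $decomposed =~ s/\p{Mn}//g;     # remove nonspacing (combining) marks
    return $decomposed;
}

my $plain = remove_diacritics("\x{E9}t\x{E9}");    # "été"
print "$plain\n";
```

This handles every accented letter that has a canonical decomposition, so no per-letter character classes are needed.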

Joey 2009-10-16 05:33:01
