I need a generic transliteration or substitution regex that will map extended Latin characters to similar-looking ASCII characters, and all other extended characters to '' (the empty string) so that...

  • é becomes e

  • ê becomes e

  • á becomes a

  • ç becomes c

  • Ď becomes D

and so on, but things like ‡ or Ω or ‰ just get stripped away.

+2  A: 

Maybe a CPAN module can be of help?

Text::Unidecode looks promising, though it does not strip ‡, Ω or ‰. Rather, these are replaced by ++, O and %o. This might or might not be what you want.

Text::Unaccent is another candidate, but only for the part about getting rid of the accents.
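
A rough sketch of how Text::Unidecode could be tried (the sample string here is only an illustration; unidecode() is the function the module exports):

use utf8;
use Text::Unidecode;

# unidecode() maps a decoded Unicode string to its closest ASCII approximation
my $ascii = unidecode("Eugène Legout-Gérard");   # "Eugene Legout-Gerard"

# Note: symbols such as ‡, Ω or ‰ come back as ASCII stand-ins (++, O, %o)
# rather than being removed, so they would still need a separate clean-up pass.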

+6  A: 

Use Unicode::Normalize to get NFD($str). In this form, all characters with diacritics are turned into a base character followed by a combining diacritic character. Then simply remove all the non-ASCII characters.
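
A minimal sketch of that approach, assuming the input is already a decoded Perl (character) string:

use utf8;
use Unicode::Normalize;

my $str = NFD("Eugène Legout-Gérard");   # è/é decompose to e + a combining accent
$str =~ s/[^\x00-\x7F]//g;               # drop the combining marks (and any other non-ASCII)
# $str now holds "Eugene Legout-Gerard"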

bobince
That looks to be a useful suite of tools. Thanks. I just tried it on my sampled data. It failed to convert the strings as expected. Here's one example: "Eugène Legout-Gérard" remains exactly the same under UTF-8 with or without the NFD transformation. I'm trying to get "Eugene Legout-Gerard".
rwired
Whether you print the normalized form or the original doesn't make a difference. The difference is, though, that you can remove everything that is not ASCII from the normalized form, which should give you what you want.
innaM
A: 

When I want to translate whole strings, not only single characters, I use this approach:

use utf8;   # the hash keys below contain literal non-ASCII characters

my %trans = (
  'é' => 'e',
  'ê' => 'e',
  'á' => 'a',
  'ç' => 'c',
  'Ď' => 'D',
  map +($_ => ''), qw(‡ Ω ‰),   # map these symbols to the empty string
);

# build an alternation of all keys, with regex metacharacters escaped
my $re = qr/${ \(join '|', map quotemeta, keys %trans) }/;

s/($re)/$trans{$1}/ge;

If you want something more complicated, you can use functions instead of string constants. With this approach you can do anything you want. But for your case, tr should be more efficient:

tr/éêáçĎ/eeacD/;
tr/‡Ω‰//d;
Hynek -Pichi- Vychodil
+2  A: 

Text::Unaccent or alternatively Text::Unaccent::PurePerl sounds like what you're asking for, at least the first half of it.

$unaccented = unac_string($charset, $string);

Removing all non-ASCII characters would be relatively simple:

s/[^\000-\177]+//g;
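
Putting the two steps together might look like the sketch below; "UTF-8" as the charset argument and the sample string are assumptions about the input:

use Text::Unaccent;

my $string = "Eug\xC3\xA8ne Legout-G\xC3\xA9rard";   # UTF-8 encoded bytes for "Eugène Legout-Gérard"
my $unaccented = unac_string("UTF-8", $string);      # strip the accents
$unaccented =~ s/[^\000-\177]+//g;                   # then drop any remaining non-ASCII
# $unaccented now holds "Eugene Legout-Gerard"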
Leon Timmermans
A: 

All brilliant answers. But none actually worked. Putting extended characters directly in the source code caused problems when working in terminal windows or various code/text editors across platforms. I was able to try out Unicode::Normalize, Text::Unidecode and Text::Unaccent, but wasn't able to get any of them to do exactly what I want.

In the end I just enumerated, by hand, all the characters I wanted transliterated for UTF-8 (which is the most frequent code page found in my input data).

I needed two extra substitutions to take care of æ and Æ, which I want mapped to two characters.

For interested parties, the final code is (the tr is a single line):

$word =~ tr/\xC0\xC1\xC2\xC3\xC4\xC5\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF
\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD8\xD9\xDA\xDB\xDC\xDD\xE0\xE1\xE2\xE3\xE4
\xE5\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF8
\xF9\xFA\xFB\xFC\xFD\xFF/AAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaceeeeiiiionoo
oooouuuuyy/;
$word =~ s/\xC6/AE/g;           # Æ -> AE
$word =~ s/\xE6/ae/g;           # æ -> ae
$word =~ s/[^\x00-\x7F]+//g;    # drop everything else that is not ASCII

Things like Ď fall outside the range handled above, but they don't occur nearly so often in my input data anyway. For non-UTF-8 input, I chose to just lose everything above 127.

rwired