ansaurus

Question

How can I substitute Unicode characters with ASCII in Perl?

Answer 1

+1 A:

Hmm, a bit tough. This seems to do it (Perl 5.10.0 on MacOS X 10.6.2):

perl -w -e "
use open ':encoding(utf8)';
use open ':std';

while (<>)
{
    s/\x{2013}/-/g;
    print;
}
"

I have not yet minimized that. See 'perldoc open' for the 'use open' pragma.

Judging from my (limited) experiments, the '-p' option doesn't recognize the 'use open' directives. You can combine the two 'use open' statements into one by using 'qw()' to quote the words:

perl -w -e "
use open qw( :encoding(utf8) :std );
while (<>)
{
    s/\x{2013}/-/g;
    print;
}
"

I don't know if '-p' not obeying 'use open' is a bug or a design feature.

Jonathan Leffler 2010-02-22 06:58:55

Yes, it would be interesting to know why -p does not work.

stephenmm 2010-02-22 16:21:08

Answer 2

+12 A:

For a generic solution, Text::Unidecode transliterates pretty much anything that's thrown at it into pure US-ASCII.

So in your case this would work:

perl -C -MText::Unidecode -n -i -e'print unidecode( $_)' unicode_text.txt

The -C is there to make sure the input is read as UTF-8.

It converts this:

l'été est arrivé à peine après aôut
¿España es un paìs muy lindo?
some special chars: » « ® ¼ ¶ – – — Ṉ
Some greek letters: β ÷ Θ ¬ the α and ω (or is it Ω?)
hiragana? みせる です
Здравствуйте
السلام عليكم

into this:

l'ete est arrive a peine apres aout
?Espana es un pais muy lindo?
some special chars: >> << (r) 1/4 P - - -- N
Some greek letters: b / Th ! the a and o (or is it O?)
hiragana? miseru desu
Zdravstvuitie
lslm `lykm

The last one shows the limits of the module, which can't infer the vowels and so can't produce as-salaamu `alaykum from the original Arabic. Still, I think it's pretty good.

mirod 2010-02-22 08:50:57

Answer 3

+1 A:

This did the trick for me:

perl -C1 -i -pe 's/–/-/g' my.dat

Note that the first dash in the substitution is the \x{2013} character itself.

Leon Timmermans 2010-02-22 12:04:07

Some explanation of the '-C1' would do wonders. The information is available at http://perldoc.perl.org/perlrun.html (-C1 means 'standard input is in UTF-8').

Jonathan Leffler 2010-02-22 16:43:12
