I can do it in vim like so:

:%s/\%u2013/-/g

How do I do the equivalent in Perl? I thought this would do it but it doesn't seem to be working:

perl -i -pe 's/\x{2013}/-/g' my.dat
+1  A: 

Hmm, a bit tough. This seems to do it (Perl 5.10.0 on MacOS X 10.6.2):

perl -w -e "
use open ':encoding(utf8)';
use open ':std';

while (<>)
{
    s/\x{2013}/-/g;
    print;
}
"

I have not yet minimized that. See perldoc on the 'use open' statement.
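A likely reason the encoding layer matters: without it, Perl reads the data as raw bytes, and U+2013 arrives as the three bytes E2 80 93 rather than as one character, so \x{2013} has nothing to match. The sketch below illustrates this with a made-up sample string, using Encode in place of an I/O layer.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Made-up sample: "pages 10", an en dash as raw UTF-8 bytes, then "20".
my $bytes = "pages 10" . "\xE2\x80\x93" . "20\n";

# Undecoded: the string holds three separate bytes, so a pattern looking
# for the single character U+2013 leaves it unchanged.
(my $undecoded = $bytes) =~ s/\x{2013}/-/g;

# Decoded: the three bytes become one character and the pattern matches.
my $chars = decode('UTF-8', $bytes);
(my $fixed = $chars) =~ s/\x{2013}/-/g;

# Encode again before printing to avoid "Wide character" warnings.
print encode('UTF-8', $fixed);
```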


Judging from my (limited) experiments, the '-p' option doesn't recognize the 'use open' directives. You can use 'qw()' to quote the words:

perl -w -e "
use open qw( :encoding(utf8) :std );
while (<>)
{
    s/\x{2013}/-/g;
    print;
}
"

I don't know if '-p' not obeying 'use open' is a bug or a design feature.
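One way to sidestep the question of how '-p' interacts with 'use open' is to open the file with explicit encoding layers. This is only a minimal sketch: the helper name replace_en_dash is hypothetical, it reads the whole file into memory, and it assumes the file is valid UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): rewrite a UTF-8 file in
# place, replacing every U+2013 (EN DASH) with an ASCII hyphen.
sub replace_en_dash {
    my ($path) = @_;

    # Decode on input so the en dash is a single character.
    open my $in, '<:encoding(UTF-8)', $path
        or die "Cannot read $path: $!";
    my @lines = <$in>;
    close $in;

    # Each element is aliased, so the substitution edits @lines directly.
    s/\x{2013}/-/g for @lines;

    # Encode again on output.
    open my $out, '>:encoding(UTF-8)', $path
        or die "Cannot write $path: $!";
    print {$out} @lines;
    close $out or die "Cannot close $path: $!";

    return scalar @lines;
}

# Process each file named on the command line.
replace_en_dash($_) for @ARGV;
```

Saved as a script (say, fixdash.pl, a made-up name), it would be run as: perl fixdash.pl my.dat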

Jonathan Leffler
Yes, it would be interesting to know why -p does not work.
stephenmm
+12  A: 

For a generic solution, Text::Unidecode transliterates pretty much anything that's thrown at it into pure US-ASCII.

So in your case this would work:

perl -C -MText::Unidecode -n -i -e'print unidecode( $_)' unicode_text.txt

The -C is there to make sure the input is read as UTF-8.

It converts this:

l'été est arrivé à peine après aôut
¿España es un paìs muy lindo?
some special chars: » « ® ¼ ¶ – – — Ṉ
Some greek letters: β ÷ Θ ¬ the α and ω (or is it Ω?)
hiragana? みせる です
Здравствуйте
السلام عليكم

into this:

l'ete est arrive a peine apres aout
?Espana es un pais muy lindo?
some special chars: >> << (r) 1/4 P - - -- N
Some greek letters: b / Th ! the a and o (or is it O?)
hiragana? miseru desu
Zdravstvuitie
lslm `lykm

The last one shows the limits of the module, which can't infer the vowels and produce as-salaamu `alaykum from the original Arabic. It's still pretty good, I think.

mirod
+1  A: 

This did the trick for me:

perl -C1 -i -pe 's/–/-/g' my.dat

Note that the first bar is the \x{2013} character itself.
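A plausible explanation, offered as an assumption and not something stated in this answer: without 'use utf8', a literal dash typed in a UTF-8 terminal reaches Perl as the three bytes E2 80 93, and the file is also read as bytes, so the substitution matches byte for byte. The sketch below spells those bytes out explicitly on a made-up sample line.

```perl
use strict;
use warnings;

# Made-up sample: raw, undecoded UTF-8 bytes, as read from a file with no
# I/O layer applied.
my $line = "2001" . "\xE2\x80\x93" . "2010\n";

# Match the three-byte UTF-8 encoding of U+2013 directly.
(my $out = $line) =~ s/\xE2\x80\x93/-/g;

print $out;
```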

Leon Timmermans
Some explanation of the '-C1' would do wonders. The information is available at http://perldoc.perl.org/perlrun.html (-C1 means 'standard input is in UTF8').
Jonathan Leffler