I can do it in vim like so:
:%s/\%u2013/-/g
How do I do the equivalent in Perl? I thought this would do it but it doesn't seem to be working:
perl -i -pe 's/\x{2013}/-/g' my.dat
Hmm, a bit tough. This seems to do it (Perl 5.10.0 on MacOS X 10.6.2):
perl -w -e "
use open ':encoding(utf8)';
use open ':std';
while (<>)
{
s/\x{2013}/-/g;
print;
}
"
I have not yet minimized that. See perldoc on the 'use open' statement.
Judging from my (limited) experiments, the '-p' option doesn't recognize the 'use open' directives. You can use 'qw()' to quote the words:
perl -w -e "
use open qw( :encoding(utf8) :std );
while (<>)
{
s/\x{2013}/-/g;
print;
}
"
I don't know if '-p' not obeying 'use open' is a bug or a design feature.
For a generic solution, Text::Unidecode transliterates pretty much anything that's thrown at it into pure US-ASCII.
So in your case this would work:
perl -C -MText::Unidecode -n -i -e'print unidecode( $_)' unicode_text.txt
The -C switch is there to make sure the input is read as UTF-8.
It converts this:
l'été est arrivé à peine après aôut
¿España es un paìs muy lindo?
some special chars: » « ® ¼ ¶ – – — Ṉ
Some greek letters: β ÷ Θ ¬ the α and ω (or is it Ω?)
hiragana? みせる です
Здравствуйте
السلام عليكم
into this:
l'ete est arrive a peine apres aout
?Espana es un pais muy lindo?
some special chars: >> << (r) 1/4 P - - -- N
Some greek letters: b / Th ! the a and o (or is it O?)
hiragana? miseru desu
Zdravstvuitie
lslm `lykm
The last one shows the limits of the module, which can't infer the vowels and get as-salaamu `alaykum from the original Arabic. It's still pretty good, I think.
This did the trick for me:
perl -C1 -i -pe 's/–/-/g' my.dat
Note that the first dash in the pattern is the \x{2013} character itself (an en dash), not an ASCII hyphen.