I want to implement my own tweet compressor, which shortens a tweet by replacing common letter sequences with single Unicode characters. However, I'm stuck on some Unicode issues.

Here's my script:

#!/usr/bin/env perl
use warnings;
use strict;

print tweet_compress('cc ms ns ps in ls fi fl ffl ffi iv ix vi oy ii xi nj/, "\. " ,", "'),"\n";

sub tweet_compress {
    my $tweet = shift;
    $tweet =~ s/\. ?$//;
    my @orig = ( qw/cc ms ns ps in ls fi fl ffl ffi iv ix vi oy ii xi nj/, ". " ,", ");
    my @new = qw/㏄ ㎳ ㎱ ㎰ ㏌ ʪ ﬁ ﬂ ﬄ ﬃ ⅳ ⅸ ⅵ ѹ ⅱ ⅺ ǌ . ,/;
    $tweet =~ s/$orig[$_]/$new[$_]/g for 0 .. $#orig;
    return $tweet;
}

But this prints junk out at the terminal:

?.?.?.?.?.?.?.f.?.f?.?.?.?.?.?.?.nj/."\..,"."

What am I doing wrong?

+1  A: 

Tell perl you're using Unicode characters in your script with `use utf8`.

Pedro Silva
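To see what the pragma changes, here is a minimal sketch, assuming the file is saved as UTF-8 (the variable names are only illustrative). With `use utf8` the literal is read as one character; in a `no utf8` scope the same literal is read as its raw UTF-8 bytes, which is what the substitutions in the question were operating on.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;    # string literals in this UTF-8 file are decoded into characters

# With the pragma in effect, the literal is a single character (U+33C4).
my $as_chars = '㏄';

# "no utf8" is lexically scoped, so inside this block the same literal
# is read as its three raw UTF-8 bytes instead.
my $as_bytes = do { no utf8; '㏄' };

binmode STDOUT, ':encoding(UTF-8)';
printf "with utf8:    length %d\n", length $as_chars;
printf "without utf8: length %d\n", length $as_bytes;
```

The two lengths should differ (1 versus 3), which is why a pattern like `. ` can eat one byte of a multi-byte character when the pragma is missing.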
+6  A: 

Two issues.

Firstly, you have Unicode characters in your source code. Make sure you save the file as UTF-8 and add the `use utf8` pragma.

Also if you intend to run this program from a console make sure it can handle unicode. Windows command prompt cannot and will always show ? regardless of whether your data is correct or not. I ran this on Mac OS with Terminal set to handle utf8.

Secondly, if you have "." in your orig list, it'll get interpreted as "any single character" and give you wrong results - so you need to escape it before using it in your regular expression. I've modified the program a little to make it work.

#!/usr/bin/env perl
use warnings;
use strict;
use utf8; # the source file is UTF-8, so string literals are read as characters

#make sure the data is re-encoded to utf8 when output to terminal
binmode STDOUT, ':utf8';

print tweet_compress('cc ms ns ps in ls fi fl ffl ffi iv ix vi oy ii xi nj/, "\. " ,", "'),"\n";

sub tweet_compress {
    my $tweet = shift;
    $tweet =~ s/\. ?$//;
    my @orig = ( qw/cc ms ns ps in ls fi fl ffl ffi iv ix vi oy ii xi nj/, '\. ' ,", ");
    my @new = qw/㏄ ㎳ ㎱ ㎰ ㏌ ʪ ﬁ ﬂ ﬄ ﬃ ⅳ ⅸ ⅵ ѹ ⅱ ⅺ ǌ . ,/;
    $tweet =~ s/$orig[$_]/$new[$_]/g for 0 .. $#orig;
    return $tweet;
}
deepakg
You've got an error on the `binmode` line -- looks like the word "console" got left behind from an earlier edit.
Bill Odom
Thanks. I've fixed it.
deepakg
Don't manually escape stuff in `@orig`, just use `quotemeta` (or `\Q` in the `s///`) :)
hobbs
"Windows command prompt cannot and will always show ? regardless of whether your data is correct or not." - depends on the code page.
Kinopiko
I was speaking in the context of UTF-8 here, because that's how the data in this case is stored. Besides, I doubt there is a single code page that covers all the characters we have here. There is a way to coax it into showing UTF-8 using chcp and such (http://illegalargumentexception.blogspot.com/2009/04/i18n-unicode-at-windows-command-prompt.html), but it's such a slippery slope that you can never be sure whether the console or your code is to blame. Better to eliminate the console from the equation altogether.
deepakg
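One way to take the console out of the equation, sketched here under the assumption that you only want to check the data: print the code points as plain ASCII instead of the characters themselves. The helper name `codepoints` is hypothetical, not part of the original script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: render each character of a string as U+XXXX,
# so the output is pure ASCII and any console can display it.
sub codepoints {
    my ($text) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $text;
}

# If the replacement worked, this shows U+33C4 U+0020 U+2173.
print codepoints('㏄ ⅳ'), "\n";
```

If the code points are right but the terminal still shows `?`, the problem is in the display, not in the script.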
@hobbs - thanks for the quotemeta tip.
deepakg
What hobbs said. This line: `$tweet =~ s/$orig[$_]/$new[$_]/g for 0 .. $#orig;` should just be `$tweet =~ s/\Q$orig[$_]\E/$new[$_]/g for 0 .. $#orig;` and don't bother escaping any of the original strings.
singingfish
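The `\Q...\E` suggestion from hobbs and singingfish can be sketched as follows. This is only an illustration with a shortened replacement table (an assumption to keep it brief); the full lists from the question would be handled the same way. The search strings are stored as plain text and quoted at match time, so `. ` matches a literal period and space.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;

binmode STDOUT, ':encoding(UTF-8)';

# Shortened illustrative table: plain strings, no manual escaping needed.
my @orig = ( qw/cc iv/, '. ', ', ' );
my @new  = ( qw/㏄ ⅳ/, '.', ',' );

sub tweet_compress {
    my $tweet = shift;
    $tweet =~ s/\. ?$//;    # drop a trailing period
    # \Q...\E quotes regex metacharacters in the interpolated string
    $tweet =~ s/\Q$orig[$_]\E/$new[$_]/g for 0 .. $#orig;
    return $tweet;
}

print tweet_compress('accept. five, ok.'), "\n";
```

Writing the last two replacements as separate quoted strings rather than inside `qw//` also avoids the "Possible attempt to separate words with commas" warning that `use warnings` raises for a comma in a `qw` list.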