views:

70

answers:

3

I was working on parsing an excel file having japanese files in some of the cells.

By using Spreadsheet::ParseExcel (Ver. 0.15) (which I know is older than current version)

some of the cells with below characters

<設定B-1コース>

are appearing as in

print Dumper $oWkc->{_Value};

$VAR1 = "\x{ff1c}\x{8a2d}\x{5b9a}B-\x{ff11}\x{30b3}\x{30fc}\x{30b9}\x{ff1e}";

and

print $oWkc->{Val} . "\n";

[-0

$VAR1 = "\x{ff1c}\x{8a2d}\x{5b9a}B-\x{ff13}\x{30b3}\x{30fc}\x{30b9}\x{ff1e}"; [-0 If I want to get these values printed in actual foramat, I am setting the STDOUT File handle to ":utf8". and also my terminal to point to utf 8 encoding, (otherwise i am getting some "wide character" warning). Here I have to pick cells with B-1 or B-2 , but I am not sure what should be set inside my script so that these characters can be treated as what I am able to see them on STDOUT.

Currently I am using regular expression to convert these wide characters to corresponding ascii value. as an example if I want to match B-1 which is stored as 'B-\x{ff11}' , I will be

$oWkc->{_Value} =~ /([AB]-)(\x{ff11}|\x{ff12}|\x{ff13}/
my $lookup = $1.$2;
$lookup =~ s/\x{ff11}/1/;
$lookup =~ s/\x{ff12}/2/;
$lookup =~ s/\x{ff13}/3/;

For reference, B-1, A-2 etc these values are coming from some other source, and currently are ranging from A|B-[1-3].

What is the standard way to deal with these wide characters. I am not able to use encode/decode etc . Can any one give me some direction .. Currently though I am able to get the work done using Regex ...

+3  A: 

While I did not verify it (I am not going to install a module from March 2001), the module apparently already decodes to Perl native strings, so you do not have to do much. The straightforward way works just fine, no need to overcomplicate things by those substitutions.

use utf8;
my $val = '<設定B-1コース>';

# does it match A or B, followed by a dash, followed by a fullwidth 1,2 or 3?
$val =~ /(?:A|B)-[123]/;  # returns true/1
daxim
Thanks for replying,But my $val = '<設定B-1コース>';does not appear any where in perl, its copied and pasted from sheet opened in excel.instead the values stored in perl object are stored as either wide character codes as given in my comment 1 or those '[-0' dummy values.
awake416
For now What I am looking for is some way to get all the values which appear in ascii range, in wide character codes to be converted to corresponding ascii's so that I can use regex to match and fetch those rows for further processing in my application.
awake416
My `$val` dumps to exactly the same representation which you have written in the question. You call it wide character codes, but it's really just native Perl strings. - For replacing fullwidth digits with ASCII digits, just do `use utf8; $val =~ tr[0-9][0-9];`.
daxim
In that case, It should print $oWkc->{_Value} =~ tr[0-9][0-9]; print $oWkc->{_Value} . "\n" if ($oWkc->{_Value} =~ /B-1/);When there are cells with B-1 in them, but I am not getting any thing there.
awake416
You are victim of your lack of copy and paste. I wrote `tr[0-9][0-9];`, not `tr[0-9][0-9];`. They are different, and only the first one works as intended.
daxim
Thanks, I will give it a try and trouble you if needed :-) .. BTW thanks for your quick replies ...
awake416
+2  A: 

To deal with multi-byte characters in Spreadsheet::ParseExcel you should update to the latest version and use the FmtJapan formatter. Several bug fixes around Japanese formatting went into recent versions.

Here is an example:

#!/usr/bin/perl


use warnings;
use strict;
use Spreadsheet::ParseExcel;
use Spreadsheet::ParseExcel::FmtJapan;

my $filename  = 'Test2000J.xls';
my $parser    = Spreadsheet::ParseExcel->new();
my $formatter = Spreadsheet::ParseExcel::FmtJapan->new();
my $workbook  = $parser->parse($filename, $formatter);

if ( !defined $workbook ) {
    die "Parsing error: ", $parser->error(), ".\n";
}

# Set your output encoding.
binmode STDOUT, ':encoding(cp932)';
# Or maybe this:
#binmode STDOUT, ':utf8';


for my $worksheet ( $workbook->worksheets() ) {

    print "Worksheet name: ", $worksheet->get_name(), "\n\n";

    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    for my $row ( $row_min .. $row_max ) {
        for my $col ( $col_min .. $col_max ) {

            my $cell = $worksheet->get_cell( $row, $col );
            next unless $cell;

            print "    Row, Col    = ($row, $col)\n";
            print "    Value       = ", $cell->value(),       "\n";
            print "    Unformatted = ", $cell->unformatted(), "\n";
            print "\n";
        }
    }
}
jmcnamara
A: 

This works for me. I am able to getting result now. But what are these characters '0-9'. Can I get more details about these or may be some doc/link. Also I want to know more about how perl stores data internally.

awake416
These characters are [fullwidth digits](http://www.fileformat.info/info/unicode/block/halfwidth_and_fullwidth_forms/list.htm). You are not supposed to care how Perl stores strings internally, just use the exposed APIs, but if you really care: it is an [well-documented open secret](http://p3rl.org/UNI). If any of that information is not clear enough, then [open a new question](http://stackoverflow.com/questions/ask).
daxim
Thanks Daxim .. You are rocking ..!!
awake416