If there is no separator (such as white space or a colon) between the first name and the last name, how can I split the Chinese characters below?

use strict; 
use warnings; 
use Data::Dumper;  

my $fh = \*DATA;  
my $fname; # 小三; 
my $lname; # 张 ;
while(my $name = <$fh>)
{

    $name =~ ??? ;
    print "$fname\n";
    print "$lname\n";

}

__DATA__  
张小三

Desired output:

小三
张

[Update]

I am using ActivePerl 5.10.1 on Windows XP.

A: 

This splits the characters and assigns them to $fname and $lname.

my ($fname, $lname) = $name =~ m/ ( \X ) /gx;

Though I think your example and your question don't really match (the last name has two characters).

Leon Timmermans
@Leon Timmermans, I corrected my post. The last name should be one character (张), although a few special last names have two characters (欧阳, 司马, etc.). Thank you.
Nano HE
I think it's `my ($lname,$fname)` as the family name comes first. The real problem is that some last names are two characters, so it's not obvious how to split.
rjh
Actually, that splits on graphemes, which isn't the same thing as characters. A single grapheme can be multiple characters.
brian d foy
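
To illustrate the grapheme point in the comment above, here is a minimal sketch (the helper names `count_code_points` and `count_graphemes` are made up for this example). It assumes the input is already a decoded Perl string. For ordinary Han characters the two counts normally agree; they differ once combining marks are involved.

```perl
use strict;
use warnings;
use utf8;                              # this source file contains literal UTF-8 text

# Count code points: . matches exactly one (the /s flag lets it match newlines too).
sub count_code_points { my ($text) = @_; my $n = () = $text =~ /./gs; return $n }

# Count graphemes: \X matches one base character plus any combining marks after it.
sub count_graphemes   { my ($text) = @_; my $n = () = $text =~ /\X/g;  return $n }

my $accented = "e\x{301}";             # "e" + U+0301 COMBINING ACUTE ACCENT
printf "%d code points, %d grapheme(s)\n",
    count_code_points($accented), count_graphemes($accented);    # 2 and 1

my $name = "张小三";
printf "%d code points, %d grapheme(s)\n",
    count_code_points($name), count_graphemes($name);            # 3 and 3
```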
A: 

You'll need some kind of heuristic to separate the first and last names. Here's some working code that assumes that the last name (surname) is one character (the first) and all the remaining characters (at least one) belong to the first name (given name):

EDIT: Changed program to ignore invalid lines rather than dying.

use strict;
use utf8;

binmode STDOUT, ":utf8";

while (my $name = <DATA>) {
    my ($lname, $fname) = $name =~ /^(\p{Han})(\p{Han}+)$/ or next;
    print "First name: $fname\nLast name: $lname\n";
}

__DATA__  
张小三

When I run this program from the command line, I get this output:

First name: 小三
Last name: 张
Sean
@Sean, I tested your script using a UTF-8 encoded template, but got the output below: `First name: 小三``Last name: 张``Invalid name ""`
Nano HE
@Nano HE: I edited the program a bit to be more lenient in its input, and also added the output I see when I run it directly. The fact that you get six characters of gibberish for the first name (of two characters) and three characters of gibberish for the last name (of one character) leads me to suspect that the program is generating the correct UTF-8 output, but whatever environment you're using to run the program isn't interpreting the output correctly as UTF-8.
Sean
@Sean, I didn't give enough information in my original post. I run my Perl script with ActivePerl 5.10.1 on Windows XP. Did you test your script on Linux? I still fail to get the output above.
Nano HE
@Nano HE, I ran my program on OS X, inside an Emacs shell. I set the coding system for the shell's output to utf-8, and I copied the output above directly from the shell buffer.
Sean
+1  A: 

You have problems because you neglect to decode binary data into Perl strings on input and to encode Perl strings back into binary data on output. This matters because regular expressions, and their friend split, work properly on decoded Perl strings rather than on raw bytes.
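
As a rough illustration of why decoding matters, the sketch below spells out the UTF-8 bytes of 张小三 as escapes (so the encoding of the source file plays no role) and compares the string before and after `decode`. The variable names are only for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The nine UTF-8 bytes of 张小三: three bytes per character.
my $bytes = "\xE5\xBC\xA0\xE5\xB0\x8F\xE4\xB8\x89";
my $byte_count = length($bytes);

# After decoding, Perl sees three characters instead of nine bytes.
my $chars = decode('UTF-8', $bytes);
my $char_count = length($chars);

printf "bytes: %d, characters: %d\n", $byte_count, $char_count;   # 9 and 3

# On the decoded string, . matches the whole first character (张);
# on the raw bytes it would match only the single byte 0xE5.
my ($first_char) = $chars =~ /^(.)/;
my ($first_byte) = $bytes =~ /^(.)/;

# Encode again before printing, so the output is bytes once more.
print encode('UTF-8', $first_char), "\n";
```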

In the Linux and Windows versions below, splitting on (?<=.) with a limit of 2 means "split after the first character". As such, these programs will not work correctly on 复姓/compound family names; keep in mind that they are rare, but do exist. In order to always correctly split a name into family name and given name parts, you need to use a dictionary with family names.
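
A dictionary lookup along those lines might be sketched as follows. The `%compound_surname` list and the `split_name` helper are hypothetical and deliberately tiny; a real list of compound family names would be much longer. It is still a heuristic: someone whose single-character family name is 欧 and whose given name starts with 阳 would be split incorrectly.

```perl
use strict;
use warnings;
use utf8;                                   # literal UTF-8 text in this source file
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical, incomplete list of two-character family names.
my %compound_surname = map { $_ => 1 } qw(欧阳 司马);

# Takes a decoded Perl string; returns (family name, given name),
# or an empty list if the name is too short to split.
sub split_name {
    my ($full_name) = @_;
    return if length($full_name) < 2;

    # Use two characters for the family name only if they are in the
    # dictionary and at least one character remains for the given name.
    my $family_length =
        (length($full_name) > 2 && $compound_surname{ substr($full_name, 0, 2) }) ? 2 : 1;

    return (substr($full_name, 0, $family_length), substr($full_name, $family_length));
}

for my $full_name (qw(张小三 欧阳修)) {
    my ($family_name, $given_name) = split_name($full_name);
    print "family name: $family_name, given name: $given_name\n";
}
```

With the two sample names, this splits 张小三 into 张 and 小三, and 欧阳修 into 欧阳 and 修.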

Linux version:

use strict;
use warnings;
use Encode qw(decode encode);

while (my $full_name = <DATA>) {
    $full_name = decode('UTF-8', $full_name);
    chomp $full_name;
    my ($family_name, $given_name) = split(/(?<=.)/, $full_name, 2);
    print encode('UTF-8',
        sprintf("The full name is %s, the family name is %s, the given name is %s.\n", $full_name, $family_name, $given_name)
    );

}

__DATA__
张小三

Output:

The full name is 张小三, the family name is 张, the given name is 小三.

Windows version:

use strict;
use warnings;
use Encode qw(decode encode);
use Encode::HanExtra qw();

while (my $full_name = <DATA>) {
    $full_name = decode('GB18030', $full_name);
    chomp $full_name;
    my ($family_name, $given_name) = split(/(?<=.)/, $full_name, 2);
    print encode('GB18030',
        sprintf("The full name is %s, the family name is %s, the given name is %s.\n", $full_name, $family_name, $given_name)
    );

}

__DATA__
张小三

Output:

The full name is 张小三, the family name is 张, the given name is 小三.
daxim
@daxim, please refer to my other post today, which explains why I haven't been able to use Encode::HanExtra so far: `http://stackoverflow.com/questions/2726641/how-do-i-install-encodehanextra-for-activeperl` . BTW, could you please tell me why your Linux version (which only has `use Encode`) can't run on my Windows ActivePerl 5.10.1? Thank you.
Nano HE
I gave a good answer on where to get `Encode::HanExtra` for ActiveState Perl in that thread. I do not have enough room to answer your other question here in a comment. I suggest you make a new proper Stack Overflow question out of it: http://stackoverflow.com/questions/ask
daxim