views:

435

answers:

3

For example, given an empty file テスト.txt, how would I make a copy called テスト.txt.copy?

My first crack at it managed to access the file and create the new filename, but the copy generated テスト.txt.copy.

Here was my first crack at it:

#!/usr/bin/env perl

use strict;
use warnings;

use English '-no_match_vars';
use File::Basename;
use Getopt::Long;

use File::Copy;
use Win32;

my (
    $output_relfilepath,
   ) = process_command_line();

open my $fh, '>', $output_relfilepath or die $!;
binmode $fh, ':utf8';
foreach my $short_basename ( glob( '*.txt') ) {

  # skip the output basename if it's in the glob
  if ( $short_basename eq $output_relfilepath ) {
    next;
  }

  my $long_basename = Win32::GetLongPathName( $short_basename );
  my $new_basename  = $long_basename . '.copy';

  print {$fh} sprintf(
                      "short_basename = (%s)\n" .
                      " long_basename = (%s)\n" .
                      "  new_basename = (%s)\n",
                      $short_basename,
                      $long_basename,
                      $new_basename,
                     );
  copy( $short_basename, $new_basename );
}

printf(
       "\n%s done! (%d seconds elapsed)\n",
       basename( $0 ),
       time() - $BASETIME,
      );

# === subroutines ===

sub process_command_line {

  # default arguments
  my %args
    = (
       output_relfilepath => 'output.txt',
      );

  GetOptions(
             'help'                 => sub { print usage(); exit },
             'output_relfilepath=s' => \$args{output_relfilepath},
            );

  return (
          $args{output_relfilepath},
         );
}

sub usage {
  my $script_name = basename $0;

  my $usage = <<END_USAGE;
======================================================================

Test script to copy files with a UTF-8 filenames to files with
different UTF-8 filenames.  This example tries to make copies of all
.txt files with versions that end in .txt.copy.

  usage: ${script_name} (<options>)

options:

  -output_relfilepath <s>   set the output relative file path to <s>.
                            this file contains the short, long, and
                            new basenames.
                            (default: 'output.txt')

----------------------------------------------------------------------

examples:

  ${script_name}

======================================================================
END_USAGE

  return $usage;
}

Here are the contents of output.txt after execution:

short_basename = (BD9A~1.TXT)
 long_basename = (テスト.txt)
  new_basename = (テスト.txt.copy)

I've tried replacing File::Copy's copy command with a system call:

my $cmd = "copy \"${short_basename}\" \"${new_basename}\"";
print `$cmd`;

and with Win32::CopyFile:

Win32::CopyFile( $short_basename, $new_basename, 'true' );

Unfortunately, I get the same result in both cases (テスト.txt.copy). For the system call, the print shows 1 file(s) copied. as expected.

Notes:

+3  A: 

You're getting the long filename using Win32, which gives you a UTF-8-encoded string.

However, you're then setting the long filename using plain copy, which uses the C stdlib IO functions. The stdlib functions use the default filesystem encoding.

On modern Linuxes that's usually UTF-8, but on Windows it (sadly) never is, because the system default code page cannot be set to UTF-8. So you'll get your UTF-8 string interpreted as a code page 1252 string on a Western European Windows install, as has happened here. (On a Japanese machine it'd get interpreted as code page 932 — like Shift-JIS — which would come out something like 繝�せ繝�.)

I've not done this in Perl, but I'd suspect the Win32::CopyFile function would be more likely to be able to handle the kind of Unicode paths returned elsewhere in the Win32 module.

bobince
Thanks for the info. I also tried both a standard windows copy (system call) and Win32::CopyFile to no avail (updated the question post with new findings). I'm just (naively?) surprised at how difficult this is :(
vlee
Oh dear. If even the `Win32` interface won't accept Unicode filenames you may be pretty much scuppered. Yes, I'm afraid the combination of native-Unicode Windows and byte-string C stdlib is very uncomfortable thanks to Windows's refusal to standardise on UTF-8 for the encoding. It's not possible to handle Unicode filenames from a stdlib-only interface like Perl's core uses. :-( It was impossible on Python too, until special support was added to use the native Windows interfaces. Sorry!
bobince
+3  A: 

This should be possible with the CopyFileW function from Win32API::File, which should be included with Strawberry. I've never messed with Unicode filenames myself, so I'm not sure of the details. You might need to use Encode to manually convert the filename to UTF-16LE (encode('UTF16-LE', $filename)).

cjm
That looks good. `CopyFileW` is certainly the underlying system call you'd need to use to do this; annoying it's not part of the `Win32` module.
bobince
A: 

I successfully duplicated your problem on my Windows machine (Win XP Simplified Chinese version) and my conclusion is that the problem is caused by the font. Choose a Truetype font rather than Raster fonts and see if everything is okay.

My experiment is this:

  1. I first changed the code page of my Windows Console from the default 936 (GBK) to 65001 (UTF-8). by typing C:>chcp 65001

  2. I wrote a scrip that contains the code: $a= "テスト"; print $a; and saved it as UTF-8.

  3. I ran the script from the Console and found "テスト" became "テスト", which is exactly the same sympton you described in your question.

  4. I changed the Console Font from Raster Fonts to Lucida Console, the console screen gave me this: "テストストトト", which is still not quite right but I assume it is getting closer to the core of the problem.

So althought I'm not 100% sure but the problem is probably caused by the font.

Hope this helps.

Mike