I am trying to copy all files from one location to another using the `copy` function from the File::Copy module. The problem is that one file has a special character in its name, the character with code 253 (ý), and on the Unix file system that character is shown as `?`. Will `copy` or `move` handle files with such special characters in their names? If not, what would be a possible workaround?

Note: I cannot create a test file with this special character myself. On Unix the special character is replaced with `?`, and on Windows it is replaced with its encoded value, `&#253` in my case.

my $folderpath = 'the_path';
open my $IN, '<', 'path/to/infile' or die "Cannot open infile: $!";
my $total = 0;
while (<$IN>) {
    chomp;
    my $size = -s "$folderpath/$_";
    print "$_ => $size\n";
    $total += $size;
}
print "Total => $total\n";

Courtesy: RickF Answer

Any suggestion would be highly appreciated.

Reference Question : Perl File Handling Question

A: 

The following script works as expected for me:

#!/usr/bin/perl

use strict; use warnings;
use autodie;

use File::Copy qw( copy );
use File::Spec::Functions qw( catfile );

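my $fname = chr 0xfd;    # file name is the single character with code 253 (ý)

# Create an empty file with that name in the temporary directory
open my $out, '>', catfile($ENV{TEMP}, $fname);
close $out;

# Copy that file from the temporary directory to the home directory
copy catfile($ENV{TEMP}, $fname) => catfile($ENV{HOME}, $fname);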
Sinan Ünür
@Sinan: I am not able to understand the script and would appreciate it if you added some comments to it.
Rachel
@Rachel The script creates a file whose name consists solely of the single character with character code 253. Then, copies that file from my temporary directory to my home directory.
Sinan Ünür
@Sinan: The script in the question prints the size of every file, but it skips files whose names contain certain special characters. To my surprise, names with `#,@,$` are handled and the script gives me their size, but `ý` is not. I am having a hard time understanding why this is the case and would like to know your thoughts, or if you can point me to some proper reading I would really appreciate it.
Rachel
@Sinan Unur: I am trying to understand the flow of instructions. If possible, it would be really nice if you could add more comments inside the code itself so that I can better understand how it works.
Rachel
+3  A: 

As a workaround, I suggest converting all unsupported characters to supported ones. This can be done in many ways. For example, you can use URI::Escape:

use URI::Escape;
my $new_file_name = uri_escape($weird_file_name);
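To make the effect of escaping concrete, here is a minimal sketch in core Perl of roughly what `uri_escape` does by default for a byte string. The helper `escape_name` is hypothetical (it is not part of URI::Escape), and the sample name assumes the special character is stored as the single byte 253.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of URI::Escape): percent-escape every byte
# outside the "unreserved" set A-Z a-z 0-9 - . _ ~
# Assumes $name is a byte string (no characters above 255).
sub escape_name {
    my ($name) = @_;
    $name =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $name;
}

my $weird_file_name = "IBM\xFDSoftware.txt";    # \xFD is character 253
my $new_file_name   = escape_name($weird_file_name);

print "$new_file_name\n";    # prints IBM%FDSoftware.txt
```

The escaped name contains only plain ASCII, so it should display and type the same way regardless of locale.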

Update:

Here is how I was able to copy a file by its UTF-8 name. I'm on Windows. I used Win32::GetANSIPathName to get the short file name, and then the copy worked fine:

use File::Copy;
use URI::Escape;
use Win32;

use utf8; ## tell perl that source code is in UTF-8
use strict;
use warnings;

my $test_file = "IBMýSoftware.txt";
my $from_file = Win32::GetANSIPathName($test_file); ## get "short" name of file
my $to_file   = uri_escape($test_file); ## name with special characters escaped

printf("copy [%s] -> [%s]\n", $from_file, $to_file);
copy($from_file, $to_file);

After copying all the files to new names on Windows, you'll be able to work with them on Linux without problems.
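On a Unix-like system the same idea can be tried without any Win32 call. The following is only a sketch under stated assumptions: the two directories and the sample file are throwaway demo data, the special character is assumed to be stored as UTF-8 bytes, and `escape_name` is a hypothetical helper rather than a library function. It takes the names from `readdir`, which returns them as the bytes stored on disk, so nothing has to be typed or decoded by hand.

```perl
use strict;
use warnings;
use File::Copy qw( copy );
use File::Spec::Functions qw( catfile );
use File::Temp qw( tempdir );

# Demo setup (assumption): two temporary directories and one sample file
# whose name contains the UTF-8 bytes of "ý" (0xC3 0xBD).
my $src_dir = tempdir(CLEANUP => 1);
my $dst_dir = tempdir(CLEANUP => 1);
my $sample  = "IBM\xC3\xBDSoftware.txt";
open my $fh, '>', catfile($src_dir, $sample) or die "create: $!";
print {$fh} "demo";
close $fh or die "close: $!";

# Hypothetical helper: percent-escape bytes outside A-Z a-z 0-9 - . _ ~
sub escape_name {
    my ($name) = @_;
    $name =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $name;
}

# Collect the plain files in the source directory by their on-disk names.
opendir my $dh, $src_dir or die "opendir: $!";
my @names = grep { -f catfile($src_dir, $_) } readdir $dh;
closedir $dh;

my %copied;    # original name => escaped name
for my $name (@names) {
    my $safe   = escape_name($name);
    my $target = catfile($dst_dir, $safe);
    copy(catfile($src_dir, $name), $target) or die "copy $name: $!";
    $copied{$name} = $safe;
    my $size = -s $target;
    print "$name ($size bytes) -> $safe\n";
}
```

Because the names never pass through a hand-written list, a mismatch between the list's encoding and the bytes on disk cannot cause a file to be skipped.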

Here are some hints about utf-8 file opening:

Ivan Nevostruev
I tried using this in my script, but when I run it, it just excludes the files that have very weird special characters. Any thoughts?
Rachel
@Rachel I guess the script can't create files with the converted name. Can you give an example of a failing file name after the `uri_escape` function call?
Ivan Nevostruev
Rachel
@Rachel See my update
Ivan Nevostruev
@Ivan: Can you explain the `printf` and `copy` statements and how they work? I am having a hard time understanding the program flow.
Rachel
I've renamed the variables to make the code clearer. `printf` is used for debugging only. The main idea is that the "short" name of the file (obtained with `Win32::GetANSIPathName`) can be used without problems to copy or open a file with a UTF-8 name. But it's a Windows-only solution. Next, I suggest generating a new name without special characters (using `uri_escape`). After the file is copied to the new name, you can manipulate it without problems.
Ivan Nevostruev
@Ivan - Is there a need to use `Win32::GetANSIPathName`, can we approach the problem without using `Win32::GetANSIPathName`, what would be possible consequences ?
Rachel
@Rachel Yes, I think it's possible, but I had no luck in my short experiment. It should work as described in http://stackoverflow.com/questions/1742279/with-a-utf8-encoded-perl-script-can-it-open-a-filename-encoded-as-gb2312 (if you can open the file, then you'll be able to copy it).
Ivan Nevostruev
@Ivan: What is actually happening in this line of code: `my $from_file = Win32::GetANSIPathName($test_file);`? I understand that it shortens the file name, but how does it shorten it, and how certain can we be that shortening the filename will not remove some of the special characters present in it?
Rachel
Well, `Win32::GetANSIPathName` *will* remove special characters from filename. From manual: "Returns an ANSI version of FILENAME. This may be the short name if the long name cannot be represented in the system codepage. If FILENAME doesn't exist on the filesystem, or if the filesystem doesn't support short ANSI filenames, then this function will translate the Unicode name into the system codepage using replacement characters.". I guess that Windows File System (NTFS) supports multiple names for same file. This is done for backward compatibility with old DOS/Win95 programs.
Ivan Nevostruev
@Ivan: I was under the impression that we are removing special characters using `uri_escape`, right now am confused between `uri_escape` and `Win32::GetANSIPathName` from functionality point of view, any suggestions ?
Rachel
The difference is that `Win32::GetANSIPathName` is another way to access the same file (and it's Windows specific), whereas `uri_escape` just creates a new string that depends on its parameter, which can be any string. In any case, all special characters will be lost when you copy the file to the new name, no matter how you obtained the new name, using `Win32::GetANSIPathName` or `uri_escape`.
Ivan Nevostruev
So it's not a solution to your problem. It's a workaround to access a file with a "special" name and convert the name to a more usable form.
Ivan Nevostruev
Oh OK, now I get it: we can use either Win32 or uri_escape and both would do the same work; moreover, Win32 would reduce the filename and provide a shorter name in its place. Correct me if my understanding is wrong here.
Rachel
Not exactly. There are two names for the file in Windows: one is `IBMýSoftware.txt` and the other is returned by `Win32::GetANSIPathName`. Both of them can be used to manipulate the file, but due to some Windows API restrictions the first name does not work in perl. So you can use the `Win32::GetANSIPathName` name. `uri_escape` is completely different. It's just an example of how you can convert a string with "special" characters into a string without them. You can't access the file name returned by `uri_escape` before you create that file explicitly.
Ivan Nevostruev
But let's say I am using this script on a Unix system; then it should work fine, right? The file with the special character is stored in the Unix file system, and my perl script is just trying to get the size of this file, yet it is unable to do so. I am having a hard time relating Unix to Win32 here: Win32 is totally irrelevant if my file is stored on Unix, right?
Rachel
You can't use `Win32::GetANSIPathName` in Unix for sure. As you've said in OP, special symbols are converted to `?` on unix. So you'd be able to read them with new name (`IBM?Software.txt` in my example).
Ivan Nevostruev
Yes. I am able to see the txt file as IBM?Software.txt but to my surprise perl script was not able to read this file and get size of it. So using utf-8 would solve the issue ?
Rachel
Try it. And try to read using string `IBM?Software.txt` as name. One should work.
Ivan Nevostruev
+3  A: 

Character 253 is ý. I guess that on your Unix system the locale is not set, or only the most primitive fall-back locale is in effect, and that is why you see a replacement character. If I am guessing correctly, the solution is to simply set the locale to something, preferably to a UTF-8 locale since that can handle all characters, and Perl shouldn't even enter into the problem.

> cat 3761218.pl
use utf8;
use strict;
use warnings FATAL => 'all';
use autodie qw(:all);

my $file_name = '63551_106640_63551 IBMýSoftware Delivery&Fulfillment(Div-61) Data IPS 08-20-2010 v3.xlsm';
open my $h, '>', $file_name;

> perl 3761218.pl
> ls 6*
63551_106640_63551 IBMýSoftware Delivery&Fulfillment(Div-61) Data IPS 08-20-2010 v3.xlsm
> LANG=C ls 6* # temporarily cripple locale so that the problem in the question is exhibited
63551_106640_63551 IBM??Software Delivery&Fulfillment(Div-61) Data IPS 08-20-2010 v3.xlsm
> locale | head -1 # show which locale I have set
LANG=de_DE.UTF-8
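The two question marks in the `LANG=C` listing above correspond to the two bytes that ý occupies in UTF-8. As a small illustration (a sketch, not part of the session above), the core Encode module can show how the same character turns into different byte sequences depending on the encoding. A single `?`, as reported in the question, would be consistent with a one-byte encoding such as ISO-8859-1, though that is only an assumption about the asker's files.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $char = chr 0xFD;    # "ý", character 253

my $latin1 = encode('ISO-8859-1', $char);    # one byte
my $utf8   = encode('UTF-8',      $char);    # two bytes

# Print each encoded form as space-separated hex bytes.
printf "ISO-8859-1: %s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $latin1;
printf "UTF-8:      %s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $utf8;
```

This prints `FD` for ISO-8859-1 and `C3 BD` for UTF-8. A file name typed or read in one encoding will not match a name stored on disk in the other, which is one way a script can fail to find a file that `ls` clearly shows.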
daxim
Can you elaborate more on this, especially why are you using two perl files and how the locale is being set in Unix ?
Rachel
I only use one Perl file. – I do not know which Unix you use, so I am not able to give a good answer. However, you can easily get an answer to that on your own. Search the [Serverfault archives](http://serverfault.com/search) or use any general purpose Web search engine such as [Google](http://google.com) to find the documentation you need.
daxim
@daxim: I am not able to understand what is happening in `LANG=C ls 6*` and `locale | head -1` (which prints `LANG=de_DE.UTF-8`). Can you provide some comments in the code to explain this scenario? It would help me learn and understand it better.
Rachel
Okay, I added some comments. Prefixing a normal command with a variable assignment is nothing special, this is shell programming basics. Read more tutorials or a book: http://oreilly.com/catalog/9780596009656/
daxim
Hmm... thanks Daxim for the updates. So basically, just by setting the locale to UTF-8, the problem would be solved and my script would be able to get the size of the file with special characters, which it was unable to do before.
Rachel