I have a Perl script that reads data from an Excel (xls) binary file. But the client that sends us these files has started sending us XLSX format files at times. I've updated the script to be able to read those as well. However, the client sometimes likes to name the XLSX files with an .xls extension, which currently confuses the heck outta my script since it uses the file name to determine which file type it is.

An XLSX file is a zip file that contains XML stuff. Is there a simple way for my script to look at the file and tell whether it's a zip file or not? If so, I can make my script go by that instead of just the file name.

A: 

I can't say about Perl, but with the framework I use, .Net, there are a number of libraries for manipulating zip files that you could use.

Another thing that I've seen people use is the command-line version of WinZip. It gives a return value that is 0 when a file is unzipped and non-zero when there is an error.

This may not be the best way to do this, but it's a start.

Rice Flour Cookies
+2  A: 

Use File::Type:

use File::Type;

my $file = "foo.zip";
my $filetype = File::Type->new( );

if( $filetype->mime_type( $file ) eq 'application/zip' ) {
  # File is a zip archive.
  ...
}

I just tested it with a .xlsx file, and the mime_type() returned application/zip. Similarly, for a .xls file the mime_type() is application/octet-stream.

CanSpice
+6  A: 

Edit: Archive::Zip is a better solution:

   use Archive::Zip qw( :ERROR_CODES );

   # Read a Zip file
   my $somezip = Archive::Zip->new();
   unless ( $somezip->read( 'someZip.zip' ) == AZ_OK ) {
       die 'read error';
   }
weismat
+1 Always first check CPAN :)
Konerak
This doesn't work -- this uses the filename suffix to determine the file type, see http://search.cpan.org/~bingos/Archive-Extract-0.46/lib/Archive/Extract.pm . I upvoted this, but it's too late and I can't remove my vote.
Adam Rosenfield
+12  A: 

.xlsx files have the first 2 bytes as 'PK', so a simple open and examination of the first 2 characters will do.
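A minimal sketch of that check, using only core Perl and the four-byte signature "PK\003\004" mentioned in the comments. The helper name looks_like_zip is made up for illustration, and the sketch assumes the file starts with a ZIP local file header, which the ZIP format does not strictly guarantee:

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): returns 1 if the file at
# $path starts with the ZIP local file header signature, 0 otherwise.
sub looks_like_zip {
    my ($path) = @_;

    # ':raw' avoids any newline or encoding translation of the bytes.
    open my $fh, '<:raw', $path
        or die "Cannot open '$path': $!";
    my $bytes_read = read( $fh, my $magic, 4 );
    close $fh;

    # A file shorter than 4 bytes (or a read error) cannot match.
    return 0 unless defined $bytes_read && $bytes_read == 4;
    return $magic eq "PK\003\004" ? 1 : 0;
}
```

The script could call looks_like_zip($file) first and pick the XLSX or XLS reader from the result, ignoring the extension entirely.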

Bruce Armstrong
To be more specific, the first 4 bytes are `"PK\003\004"`.
cjm
While that's probably true for all .xlsx files produced by particular applications, the ZIP file format does not require that -- see http://en.wikipedia.org/wiki/Zip_file#Structure .
Adam Rosenfield
Yes! This is what I was hoping for; a quick 'n easy way to check a file, preferably without having to use yet another module. Thanks!
DaveKub
+16  A: 

Yes, it is possible by checking the file's magic number.

There are quite a few modules in Perl for checking the magic number of a file.

An example using File::LibMagic:

use strict;
use warnings;

use File::LibMagic;

my $lm       = File::LibMagic->new();
my $filename = shift @ARGV;    # path of the file to inspect
my $type     = $lm->checktype_filename($filename);

if ( $type eq 'application/zip; charset=binary' ) {
    # XLSX format
}
elsif ( $type eq 'application/vnd.ms-office; charset=binary' ) {
    # XLS format
}

Another example, using File::Type:

use strict;
use warnings;

use File::Type;

my $ft   = File::Type->new();
my $file = shift @ARGV;    # path of the file to inspect

if ( $ft->mime_type($file) eq 'application/zip' ) {
    # XLSX format
}
else {
    # probably XLS format
}
Alan Haggai Alavi
File::Type is a rather large module. Since you're only interested in one filetype, I'd probably copy the test from there. It's just checking to see if the first 4 bytes of the file are `"PK\003\004"`.
cjm
+1 for libmagic. The next release will contain many improvements for zip-derived file types, see [mailing list archive](http://mx.gw.com/pipermail/file/2010/thread.html).
daxim
+1  A: 

You can detect the xls file by checking the first bytes of the file for Excel headers.

A list of valid older Excel headers is available here (unless you know the exact version of Excel they use, check for all applicable possibilities):

http://toorcon.techpathways.com/uploads/headersig.txt


Zip headers are described here: http://en.wikipedia.org/wiki/ZIP_(file_format)#File_headers but I'm not sure if .xlsx files have the same headers.

File::Type's logic seems to be "PK\003\004" as the file header to decide on zip files... but I'm not certain if that logic would work as far as .xlsx, not having a file to test.
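As a rough sketch of checking both kinds of header at once: the code below compares the first bytes against the ZIP signature and against the OLE2 compound-document signature (D0 CF 11 E0 A1 B1 1A E1) that Excel 97-2003 .xls files start with. The name guess_excel_format is hypothetical, and older pre-OLE2 Excel formats are deliberately not covered. Note that a match only identifies the container: other ZIP-based or OLE2-based documents (.docx, .doc) would be reported the same way.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 'xlsx', 'xls', or 'unknown' based only
# on the leading bytes of the file, never on its name.
sub guess_excel_format {
    my ($path) = @_;

    my $zip_magic  = "PK\003\004";                         # ZIP local file header
    my $ole2_magic = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";   # OLE2 compound document

    open my $fh, '<:raw', $path
        or die "Cannot open '$path': $!";
    my $header = '';
    read( $fh, $header, 8 );    # short files simply yield fewer bytes
    close $fh;

    return 'xlsx' if substr( $header, 0, 4 ) eq $zip_magic;
    return 'xls'  if $header eq $ole2_magic;
    return 'unknown';
}
```

Returning 'unknown' instead of guessing lets the caller fall back to the file extension or reject the file.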

DVK
A: 
The-Evil-MacBook:~ ivucica$ file --mime-type --brief file.zip 
application/zip

Hence, probably comparing

`file --mime-type --brief $filename`

with application/zip would do the trick of detecting zips. Of course, you need to have file installed, which is quite usual on UNIX systems. I'm afraid I cannot provide a Perl example since all knowledge of Perl evaporated from my memory, and I have no examples at hand.
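For readers who want one, a rough Perl sketch of that comparison follows. It assumes the file command is on the PATH and accepts the --mime-type and --brief options shown above; mime_type_via_file is a made-up helper name. The list form of open is used so that the file name never passes through a shell:

```perl
use strict;
use warnings;

# Hypothetical helper: runs `file --mime-type --brief PATH` and returns
# its output with the trailing newline removed, or undef if file(1)
# printed nothing.
sub mime_type_via_file {
    my ($path) = @_;

    # List-form pipe open: arguments go straight to file(1), no shell.
    open my $pipe, '-|', 'file', '--mime-type', '--brief', $path
        or die "Cannot run file: $!";
    my $type = <$pipe>;
    close $pipe;

    return unless defined $type;
    chomp $type;
    return $type;
}
```

The caller would then compare the result with 'application/zip', exactly as in the shell example.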

Ivan Vučica