I am writing a Python script to index a large set of Windows installers into a DB.

I would like to know how to read the metadata (Company, Product Name, Version, etc.) from EXE, MSI and ZIP files using Python running on Linux.

Software

I am using Python 2.6.5 on Ubuntu 10.04 64-bit with Django 1.2.1.

Found so far:

Windows command-line utilities that can extract EXE metadata (like filever from SysUtils), and other individual command-line tools that only work on Windows. I've tried running these through Wine, but they have problems, and it hasn't been worth the work to track down the libraries and frameworks they depend on and install them in Wine/CrossOver.

Win32 modules for Python that can do some things but won't run in Linux (right?)

Secondary question:

Obviously, changing a file's metadata would change the MD5 hash of the file. Is there a general way to hash a file independently of its metadata without having to locate and parse the metadata first (for example, by skipping the first 1024 bytes)?


This is my first post here to StackOverflow. Since starting at my latest job as a new Python developer, I've been incredibly impressed with Stackoverflow and it has consistently shown up at the top of Google searches for my Python/Django queries and has high quality answers. Kudos to this community.

+2  A: 

Take a look at the Hachoir binary file manipulation library: http://bitbucket.org/haypo/hachoir/wiki/Home and at hachoir-metadata, an example program that uses the library to parse metadata: http://pypi.python.org/pypi/hachoir-metadata/1.3.3.

The library can handle these formats:

  • Archives: bzip2, gzip, zip, tar
  • Audio: MPEG audio ("MP3"), WAV, Sun/NeXT audio, Ogg/Vorbis (OGG), MIDI, AIFF, AIFC, Real audio (RA)
  • Image: BMP, CUR, EMF, ICO, GIF, JPEG, PCX, PNG, TGA, TIFF, WMF, XCF
  • Misc: Torrent
  • Program: EXE
  • Video: ASF format (WMV video), AVI, Matroska (MKV), Quicktime (MOV), Ogg/Theora, Real media (RM)

Additionally, Hachoir can do some file manipulation operations, which I assume include some primitive metadata manipulation.

SimpleCoder
+1  A: 

To answer one of your questions, you can use the zipfile module, specifically the ZipInfo object to get the metadata for zip files.
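As a rough sketch of what that looks like, the snippet below collects a few ZipInfo fields for every member of an archive. The function name list_zip_metadata and the dictionary keys are just illustrative choices, and using ZipFile in a with statement assumes Python 2.7 or later (on 2.6 you would call close() yourself).

```python
import zipfile


def list_zip_metadata(path):
    """Return one dict of ZipInfo fields per archive member (illustrative helper)."""
    rows = []
    with zipfile.ZipFile(path) as archive:
        for info in archive.infolist():
            rows.append({
                "name": info.filename,              # member path inside the archive
                "modified": info.date_time,         # (year, month, day, hour, minute, second)
                "size": info.file_size,             # uncompressed size in bytes
                "compressed_size": info.compress_size,
                "crc": info.CRC,                    # CRC-32 of the uncompressed data
                "comment": info.comment,            # per-member comment, usually empty
            })
    return rows
```

Note that none of these fields carry a company name or product version; they only describe the archive members themselves.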

As for hashing only the data of the file, you can only do that if you know which parts are data and which are metadata. There can be no general method, because different file formats store their metadata in different ways.

Fabian
+1  A: 

To answer your second question: no, there is no way to hash a PE file or ZIP file, ignoring the metadata, without locating and reading the metadata. This is because the metadata you're interested in is stored at variable locations in the file.

In the case of PE files (EXE, DLL, etc), it's stored in a resource block, typically towards the end of the file, and a series of pointers and tables at the start of the file gives the location.
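To make the "pointers and tables at the start of the file" concrete, here is a minimal sketch that follows them as far as the resource entry of the PE data directory, using only the struct module. The function name find_resource_directory is hypothetical, and the offsets are the standard PE/COFF header layout as I understand it. The value it returns is a relative virtual address (RVA), not a file offset; turning it into a file offset would additionally require walking the section table, which this sketch leaves out.

```python
import struct


def find_resource_directory(data):
    """Return (rva, size) of the resource table in a PE image, or None if absent.

    `data` is the raw bytes of the file; only the headers are inspected.
    """
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")

    # The DOS header stores the offset of the PE signature at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")

    # The optional header follows the 4-byte signature and 20-byte COFF header.
    opt_offset = pe_offset + 4 + 20
    (magic,) = struct.unpack_from("<H", data, opt_offset)
    if magic == 0x10B:        # PE32
        dirs_offset = opt_offset + 96
    elif magic == 0x20B:      # PE32+ (64-bit)
        dirs_offset = opt_offset + 112
    else:
        raise ValueError("unknown optional header magic: 0x%x" % magic)

    # NumberOfRvaAndSizes sits immediately before the data directory array.
    (count,) = struct.unpack_from("<I", data, dirs_offset - 4)
    if count <= 2:
        return None           # no resource entry present

    # Each entry is 8 bytes (RVA, size); the resource table is entry index 2.
    rva, size = struct.unpack_from("<II", data, dirs_offset + 2 * 8)
    if rva == 0 and size == 0:
        return None
    return rva, size
```

Because the location comes out of the headers rather than being fixed, skipping a constant number of bytes cannot reliably exclude the version information.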

In the case of ZIP files, it's scattered throughout the archive -- each included file is preceded by its own metadata, and then there's a table at the end giving the locations of each metadata block. But it sounds like you might actually be wanting to read the files within the ZIP archive and look for EXEs in there if you're after program metadata; the ZIP archive itself does not store company names or version numbers.
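For ZIP archives specifically, one way around the scattered metadata is to let the zipfile module locate it and hash only the decompressed member contents. The sketch below is one possible approach, not a standard technique: the function name zip_content_hash is made up for illustration, it assumes Python 3, and it deliberately treats member names as part of the content while ignoring timestamps, comments and compression settings.

```python
import hashlib
import zipfile


def zip_content_hash(path, chunk_size=65536):
    """MD5 over member names and decompressed contents, in sorted name order."""
    digest = hashlib.md5()
    with zipfile.ZipFile(path) as archive:
        for name in sorted(archive.namelist()):
            # Include the name so that renaming a member changes the hash.
            digest.update(name.encode("utf-8"))
            digest.update(b"\0")
            # Stream the decompressed data so large members are not loaded whole.
            with archive.open(name) as member:
                while True:
                    chunk = member.read(chunk_size)
                    if not chunk:
                        break
                    digest.update(chunk)
    return digest.hexdigest()
```

Two archives holding the same files with different timestamps or archive comments should then produce the same digest, whereas hashing the raw .zip bytes would not.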

Porculus