People these days create their ZIP archives with WinZip, which allows for internationalized (i.e. non-Latin: Cyrillic, Greek, Chinese, you name it) file names.

Sadly, trying to unpack such a file causes trouble: UNIX unzip creates garbage-named files and directories like "®£¤ ©¤¥èì", and Java and its jar command fail miserably on such archives.

Is there a passable way to unpack such files programmatically, on UNIX or in Java?

+2  A: 

The solution I've found: Apache commons-compress can unzip such archives just fine, if supplied with the correct fallback charset.

alamar
+2  A: 

DotNetZip supports Unicode and arbitrary encodings for filenames within zipfiles, for both reading and writing zips.

It's a .NET library. For Unix usage, you would need Mono as a prerequisite.

If the zipfile is correctly constructed by WinZip, in other words if it's compliant with the zip spec from PKWARE, then there's no special work you need to do to specify the encoding at the time you unpack it. According to the zip spec, there are two supported encodings for filenames in zipfiles: UTF-8 and IBM437. Which of the two is in use is recorded in the zip metadata, so any zip library can detect and use it. DotNetZip detects it automatically when reading a compliant zip. Like this:

using Ionic.Zip;  // DotNetZip namespace

using (var zip = ZipFile.Read("thearchive.zip"))
{
    foreach (var e in zip)
    {
        // e.FileName is the entry's name, already decoded by the library
        e.Extract("extract-directory");
    }
}
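The "zip metadata" mentioned above is bit 11 of each entry's general purpose bit flag: when it is set, the entry name is UTF-8; when it is clear, the spec's default of IBM437 applies. Since the question asks about Java, here is a minimal Java sketch that reads that bit from the first local file header of an archive. The class and method names are made up for illustration, and the sketch only looks at the first entry instead of walking the central directory.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of any library.
class ZipNameEncodingFlag {
    private static final int LOCAL_HEADER_SIGNATURE = 0x04034b50; // "PK\3\4", little-endian
    private static final int UTF8_FLAG = 0x0800;                  // general purpose bit 11

    // Returns true if the first entry of the archive declares UTF-8 file names.
    static boolean firstEntryDeclaresUtf8(Path zip) throws IOException {
        // Layout: signature (4 bytes), version needed (2 bytes), flags (2 bytes).
        byte[] h = new byte[8];
        try (DataInputStream in = new DataInputStream(Files.newInputStream(zip))) {
            in.readFully(h);
        }
        int signature = (h[0] & 0xff)
                | (h[1] & 0xff) << 8
                | (h[2] & 0xff) << 16
                | (h[3] & 0xff) << 24;
        if (signature != LOCAL_HEADER_SIGNATURE) {
            throw new IOException("file does not start with a zip local file header");
        }
        int flags = (h[6] & 0xff) | (h[7] & 0xff) << 8;
        return (flags & UTF8_FLAG) != 0;
    }
}
```

If this returns false for an archive whose names are clearly not IBM437, the archive is one of the "non compliant" kind and the reader has to be told the encoding explicitly.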

There are archive programs that produce zips that are "non compliant" w.r.t. encoding. WinRAR is one: it will create a zip that has filenames encoded in the default encoding in use on the computer. In Taipei it will use cp950, while in Iceland, something else, and in Lisbon, something else. The advantage to "non compliance" here is that Windows Explorer will open and correctly display i18n-ized filenames in such zips. In other words, "non compliance" is often what people want, because Windows doesn't (yet?) support UTF-8 zip files.

(This all has to do with the encoding used for filenames in the zipfile, not the encoding of the contents of the files in the zip.)

The zip spec doesn't allow for the specification of an arbitrary text encoding in the zip metadata. In other words, if you use cp950 when creating the zip, then your extract logic needs to "know" to use cp950 when extracting, because nothing in the zip file carries that information. In addition, of course, the zip library you use to programmatically extract must support arbitrary encodings. As far as I know, Java's built-in zip library did not before Java 7. DotNetZip does. Like so:

// Code page 950 is Big5 (Traditional Chinese).
using (ZipFile zip = ZipFile.Read(zipToExtract,
                                  System.Text.Encoding.GetEncoding(950)))
{
    foreach (ZipEntry e in zip)
    {
        e.Extract(extractDirectory);
    }
}
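For the Java half of the question: from Java 7 onward, java.util.zip.ZipFile has a constructor that takes a Charset, used to decode entry names that are not flagged as UTF-8. A rough sketch of the same "tell the reader the encoding" idea in plain Java follows. The class and method names are hypothetical, and the caller still has to know or guess the right charset, exactly as with the DotNetZip call above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of any library.
class ZipCharsetExtract {

    // Lists entry names, decoding non-UTF-8 names with the given charset.
    static List<String> entryNames(Path zip, Charset nameCharset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip.toFile(), nameCharset)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    // Extracts every entry into targetDir, using nameCharset for the file names.
    static void extract(Path zip, Path targetDir, Charset nameCharset) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        try (ZipFile zf = new ZipFile(zip.toFile(), nameCharset)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                Path out = root.resolve(entry.getName()).normalize();
                // Refuse names such as "../x" that would land outside targetDir.
                if (!out.startsWith(root)) {
                    throw new IOException("entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                    continue;
                }
                Files.createDirectories(out.getParent());
                try (InputStream in = zf.getInputStream(entry)) {
                    Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```

For the cp950 case above, the call would look something like ZipCharsetExtract.extract(zipPath, targetDir, Charset.forName("Big5")), assuming the running JDK ships that charset. Whether the decoded names can actually be created on disk also depends on the file name encoding of the platform.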

DotNetZip can also create zip files with arbitrary encodings - "non compliant" zips.

DotNetZip is free, and open source.

Cheeso
Thanks, but installing 7z was easier, because it's already in the repository.
alamar