Hi,

Do you know of a library or a way in Java to generate a tar archive with file names in the proper Windows national code page (for example, Cp1250)?

I tried with Java tar, example code:

final TarEntry entry = new TarEntry( files[i] );
String filename = files[i].getPath().replaceAll( baseDir, "" );
entry.setName( new String( filename.getBytes(), "Cp1250" ) );
out.putNextEntry( entry );
...

It doesn't work. National characters are broken when I extract the tar on Windows. I've also found a strange thing: under Linux, Polish national characters are shown correctly only when I use ISO-8859-1:

entry.setName( new String( filename.getBytes(), "ISO-8859-1" ) );

This is despite the fact that the proper Polish code page is ISO-8859-2, which doesn't work either. I've also tried Cp852 for Windows, with no effect.

I know the limitations of the tar format, but changing the format is not an option.

Thanks for suggestions,

A: 

tar doesn't allow for non-ASCII values in its headers. If you try a different encoding, the result is probably up to what the target platform decides to do with those byte values. It kind of sounds like your target platform's tar program is interpreting the bytes as ISO-8859-1, which is why that 'works'.

Have a look at extended attributes? http://www.freebsd.org/cgi/man.cgi?query=tar&sektion=5&manpath=FreeBSD+8-current

I am no expert here but this seems to be the only official way to put any non-ASCII values in a tar file header.
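To make the suggestion concrete, here is a minimal sketch of what a single extended header record looks like, assuming the "extended attributes" in the linked man page are the pax-style records of the form "LENGTH KEYWORD=VALUE\n" with the value in UTF-8 and LENGTH counting the whole record, including its own digits. The class and method names are made up for illustration, and the sketch only builds the record bytes; it does not write a complete archive.

```java
import java.nio.charset.StandardCharsets;

public class PaxRecordSketch {

    // Builds one record: "<length> <keyword>=<value>\n", encoded as UTF-8.
    // <length> is the decimal size in bytes of the entire record,
    // including the length digits themselves.
    static byte[] paxRecord(String keyword, String value) {
        byte[] body = (" " + keyword + "=" + value + "\n")
                .getBytes(StandardCharsets.UTF_8);

        // The length field counts its own digits, so iterate until the
        // digit count stops changing (for example 9 + 1 = 10 needs 2 digits).
        int total = body.length;
        int digits;
        do {
            digits = String.valueOf(total).length();
            total = body.length + digits;
        } while (String.valueOf(total).length() != digits);

        byte[] prefix = String.valueOf(total).getBytes(StandardCharsets.US_ASCII);
        byte[] record = new byte[total];
        System.arraycopy(prefix, 0, record, 0, prefix.length);
        System.arraycopy(body, 0, record, prefix.length, body.length);
        return record;
    }

    public static void main(String[] args) {
        // Hypothetical file name with Polish characters.
        byte[] record = paxRecord("path", "\u017c\u00f3\u0142\u0107.txt");
        System.out.print(new String(record, StandardCharsets.UTF_8));
        System.out.println("record length in bytes: " + record.length);
    }
}
```

Whether the extracting program on Windows honours such records depends on the tool, so this would need to be checked against the target unpacker.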

Sean Owen
+1  A: 

Officially, TAR doesn't support non-ASCII in headers. However, I was able to use UTF-8 encoded filenames on Linux.

You should try this:

String filename = files[i].getName();
byte[] bytes = filename.getBytes("Cp1250");
entry.setName(new String(bytes, "ISO-8859-1"));
out.putNextEntry( entry );

This at least preserves the Cp1250 bytes in the TAR headers.

ZZ Coder
Thanks a lot! It works. National characters are OK after unpacking on Windows. I need to look into the construct ''new String(filename.getBytes("Cp1250"), "ISO-8859-1")'' and understand it properly.
pawelsto
You have to read the TAR code to see how it works. TarEntry doesn't understand encodings. It simply copies the lower byte of each UTF-16 char to the TAR file. In Unicode, the code points 0 to 255 map one-to-one onto Latin-1, so we use Latin-1 to carry the byte array through unchanged. It has nothing to do with the names actually being Latin-1 text.
ZZ Coder
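
The mechanism described in the comment above can be checked without any tar library. The following is a minimal sketch, assuming the writer really does keep only the low byte of each char as stated; `lowBytes` is a hypothetical stand-in for that behaviour, not the library's actual code, and the file name is just an example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {

    // Encodes the name in the target code page, then wraps every byte in
    // one char (U+0000..U+00FF) by decoding the bytes as ISO-8859-1.
    static String toByteTransparentName(String name, Charset target) {
        byte[] encoded = name.getBytes(target);
        return new String(encoded, StandardCharsets.ISO_8859_1);
    }

    // Stand-in for a writer that keeps only the low byte of each char.
    static byte[] lowBytes(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        Charset cp1250 = Charset.forName("windows-1250");
        String name = "za\u017c\u00f3\u0142\u0107.txt"; // Polish letters

        String wrapped = toByteTransparentName(name, cp1250);
        byte[] written = lowBytes(wrapped);

        // The bytes that reach the header are exactly the Cp1250 bytes.
        System.out.println(Arrays.equals(written, name.getBytes(cp1250)));
        // A reader that decodes the header as Cp1250 recovers the name.
        System.out.println(new String(written, cp1250).equals(name));
    }
}
```

The wrapped string looks like garbage if printed, which is expected: it is only a container for bytes, and it becomes readable again once the extracting side decodes those bytes as Cp1250.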