detecting binary files and character encodings in zipfiles

+1 Q:

When reading zip files from an unknown source (using Java's ZipInputStream or any other library), is there any way to detect which entries are "character data" (and, if so, their encoding) and which are "binary data"? And, for binary entries, is there any way to determine more information (MIME type, etc.)?

EDIT: Does a byte order mark (BOM) occur in zip entries, and if so, does it need special handling?

+1 A:

It basically boils down to heuristics for determining the contents of files. For instance, for plain ASCII text files it should be possible to make a fairly good guess by checking the range of byte values used in the file, although this will never be completely foolproof.

You should try to limit the classes of file types you want to identify. For example, is it enough to distinguish between "text data" and "binary data"? If so, you should be able to get a fairly high detection success rate.
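As a rough illustration of the byte-range idea, here is a minimal sketch that samples the start of each zip entry and guesses "text" or "binary". The class name ZipTextSniffer, the 4096-byte sample size and the 5% threshold are my own assumptions, not anything standard. Note that it will report UTF-16 text as binary, because UTF-16 is full of NUL bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class ZipTextSniffer {
    /** Number of leading bytes inspected per entry (an arbitrary choice). */
    static final int SAMPLE_SIZE = 4096;

    /** Guesses whether a sample is text, based only on the byte values it contains. */
    static boolean looksLikeText(byte[] sample) {
        int controlBytes = 0;
        for (byte value : sample) {
            int b = value & 0xFF;
            if (b == 0) {
                return false; // NUL is rare in 8-bit text (but common in UTF-16)
            }
            boolean whitespace = b == '\t' || b == '\n' || b == '\r' || b == '\f';
            if (b < 0x20 && !whitespace) {
                controlBytes++;
            }
        }
        // Tolerate a few stray control bytes; the 5% limit is a guess, not a rule.
        return controlBytes * 20 <= sample.length;
    }

    /** Reads up to limit bytes from the stream, stopping early at end of data. */
    static byte[] readSample(InputStream in, int limit) throws IOException {
        byte[] buffer = new byte[limit];
        int filled = 0;
        while (filled < limit) {
            int n = in.read(buffer, filled, limit - filled);
            if (n < 0) {
                break;
            }
            filled += n;
        }
        return Arrays.copyOf(buffer, filled);
    }

    /** Maps each non-directory entry name to true (text-like) or false (binary-like). */
    static Map<String, Boolean> classify(InputStream zipData) throws IOException {
        Map<String, Boolean> result = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Only the sample is read; getNextEntry() skips the rest of the entry.
                byte[] sample = readSample(zip, SAMPLE_SIZE);
                result.put(entry.getName(), looksLikeText(sample));
            }
        }
        return result;
    }
}
```

Bytes of 0x80 and above are deliberately not counted against the entry, since they show up in UTF-8 and Latin-1 text. That makes the check more forgiving, at the cost of letting some binary formats slip through as "text".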

On UNIX systems, there is always the file command, which tries to identify file types based (mostly) on content.

csl 2009-10-08 09:46:03

Maybe implement a Java component capable of applying the rules defined in /usr/share/file/magic. I would love to have something like that. (You would basically need to be able to look at the first few bytes of the file.)
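To give an idea of what "looking at the first few bytes" could look like, here is a small sketch with a hand-written signature table. It does not parse the magic file at all. The class name MagicSniffer and the five signatures are just an illustrative selection, whereas the real database has many more rules with offsets and masks. It also checks for a byte order mark, since a zip entry holds the original file's bytes unchanged, so a BOM in the original file will still be there in the entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MagicSniffer {
    /** A handful of well-known leading signatures, keyed by MIME type. */
    private static final Map<String, int[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("image/png", new int[] {0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A});
        SIGNATURES.put("image/gif", new int[] {'G', 'I', 'F', '8'});
        SIGNATURES.put("image/jpeg", new int[] {0xFF, 0xD8, 0xFF});
        SIGNATURES.put("application/pdf", new int[] {'%', 'P', 'D', 'F', '-'});
        // Also matches zip-based formats such as jar files.
        SIGNATURES.put("application/zip", new int[] {'P', 'K', 0x03, 0x04});
    }

    /** Returns true if data begins with the given unsigned byte values. */
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the MIME type of the first matching signature, or null if none matches. */
    static String guessMimeType(byte[] head) {
        for (Map.Entry<String, int[]> signature : SIGNATURES.entrySet()) {
            if (startsWith(head, signature.getValue())) {
                return signature.getKey();
            }
        }
        return null;
    }

    /** Returns the charset name implied by a leading BOM, or null if there is none. */
    static String charsetFromBom(byte[] head) {
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        // Checked before UTF-16LE because the two BOMs share their first two bytes.
        // FF FE 00 00 could in principle also be UTF-16LE text starting with U+0000.
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }
}
```

A missing BOM says nothing about the encoding, so charsetFromBom returning null only means "unknown". When a BOM is found, the caller will probably want to skip those bytes before decoding, so the mark does not end up as a stray character at the start of the text.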

Wilfred Springer 2009-10-10 18:53:50
