I'm writing a shell script that will create a textual (i.e. diffable) dump of an archive.

I'd like to detect whether each file is printable in some given character set; if it is, I'd like to convert it to that character set from whatever one it's in (if possible) and make its contents part of the dump.

I've considered using the file utility, but there doesn't seem to be any way to tell it to just print the character encoding (or "data" for binary files). For example:

$ file -e soft -e tokens -e tar -e apptype -e cdf -e compress -e elf -e tar config.sub
config.sub: Lisp/Scheme program text

config.sub is one of the files distributed with the file source code.

I'm also a bit wary of parsing its rather unpredictable output.

I'd like to keep dependencies for this script to a minimum. I'm already using perl, but would prefer not to have to rely on any perl packages. Presumably iconv would be the best way to do the conversion, and I don't mind making this a dependency.
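For the conversion step, wrapping `iconv` might look something like this rough sketch (in Python, since that's where this ended up per the update below; the source encoding shown is only a placeholder for whatever a detection step reports):

```python
# Rough sketch: transcode a file to UTF-8 by shelling out to iconv.
# "latin-1" is a placeholder source encoding, not something detected here.
import subprocess

def to_utf8(path, source_encoding="latin-1"):
    result = subprocess.run(
        ["iconv", "-f", source_encoding, "-t", "UTF-8", path],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")
```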

On the other hand, maybe a utility like my nascent script is already readily available?

Update: I ended up writing this in Python instead. It can be found in its GitHub repo or on PyPI. The current version doesn't actually do what I asked about in this question: that turned out to be too time-consuming and not important enough to implement.

It might make its way into a later revision, though; if so, I will likely end up using some combination of quick scanning for binary detection (as mentioned in one of the comment threads) and use of the chardet module, as mentioned by Zack. Another option might be to use the Python wrapper for the file C utility, though I'm not sure how portable this is.

+1  A: 

The Universal Encoding Detector does a pretty damn good job of this -- it's not possible to do it perfectly, alas. And it requires Python.

Zack
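For reference, basic use of the detector (the `chardet` module) looks roughly like this; the file name is only an example:

```python
# Guess the encoding of a file's raw bytes with chardet.
import chardet

with open("config.sub", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)
# guess is a dict along the lines of {'encoding': 'ascii', 'confidence': 1.0}
print(guess["encoding"], guess["confidence"])
```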
I'm leaning towards a rewrite in Python anyway. The stumbling block there is that I'd like to tidy XML files on their way out, and it seems confusing to mix dependencies on Python modules with dependencies on utils like `tidy`. I could probably use `lxml` instead of `tidy`, but that's sort of the same thing since `lxml` has external dependencies — `libxml2` and `libxslt`.
intuited
At this point I'm regretting `+1`-ing your answer, because it seems terribly inappropriate to my goals. I suspect/hope that `chardet` is quite good and reasonably fast when files are known to be text, but it's abysmal at detecting binary. It took about a minute, maybe more, to determine with 0.316 confidence that the python executable was in ISO-8859-2 encoding. That was using the incremental analyzer; it reads the whole file anyway because "Not a valid encoding" is not something that it can have confidence in. The docs are fairly sparse, though; maybe there is an option to enable this?
intuited
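The incremental analyzer referred to here is `chardet`'s `UniversalDetector`; roughly, it's used like this (feed chunks until the detector reports it's done, which, per the comment above, may never happen for binary input):

```python
# Feed the file to chardet incrementally instead of reading it all at once.
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open("config.sub", "rb") as f:
    for chunk in iter(lambda: f.read(4096), b""):
        detector.feed(chunk)
        if detector.done:   # detector has reached a confident guess
            break
detector.close()
print(detector.result)      # e.g. {'encoding': ..., 'confidence': ...}
```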
You could slap a really simple binary detector on the front of it. I suggest the "text or binary?" algorithm from http://tools.ietf.org/html/draft-abarth-mime-sniff-05#page-8 (it basically scans the first 512 bytes of the file looking for control characters and/or all-ASCII magic sequences that identify common binary files, e.g. "GIF89a".)
Zack
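A rough sketch of that kind of check (only the low control-character scan; the draft's exact byte ranges and its magic-number matching are left out):

```python
# Treat a file as binary if its first 512 bytes contain control characters
# other than the usual text ones (tab, LF, FF, CR, ESC).
def looks_binary(path, sniff_len=512):
    allowed_controls = {0x09, 0x0A, 0x0C, 0x0D, 0x1B}
    with open(path, "rb") as f:
        head = f.read(sniff_len)
    return any(b < 0x20 and b not in allowed_controls for b in head)
```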
This is basically what I ended up doing, except I'm going with something really, *really* simple: just checking for NUL characters in the first 8000 bytes of the file. This is exactly what `git diff` does, and it's the same approach used by GNU `diff` and `grep`. At this point, actually converting from another character set is a nice-to-have that turned out probably not to be necessary. So I'm just checking the same initial chunk of the file for UTF-8 compatibility by trying to `str.decode` it. The thing is working reasonably quickly.
intuited
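Sketched out, that check might look something like this (the 8000-byte chunk size is the one mentioned above; note that cutting the chunk off can split a multi-byte UTF-8 sequence and cause a spurious decode error right at the boundary):

```python
# Classify a file by checking its first 8000 bytes: a NUL byte means binary,
# otherwise try decoding the chunk as UTF-8.
def classify(path, chunk_size=8000):
    with open(path, "rb") as f:
        head = f.read(chunk_size)
    if b"\x00" in head:
        return "binary"
    try:
        head.decode("utf-8")
        return "utf-8 text"
    except UnicodeDecodeError:
        return "not utf-8 (possibly another text encoding, or binary)"
```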
Thanks for that link; it's handy to have all the disparate factors condensed into one document. I'm mostly targeting OpenDocument Format files with this utility (at least initially), and I think the archive members will always be either ASCII or binary. But that info will come in handy if I decide to generalize it to handle a wider range of archives.
intuited
+2  A: 

Have you tried the mime options, which give more consistent output?

file --mime-encoding --mime-type -b somefile
Dennis Williamson
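Wrapped in a script, that might look something like the following sketch (querying the two options separately keeps the output trivial to parse; this assumes a version of `file` that supports them):

```python
# Ask file(1) for the MIME type and character encoding of a file.
import subprocess

def file_mime_info(path):
    def run(option):
        out = subprocess.run(
            ["file", "-b", option, path],
            capture_output=True, text=True, check=True,
        ).stdout
        return out.strip()
    return run("--mime-type"), run("--mime-encoding")

# file_mime_info("somefile") might return something like
# ("text/plain", "us-ascii")
```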
Hey, that might work well. Thanks.
intuited