I'm writing a shell script that will create a textual (i.e. diffable) dump of an archive.

I'd like to detect whether each file is printable in some given character set; if it is, I'd like to convert it to that character set from whatever one it's in (if possible) and make its contents part of the dump.

I've considered using the file utility, but there doesn't seem to be any way to tell it to just print the character encoding (or "data" for binary files). For example:

$ file -e soft -e tokens -e tar -e apptype -e cdf -e compress -e elf -e tar config.sub
config.sub: Lisp/Scheme program text

config.sub is one of the files distributed with the file source code.

I'm also a bit wary of parsing its rather unpredictable output.

I'd like to keep dependencies for this script to a minimum. I'm already using perl, but would prefer not to have to rely on any perl packages. Presumably iconv would be the best way to do the conversion, and I don't mind making this a dependency.
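For the conversion step, wrapping `iconv` might look something like this rough sketch (in Python, since that's where this ended up per the update below; the source encoding shown is only a placeholder for whatever a detection step reports):

```python
# Rough sketch: transcode a file to UTF-8 by shelling out to iconv.
# "latin-1" is a placeholder source encoding, not something detected here.
import subprocess

def to_utf8(path, source_encoding="latin-1"):
    result = subprocess.run(
        ["iconv", "-f", source_encoding, "-t", "UTF-8", path],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")
```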

On the other hand, maybe a utility like my nascent script is already readily available?

Update: I ended up writing this in Python instead. It can be found in its GitHub repo or on PyPI. The current version doesn't actually do what I asked about in this question: that turned out to be too time-consuming and not important enough to implement.

It might make its way into a later revision, though; if so, I will likely end up using some combination of quick scanning for binary detection (as mentioned in one of the comment threads) and use of the chardet module, as mentioned by Zack. Another option might be to use the Python wrapper for the file C utility, though I'm not sure how portable this is.

+1  A: 

The Universal Encoding Detector does a pretty damn good job of this -- it's not possible to do it perfectly, alas. And it requires Python.

Zack
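For reference, basic use of the detector (the `chardet` module) looks roughly like this; the file name is only an example:

```python
# Guess the encoding of a file's raw bytes with chardet.
import chardet

with open("config.sub", "rb") as f:
    raw = f.read()

guess = chardet.detect(raw)
# guess is a dict along the lines of {'encoding': 'ascii', 'confidence': 1.0}
print(guess["encoding"], guess["confidence"])
```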
I'm leaning towards a rewrite in Python anyway. The stumbling block there is that I'd like to tidy XML files on their way out, and it seems confusing to mix dependencies on Python modules with dependencies on utils like `tidy`. I could probably use `lxml` instead of `tidy`, but that's sort of the same thing since `lxml` has external dependencies — `libxml2` and `libxslt`.
intuited
At this point I'm regretting `+1`-ing your answer, because it seems terribly inappropriate to my goals. I suspect/hope that `chardet` is quite good and reasonably fast when files are known to be text, but it's abysmal at detecting binary. It took about a minute, maybe more, to determine with 0.316 confidence that the python executable was in ISO-8859-2 encoding. That was using the incremental analyzer; it reads the whole file anyway because "Not a valid encoding" is not something that it can have confidence in. The docs are fairly sparse, though; maybe there is an option to enable this?
intuited
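The incremental analyzer referred to here is `chardet`'s `UniversalDetector`; roughly, it's used like this (feed chunks until the detector reports it's done, which, per the comment above, may never happen for binary input):

```python
# Feed the file to chardet incrementally instead of reading it all at once.
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open("config.sub", "rb") as f:
    for chunk in iter(lambda: f.read(4096), b""):
        detector.feed(chunk)
        if detector.done:   # detector has reached a confident guess
            break
detector.close()
print(detector.result)      # e.g. {'encoding': ..., 'confidence': ...}
```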
You could slap a really simple binary detector on the front of it. I suggest the "text or binary?" algorithm from http://tools.ietf.org/html/draft-abarth-mime-sniff-05#page-8 (it basically scans the first 512 bytes of the file looking for control characters and/or all-ASCII magic sequences that identify common binary files, e.g. "GIF89a".)
Zack
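A rough sketch of that kind of check (only the low control-character scan; the draft's exact byte ranges and its magic-number matching are left out):

```python
# Treat a file as binary if its first 512 bytes contain control characters
# other than the usual text ones (tab, LF, FF, CR, ESC).
def looks_binary(path, sniff_len=512):
    allowed_controls = {0x09, 0x0A, 0x0C, 0x0D, 0x1B}
    with open(path, "rb") as f:
        head = f.read(sniff_len)
    return any(b < 0x20 and b not in allowed_controls for b in head)
```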
This is basically what I ended up doing, except I'm going with something really, *really* simple: just checking for NUL characters in the first 8000 bytes of the file. This is exactly what `git diff` does, and it's the same approach used by GNU `diff` and `grep`. At this point, actually converting from another character set is a nice-to-have that turned out probably not to be necessary. So I'm just checking the same initial chunk of the file for UTF-8 compatibility by trying to `str.decode` it. The thing is working reasonably quickly.
intuited
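Sketched out, that check might look something like this (the 8000-byte chunk size is the one mentioned above; note that cutting the chunk off can split a multi-byte UTF-8 sequence and cause a spurious decode error right at the boundary):

```python
# Classify a file by checking its first 8000 bytes: a NUL byte means binary,
# otherwise try decoding the chunk as UTF-8.
def classify(path, chunk_size=8000):
    with open(path, "rb") as f:
        head = f.read(chunk_size)
    if b"\x00" in head:
        return "binary"
    try:
        head.decode("utf-8")
        return "utf-8 text"
    except UnicodeDecodeError:
        return "not utf-8 (possibly another text encoding, or binary)"
```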
Thanks for that link; it's handy to have all the disparate factors condensed into one document. I'm mostly targeting OpenDocument Format files with this utility (at least initially), and I think the archive members will always be either ASCII or binary. But that info will come in handy if I decide to generalize it to handle a wider range of archives.
intuited
+2  A: 

Have you tried the mime options, which give more consistent output?

file --mime-encoding --mime-type -b somefile
Dennis Williamson
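Wrapped in a script, that might look something like the following sketch (querying the two options separately keeps the output trivial to parse; this assumes a version of `file` that supports them):

```python
# Ask file(1) for the MIME type and character encoding of a file.
import subprocess

def file_mime_info(path):
    def run(option):
        out = subprocess.run(
            ["file", "-b", option, path],
            capture_output=True, text=True, check=True,
        ).stdout
        return out.strip()
    return run("--mime-type"), run("--mime-encoding")

# file_mime_info("somefile") might return something like
# ("text/plain", "us-ascii")
```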
Hey, that might work well. Thanks.
intuited