ansaurus

Question

How do I distinguish between 'binary' and 'text' files?

Answer 1

+3 A:

Well, if you are just inspecting the entire file, see if every character is printable with isprint(c), allowing for whitespace such as newlines and tabs, which isprint rejects. It gets a little more complicated for Unicode.

To distinguish a Unicode text file, MSDN offers some great advice as to what to do.

The gist of it is to first inspect up to the first four bytes:

EF BB BF     UTF-8
FF FE        UTF-16, little-endian
FE FF        UTF-16, big-endian
FF FE 00 00  UTF-32, little-endian
00 00 FE FF  UTF-32, big-endian

That will tell you the encoding. Then, you'd want to use iswprint(c) for the rest of the characters in the text file. For UTF-8 and UTF-16, you need to parse the data manually, since a single character can be represented by a variable number of bytes. Also, if you want to be really thorough, you'll want to use the locale variant of iswprint if that's available on your platform.
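A minimal sketch of the byte-order-mark check from the table above. The names detect_bom and bom_encoding are made up for illustration and do not come from MSDN. The one subtlety is ordering: the UTF-32 little-endian mark starts with the UTF-16 little-endian mark, so the four-byte patterns have to be tested first.

```c
#include <stddef.h>

/* Encodings identifiable from a byte-order mark (BOM). Hypothetical names. */
typedef enum {
    ENC_UNKNOWN,   /* no BOM found; says nothing about the real encoding */
    ENC_UTF8,
    ENC_UTF16_LE,
    ENC_UTF16_BE,
    ENC_UTF32_LE,
    ENC_UTF32_BE
} bom_encoding;

/*
 * Inspect up to the first four bytes of buf and report the BOM, if any.
 * UTF-32 is checked before UTF-16 because FF FE 00 00 begins with FF FE.
 * Caveat: UTF-16 LE text whose first character after the BOM is U+0000
 * is indistinguishable from UTF-32 LE by this test alone.
 */
bom_encoding detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 4) {
        if (buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00)
            return ENC_UTF32_LE;
        if (buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF)
            return ENC_UTF32_BE;
    }
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return ENC_UTF16_LE;
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return ENC_UTF16_BE;
    }
    return ENC_UNKNOWN;
}
```

A caller would read the first four bytes of the file into a buffer and pass the number of bytes actually read as len, so short files are handled without reading past the end.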

MSN 2009-02-19 23:51:10

This only works for files that follow this rule.

Georg 2009-02-20 00:57:59

Well, if it doesn't follow those rules, then it really isn't a text file. Except for MBCS, but that's an entirely different story.

MSN 2009-02-20 05:54:48

Answer 2

+2 A:

Most programs that try to tell the difference use a heuristic, such as examining the first n bytes of the file and seeing if those bytes all qualify as 'text' or not (i.e., do they all fall within the range of printable ASCII characters). For finer distinction there's always the 'file' command on UNIX-like systems.
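The first-n-bytes heuristic might look roughly like this in C. The function name and the decision to accept whitespace (tabs, newlines and so on) alongside printable characters are assumptions of this sketch, not something the answer specifies.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Return 1 if each of the first min(len, n) bytes of buf is printable
 * ASCII or ASCII whitespace, 0 otherwise. An empty buffer counts as text.
 * Hypothetical helper; real tools use more elaborate rules.
 */
int looks_like_ascii_text(const unsigned char *buf, size_t len, size_t n)
{
    size_t limit = len < n ? len : n;

    for (size_t i = 0; i < limit; i++) {
        unsigned char c = buf[i];

        if (c > 0x7F)                      /* outside the ASCII range */
            return 0;
        if (!isprint(c) && !isspace(c))    /* control byte such as NUL or DEL */
            return 0;
    }
    return 1;
}
```

Passing the bytes as unsigned char matters here: handing a negative char value to isprint or isspace is undefined behavior.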

dwc 2009-02-19 23:52:22

Answer 3

+10 A:

You can use the "file" command. It does a bunch of tests on the file (man file) to decide if it's binary or text. You can look at/borrow its source code if you need to do that from C.

file README
README: ASCII English text, with very long lines

file /bin/bash
/bin/bash: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), stripped

2009-02-19 23:54:20

definitely the answer I was going to give...

bjeanes 2009-02-19 23:55:09

+1 If it's a Linux system, file is going to have much better heuristics than anything you'll build yourself.

Adam Lassek 2009-02-20 22:07:50

Yeah, if file is available, it is going to be the best tool for the job. No question! Also the 'file -I' is a neat trick. I hadn't thought of shelling out for my particular problem, however I don't think I could cop the performance overhead. Thanks!

benno 2009-02-20 22:23:56

Answer 4

A:

You can determine the MIME type of the file (file -i on Linux, file -I (capital i) on Mac OS X). If it starts with text/, it's text, otherwise binary. The only exceptions are XML applications. You can match those by looking for +xml at the end of the MIME type.
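Once you have the MIME string, the rule above is a few lines of C. This sketch assumes the caller has already stripped the leading "filename: " part of the file output and that the type is lowercase, as file normally prints it. The name mime_is_text is made up.

```c
#include <string.h>

/*
 * Return 1 if a MIME string such as "text/plain; charset=us-ascii"
 * denotes text under the rule above: the type starts with "text/" or the
 * type/subtype part ends in "+xml". Anything after ';' or whitespace
 * (the parameters) is ignored. Comparison is case-sensitive (assumption).
 */
int mime_is_text(const char *mime)
{
    size_t len = strcspn(mime, "; \t");   /* length of the type/subtype part */

    if (len >= 5 && strncmp(mime, "text/", 5) == 0)
        return 1;
    if (len >= 4 && strncmp(mime + len - 4, "+xml", 4) == 0)
        return 1;
    return 0;
}
```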

phihag 2009-02-19 23:55:50

I think that should be "file -I" (upper case). At least according to my tests and man page.

benno 2009-02-20 22:22:20

Just looked it up, lower case is correct in Debian and Gentoo Linux. Their file is ftp://ftp.astron.com/pub/file/file-5.00.tar.gz (or a different version). -I (upper) is an option in neither one.

phihag 2009-02-20 22:45:10

Huh, weird. The version on OS X (4.17) uses -I (upper) and the one on my Linux boxes (4.24) uses -i (lower). How bizarre! I wonder if it is an OS X-ism, or the authors simply changed the interface between point releases.

benno 2009-02-21 20:02:39

Answer 5

+2 A:

One simple check is whether the file contains \0 characters. Text files don't have them.
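In C this is a single memchr call over whatever part of the file you have read. The function name is made up. Note that text stored as UTF-16 or UTF-32 routinely contains zero bytes, so the check as written only suits single-byte encodings and UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if the first len bytes of buf contain a NUL byte, else 0. */
int contains_nul(const unsigned char *buf, size_t len)
{
    return memchr(buf, '\0', len) != NULL;
}
```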

Georg 2009-02-20 00:59:21

Answer 6

+3 A:

Our software reads a number of binary file formats as well as text files.

We first look at the first few bytes for a magic number which we recognize. If we do not recognize the magic number of any of the binary types we read, then we look at up to the first 2K bytes of the file to see whether it appears to be UTF-8, UTF-16, or text encoded in the current code page of the host operating system. If it passes none of these tests, we assume that it is not a file we can deal with and throw an appropriate exception.
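The order of those checks could be sketched as follows. Everything here is hypothetical: the PNG signature stands in for the binary formats the software actually reads, and the text probe is a crude placeholder for the UTF-8, UTF-16 and code page tests, which the answer does not describe in detail.

```c
#include <stddef.h>
#include <string.h>

#define PROBE_LIMIT 2048   /* "up to the first 2K bytes" */

typedef enum { KIND_KNOWN_BINARY, KIND_TEXT, KIND_UNSUPPORTED } file_kind;

/* Stand-in for the application's own formats: only the PNG signature. */
static int has_known_magic(const unsigned char *buf, size_t len)
{
    static const unsigned char png[] =
        {0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    return len >= sizeof png && memcmp(buf, png, sizeof png) == 0;
}

/* Placeholder text probe: rejects NUL and non-whitespace control bytes. */
static int probe_is_text(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c < 0x20 && c != '\t' && c != '\n' && c != '\r' && c != '\f')
            return 0;
    }
    return 1;
}

/* Magic number first, then the bounded text probe, then give up. */
file_kind classify(const unsigned char *buf, size_t len)
{
    size_t probe = len < PROBE_LIMIT ? len : PROBE_LIMIT;

    if (has_known_magic(buf, len))
        return KIND_KNOWN_BINARY;
    if (probe_is_text(buf, probe))
        return KIND_TEXT;
    return KIND_UNSUPPORTED;   /* the caller would raise its error here */
}
```

Capping the probe at 2K keeps the cost constant for large files, at the price of missing binary data that only shows up later in the file.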

Joe Erickson 2009-02-20 01:10:00

Answer 7

+1 A:

As previously stated, *nix operating systems have this ability within the file command. This command uses a configuration file that defines magic numbers contained within many popular file structures.

This file, called magic, was historically stored in /etc, although it may be in /usr/share on some distributions. The magic file defines offsets of values known to exist within the file, and file can then examine these locations to determine the type of the file.

The structure and description of the magic file can be found by consulting the relevant manual page (man magic).
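To make the offset idea concrete, here is a much simplified C sketch of an offset-based lookup. It is not how file parses its magic database, whose entries are far richer; the struct, the function name and the three hard-coded rules are illustrative assumptions.

```c
#include <stddef.h>
#include <string.h>

/* One simplified rule: bytes expected at a fixed offset, plus a label. */
struct magic_rule {
    size_t offset;
    const char *bytes;
    size_t length;
    const char *description;
};

static const struct magic_rule rules[] = {
    {0,   "\x7f" "ELF", 4, "ELF"},          /* split so \x7f does not absorb 'E' */
    {0,   "%PDF-",      5, "PDF document"},
    {257, "ustar",      5, "tar archive"},  /* tar keeps its magic at offset 257 */
};

/* Return the description of the first matching rule, or NULL if none match. */
const char *match_magic(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        const struct magic_rule *r = &rules[i];

        if (len >= r->offset + r->length &&
            memcmp(buf + r->offset, r->bytes, r->length) == 0)
            return r->description;
    }
    return NULL;
}
```

The tar rule shows why the offset is stored per entry: not every format keeps its signature in the first few bytes.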

As for an implementation, that can be found within file.c itself. The relevant portion of the file command that determines whether the data is readable text or not is the following:

    /* Make sure we are dealing with ascii text before looking for tokens */
    for (i = 0; i < nbytes - 1; i++) {
        if (!isascii(buf[i]) ||
            (iscntrl(buf[i]) && !isspace(buf[i]) &&
             buf[i] != '\b' && buf[i] != '\032' && buf[i] != '\033'))
            return 0; /* not all ASCII */
    }

Steve Weet 2009-02-20 01:47:57
