The Linux file command does a very good job of recognising file types and gives very fine-grained results. The diff tool can tell binary files from text files and produces different output for each.

Is there a way to tell binary files from text files? All I want is a yes/no answer as to whether a given file is binary. Because it's difficult to define binary, let's say I want to know if diff will attempt a text-based comparison.

To clarify the question: I do not care if it's ASCII text or XML as long as it's text. Also, I do not want to differentiate between MP3 and JPEG files, as they're all binary.

+2  A: 

A quick-and-dirty way is to look for a NUL character (a zero byte) in the first kilobyte or two of the file. As long as you're not worried about UTF-16 or UTF-32, no text file should ever contain a NUL.

Update: According to the diff manual, this is exactly what diff does.
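
A minimal Python sketch of this check, for illustration only. The function name `has_nul_in_prefix` and the default `prefix_size` of 2048 bytes are assumptions made here, not values taken from diff; the number of bytes diff actually inspects is system dependent.

```python
def has_nul_in_prefix(path, prefix_size=2048):
    """Return True if a NUL byte occurs in the first prefix_size bytes of path.

    prefix_size is an assumed value standing in for "the first K or two".
    """
    # Open in binary mode so no decoding is attempted and NUL bytes survive.
    with open(path, "rb") as f:
        prefix = f.read(prefix_size)
    return b"\0" in prefix
```

A file for which this returns True would be treated as binary under the heuristic above. As noted, UTF-16 and UTF-32 text will usually be reported as binary too, because those encodings contain zero bytes.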

RichieHindle
(If you're not settled somewhere in Asia,) I'd go with this approach.
Boldewyn
+1  A: 

These days the term "text file" is ambiguous, because a text file can be encoded in ASCII, ISO-8859-*, UTF-8, UTF-16, UTF-32 and so on.

See here for how Subversion does it.

Christoffer Hammarström
But literally *everything* uses bytes > 0x7F today, even translated man pages or ISO-8859 text files. This would exclude way too much, namely every non-ASCII text file. Since the probability of seeing a \0 in a text file is vanishingly small, however, RichieHindle's approach seems more appropriate to me (that is, for every file written since the early 80s).
Boldewyn
You're right, I edited my answer.
Christoffer Hammarström
+3  A: 

You could try running

strings yourfile

and comparing the size of its output with the file size. I'm not totally sure, but if they are the same, the file is most likely a text file.
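
A rough Python sketch of the same idea, assuming we approximate what strings keeps (runs of at least 4 printable ASCII bytes or tabs) with a regular expression instead of calling the real command. The name `printable_fraction` is hypothetical, and the result will not match the output of any particular `strings` implementation byte for byte.

```python
import re

# Runs of at least 4 printable ASCII bytes (space through tilde) or tabs.
# This only approximates the default behaviour of a typical strings command.
_PRINTABLE_RUN = re.compile(rb"[\x20-\x7e\t]{4,}")

def printable_fraction(data):
    """Return the fraction of non-newline bytes in data that fall inside printable runs."""
    # Newlines and carriage returns separate runs, so leave them out of the total.
    payload = len(data) - data.count(b"\n") - data.count(b"\r")
    if payload == 0:
        return 1.0  # empty input, or newlines only: nothing suggests binary content
    covered = sum(len(run) for run in _PRINTABLE_RUN.findall(data))
    return covered / payload
```

A value close to 1.0 suggests text and a value close to 0.0 suggests binary. Lines shorter than 4 characters and non-ASCII text lower the score, so this is a heuristic to compare against a threshold you choose, not an exact test.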

Simone Margaritelli
Where 'definitely' depends on the implementation of the `strings` command. But, yes, +1 for the idea.
Boldewyn
I said "not totally sure" just because of the implementation issue, but in general it should work.
Simone Margaritelli
+1, this would work on most GNU platforms. `strings file | wc -c` then `wc -c file`.
Tim Post
Any way to do it without creating a temporary file? Also, large files could pose an issue here.
gabor
`strings file | head -c <bytes you want to check> | wc -c` and compare to `wc -c file`
Simone Margaritelli
+4  A: 

file is still the command you want. Any file that is text (according to its heuristics) will include the word "text" in the output of file; anything that is binary will not include the word "text".

If you don't agree with the heuristics that file uses to determine text vs. not-text, then the question needs to be better specified, since text vs. non-text is an inherently vague question. For example, file does not identify a PGP public key block in ASCII as "text", but you might (since it is composed only of printable characters, even though it is not human-readable).
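
A small Python sketch of this test, assuming the `file` command is installed and on the PATH. The helper names are hypothetical. The `--brief` option is used so that the file name itself is not part of the output being searched.

```python
import subprocess

def output_says_text(file_output):
    """Return True if a description printed by file contains the word "text"."""
    return "text" in file_output

def is_text_according_to_file(path):
    """Run file on path and apply the "contains the word text" rule to its output."""
    # "--" stops option parsing, so a path starting with "-" is not read as a flag.
    result = subprocess.run(
        ["file", "--brief", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return output_says_text(result.stdout)
```

Without `--brief`, a binary file named something like `text.jpg` would match by accident, because file prints the name before the description.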

Tyler McHenry
The output of `file` does not always contain the word "text"; for example, it doesn't for XML files. However, `file -i` will give the MIME type, which does contain the word "text".
gabor
When I execute `file` on an XML document, it reports "XML document text". Perhaps the one you're testing with includes encoded binary data that file doesn't think is text?
Tyler McHenry
+3  A: 

The diff manual specifies that

diff determines whether a file is text or binary by checking the first few bytes in the file; the exact number of bytes is system dependent, but it is typically several thousand. If every byte in that part of the file is non-null, diff considers the file to be text; otherwise it considers the file to be binary.

David Schmitt
This seems to be a good heuristic. `svn` does the same thing, as noted by Christoffer below.
gabor
A: 

Commands like less and grep detect it quite easily (and fast). You can have a look at their source.

Raghu
A: 

Try this:

def is_binary(filename):
    """Return true if the given filename is binary.
    @raise EnvironmentError: if the file does not exist or cannot be accessed.
    @attention: found @ http://bytes.com/topic/python/answers/21222-determine-file-type-binary-text on 6/08/2010
    @author: Trent Mick <[email protected]>
    @author: Jorge Orpinel <[email protected]>"""
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while True:
            chunk = fin.read(CHUNKSIZE)
            if b'\0' in chunk:  # found a null byte (bytes literal, since the file is opened in 'rb' mode)
                return True
            if len(chunk) < CHUNKSIZE:
                break  # reached end of file
    # Note: Python does not require an "except:" clause here; try/finally is enough.
    finally:
        fin.close()

    return False
Jorge Orpinel
Whoops, I thought this was a Python thread, sorry about that.
Jorge Orpinel