How can I tell if a file is binary (non-text) in Python?

I am searching through a large set of files in Python and keep getting matches in binary files, which makes the output look incredibly messy.

I know I could use grep -I, but I am doing more with the data than what grep allows for.

In the past I would have just searched for bytes greater than 0x7f, but UTF-8 and similar encodings make that unreliable on modern systems. Ideally the solution would be fast, but any solution will do.

A: 

Are you on Unix? If so, try:

import os
isBinary = os.system("file " + name + " | grep text > /dev/null")

The shell's return values are inverted: 0 means success, so if grep finds "text" the call returns 0, which Python treats as False. Any non-zero result therefore means the file is binary.

fortran
Does this work based on the file's extension?
Lakshman Prasad
For reference, the file command guesses a type based on the file's content. I'm not sure whether it pays any attention to the file extension.
David Zaslavsky
I'm almost sure it looks at both the content and the extension.
fortran
This breaks if the path contains "text", though. Make sure to rsplit at the last ':' (provided there's no colon in the file type description).
Alan
Use `file` with the `-b` switch; it'll print only the file type without the path.
dubek
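Putting the comments above together, here is a minimal sketch (not the original poster's code) that calls `file -b` without going through the shell, so spaces or the word "text" in the path cannot affect the result. It assumes a Unix-like system with `file` on the PATH; the helper names `description_is_text` and `is_text_file` are made up for illustration.

```python
import re
import subprocess


def description_is_text(description):
    """Return True if a `file -b` description contains the whole word "text"."""
    return re.search(r"\btext\b", description) is not None


def is_text_file(path):
    """Ask the Unix `file` command about `path` (assumes `file` is installed)."""
    # Passing a list avoids the shell, so odd characters in the path are safe.
    # -b prints only the description, so the path itself cannot cause a match.
    result = subprocess.run(["file", "-b", path], stdout=subprocess.PIPE)
    return description_is_text(result.stdout.decode("utf-8", "replace"))
```

Unlike the `os.system` one-liner, this returns True for text, so there is no inverted exit status to keep in mind.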
+2  A: 

Usually you have to guess.

You can look at the extensions as one clue, if the files have them.

You can also recognise known binary formats, and ignore those.

Otherwise see what proportion of non-printable ASCII bytes you have and take a guess from that.

You can also try decoding from UTF-8 and see if that produces sensible output.
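The last two ideas can be combined into a rough heuristic. The sketch below is only one way to do it: the 30% threshold, the 1024-byte sample size, and the names `looks_like_text` and `file_looks_like_text` are assumptions chosen for illustration, not established values.

```python
import codecs

# Control bytes that are normal in text files: tab, LF, CR, form feed.
ALLOWED_CONTROL = frozenset(b"\t\n\r\f")


def looks_like_text(sample, max_suspicious_ratio=0.30):
    """Guess whether a bytes sample is text (threshold is an assumption)."""
    if not sample:
        return True  # treat an empty file as text
    # Count ASCII control bytes that rarely occur in text.
    suspicious = sum(
        1 for byte in sample
        if (byte < 32 and byte not in ALLOWED_CONTROL) or byte == 127
    )
    high = sum(1 for byte in sample if byte >= 128)
    if high:
        # Bytes >= 0x80 are fine if the sample is valid UTF-8. The incremental
        # decoder tolerates a multi-byte character cut off at the sample end.
        decoder = codecs.getincrementaldecoder("utf-8")()
        try:
            decoder.decode(sample, final=False)
        except UnicodeDecodeError:
            suspicious += high
    return suspicious / len(sample) <= max_suspicious_ratio


def file_looks_like_text(path, sample_size=1024):
    """Apply the heuristic to the first `sample_size` bytes of a file."""
    with open(path, "rb") as handle:
        return looks_like_text(handle.read(sample_size))
```

Reading only a small sample keeps this fast on large files, at the cost of missing binary content that appears later in the file.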

Douglas Leeder
+3  A: 

You can also use the mimetypes module:

import mimetypes
...
mime = mimetypes.guess_type(file)

It's fairly easy to compile a list of binary MIME types. For example, Apache ships with a mime.types file that you could parse into two lists, binary and text, and then check which list a file's MIME type falls in.
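As a rough sketch of that idea, the function below classifies a file by name only. The `EXTRA_TEXT_TYPES` set and the name `is_text_by_name` are illustrative assumptions; a real list would come from something like the mime.types file mentioned above.

```python
import mimetypes

# Assumption: a few non-"text/" MIME types that are still human-readable.
EXTRA_TEXT_TYPES = {"application/json", "application/xml"}


def is_text_by_name(filename):
    """Return True/False based on the file name, or None if the type is unknown."""
    mime, encoding = mimetypes.guess_type(filename)
    if mime is None:
        return None  # unknown extension: the caller has to decide
    if encoding is not None:
        return False  # e.g. "gzip": the bytes on disk are compressed
    return mime.startswith("text/") or mime in EXTRA_TEXT_TYPES
```

Because `guess_type` never opens the file, a misnamed file will be misclassified, so this works best as a cheap first pass.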

Crad
Is there a way to get `mimetypes` to use the contents of a file rather than just its name?
intuited
+3  A: 

If it helps, many binary file types begin with a magic number. Here is a list of file signatures.
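A minimal sketch of checking for magic numbers might look like this. The table holds only a handful of well-known signatures, and the names `SIGNATURES`, `match_signature`, and `sniff_file` are made up for illustration.

```python
# A few well-known file signatures; a real table would be much longer.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"\xff\xd8\xff": "JPEG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"\x1f\x8b": "gzip data",
    b"\x7fELF": "ELF executable",
}


def match_signature(head):
    """Return the name of the format whose signature starts `head`, else None."""
    for signature, name in SIGNATURES.items():
        if head.startswith(signature):
            return name
    return None


def sniff_file(path):
    """Read just enough bytes from `path` to test every known signature."""
    longest = max(len(signature) for signature in SIGNATURES)
    with open(path, "rb") as handle:
        return match_signature(handle.read(longest))
```

A result of None only means the file matched none of the listed formats, not that it is text.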

Shane C. Mason
+2  A: 

Here's a suggestion that uses the Unix file command:

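import re
import subprocess

def istext(path):
    # -L follows symlinks; the output has the form "<path>: <description>".
    output = subprocess.Popen(["file", "-L", path],
                              stdout=subprocess.PIPE).communicate()[0]
    # The output is bytes on Python 3, so match with a bytes pattern.
    return re.search(br':.* text', output) is not None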

Example usage:

>>> istext('/etc/motd')
True
>>> istext('/vmlinuz')
False
>>> open('/tmp/japanese', 'rb').read()
b'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf\xe3\x80\x81\xe3\x81\xbf\xe3\x81\x9a\xe3\x81\x8c\xe3\x82\x81\xe5\xba\xa7\xe3\x81\xae\xe6\x99\x82\xe4\xbb\xa3\xe3\x81\xae\xe5\xb9\x95\xe9\x96\x8b\xe3\x81\x91\xe3\x80\x82\n'
>>> istext('/tmp/japanese') # works on UTF-8
True

It has the downsides of not being portable to Windows (unless you have something like the file command there), and having to spawn an external process for each file, which might not be palatable.

Jacob Gabrielson
+1  A: 

If you're not on Windows, you can use Python Magic to determine the file type, then check whether it is a text/* MIME type.

Kamil Kisiel
A: 

Try this:

def is_binary(filename):
    """Return true if the given filename is binary.
    @raise EnvironmentError: if the file does not exist or cannot be accessed.
    @attention: found @ http://bytes.com/topic/python/answers/21222-determine-file-type-binary-text on 6/08/2010
    @author: Trent Mick <[email protected]>
    @author: Jorge Orpinel <[email protected]>"""
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while True:
            chunk = fin.read(CHUNKSIZE)
            if b'\0' in chunk:  # found a null byte (chunk is bytes, file is opened in 'rb' mode)
                return True
            if len(chunk) < CHUNKSIZE:
                break  # reached end of file
    # Look, Python doesn't need an "except:" clause here. How clever.
    finally:
        fin.close()

    return False
Jorge Orpinel
-1 defines "binary" as containing a zero byte. Will classify UTF-16-encoded text files as "binary".
John Machin
Thanks @John Machin
Jorge Orpinel
@John Machin: Interestingly, `git diff` actually [works this way](http://git.kernel.org/?p=git/git.git;a=blob;f=xdiff-interface.c;h=e1e054e4d982de30d8a9c8c4109c6d62448f62a9;hb=HEAD#l240), and sure enough, it detects UTF-16 files as binary.
intuited
Hunh.. GNU `diff` also works this way. It has similar issues with UTF-16 files. `file` does correctly detect the same files as UTF-16 text. I haven't checked out `grep` 's code, but it too detects UTF-16 files as binary.
intuited
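One way to soften the UTF-16 problem raised in these comments is to look for a byte order mark before falling back to the null-byte test. This is only a sketch: the name `is_binary_bom_aware` is made up, and UTF-16 files written without a BOM will still be reported as binary.

```python
import codecs

# Byte order marks that identify a Unicode text file.
BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
)


def is_binary_bom_aware(path, chunk_size=1024):
    """Null-byte test that first accepts files starting with a Unicode BOM."""
    with open(path, "rb") as handle:
        chunk = handle.read(chunk_size)
        if chunk.startswith(BOMS):
            return False  # a BOM marks the file as Unicode text
        while chunk:
            if b"\0" in chunk:
                return True  # null byte found: treat as binary
            chunk = handle.read(chunk_size)
    return False
```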
A: 

I think the best solution is to use the guess_type function. It holds a list of known MIME types, and you can also register your own. Here is the script I wrote to solve my problem:

from mimetypes import add_type, guess_type
from os import listdir


class TextFileFinder(object):  # class name is illustrative; the original omitted it
    def __init__(self):
        self.__addMimeTypes()

    def __addMimeTypes(self):
        # Register extra extensions that should count as text.
        add_type("text/plain", ".properties")

    def __listDir(self, path):
        try:
            return listdir(path)
        except OSError:  # listdir raises OSError when the path is inaccessible
            print("The directory {0} could not be accessed".format(path))
            return []

    def getTextFiles(self, path):
        textFiles = []
        for name in self.__listDir(path):
            mime = guess_type(name)[0]  # None when the extension is unknown
            if mime is not None and mime.split("/")[0] == "text":
                textFiles.append(name)
        if not textFiles:
            print("No text files in directory: {0}".format(path))
        return textFiles

The code lives inside a class, as you can see from its structure, but you can adapt it to fit your own application. It is quite simple to use: the getTextFiles method returns a list of all the text files in the directory you pass as path.

Regards, Leonardo

Leonardo