How can I detect whether a file is binary or plain text?

Basically, my .NET app processes batch files and extracts data; however, I don't want to process binary files.

As a solution, I'm thinking about analysing the first X bytes of the file: if there are more unprintable characters than printable ones, the file should be treated as binary.

Is this the right way to do it? Is there any better implementation for this task?

+2  A: 

The Unix `file` command does this in a clever way. Of course, it does a lot more, but you can check the algorithm here and then build something specialized.


UPDATE: The link above seems to be broken. Try this.

Bruno Brant
Is this really applicable to a .Net app running on windows environment?
Moron
@Moron: yes, because `file` doesn't use OS-provided information to determine file type. It's just looking at BOM, magic numbers, content heuristics, etc as mentioned variously in the other answers.
Derrick Turk
@Derrick: What I meant was, does it detect files commonly found on Windows machines, say found on Windows Vista/ Windows 7? In any case, just pointing someone to the source code of 'file' is not really helpful.
Moron
@Moron: Sorry, but to provide a complete implementation of such algorithm would take a lot of time. `file` **is** system agnostic in its algorithms, although the source file is not. I think that anyone who can read C# can understand a bit of C code (since they are similar) so I thought you'd have no trouble finding the part of the source that was relevant to you. `file` is very reliable, and can tell you what you want (binary vs. plain-text) most accurately.
Bruno Brant
@Bruno: True, an off the shelf solution will be better than implementing on your own. Pointing to a unix C implementation of file does not help in that regard, though. If you noticed, I don't disagree strongly enough to give your answer a -1 :-)
Moron
A: 

You could regex the first X number of bytes, and give a valid match if all bytes are in a proper character class. But that might presuppose that you know the encoding.
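A minimal sketch of the regex idea, written in Python for brevity. It assumes the file is ASCII text, which is exactly the encoding caveat mentioned above; the names `TEXT_BYTES`, `looks_like_text`, and `sample_size` are made up for illustration.

```python
import re

# Assumption: ASCII text only. Allowed bytes are tab, LF, CR and
# printable ASCII (0x20-0x7E); other encodings need a different class.
TEXT_BYTES = re.compile(rb"[\t\n\r\x20-\x7e]*")

def looks_like_text(data: bytes, sample_size: int = 512) -> bool:
    """Return True if every byte in the first sample_size bytes is allowed."""
    return TEXT_BYTES.fullmatch(data[:sample_size]) is not None
```

The sample would typically come from something like `open(path, "rb").read(512)`. A single disallowed byte makes the match fail, so this is the strictest variant of the heuristic.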

Brent Arias
A: 

Detecting binary vs. text is a heuristic process. The one you are selecting is as good as any. You can apply at least the following rules:

  1. Select how many bytes maximum you are willing to test.
  2. Codes > 0x7f should not exist in a pure ASCII text file (text in UTF-8 or another non-ASCII encoding will contain them).
  3. Of the codes < 0x20, only a few generally appear in text, such as { 0x08, 0x09, 0x0a, 0x0c, 0x0d, 0x1a, 0x1b, ... }
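The three rules above could be sketched roughly as follows, in Python for brevity (the loop ports directly to a C# `byte[]`). The cap of 1024 bytes and the names `MAX_BYTES`, `ALLOWED_CONTROL`, and `is_ascii_text` are assumptions; the control-code set contains only the values listed in rule 3.

```python
MAX_BYTES = 1024  # rule 1: hypothetical cap on how much of the file to test

# Rule 3: control codes tolerated in text (only the ones listed above).
ALLOWED_CONTROL = frozenset({0x08, 0x09, 0x0A, 0x0C, 0x0D, 0x1A, 0x1B})

def is_ascii_text(data: bytes, max_bytes: int = MAX_BYTES) -> bool:
    """Apply the rules to the first max_bytes bytes of data."""
    for b in data[:max_bytes]:
        if b > 0x7F:                                # rule 2: beyond 7-bit ASCII
            return False
        if b < 0x20 and b not in ALLOWED_CONTROL:   # rule 3: unexpected control code
            return False
    return True
```

Note that this rejects UTF-8 text with accented characters; that is a consequence of rule 2, not of the sketch.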
Amardeep
+2  A: 

What exactly do you mean by binary? Is the 'Art of War' written in Chinese binary to you? What about a Japanese-English dictionary?

There is no 100% reliable way.

You would need to use some kind of heuristic.

Some options might be to look at:

If the above (especially file signatures and extensions) don't help, then try to guess based on the presence/absence of certain bytes (like you are doing).

Note: It is better to check extensions/signatures first, as you would only need to read a few bytes/file metadata and that would be pretty efficient as compared to actually reading the whole file.
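As a rough illustration of checking signatures and extensions first, here is a Python sketch. The signature table is a tiny sample, not a real signature database, and the extension allow-list, `quick_guess`, and its return values are hypothetical.

```python
import os

# Assumption: a handful of well-known magic numbers, for illustration only.
BINARY_SIGNATURES = (
    b"\x89PNG\r\n\x1a\n",  # PNG image
    b"%PDF-",              # PDF document
    b"PK\x03\x04",         # ZIP archive (also docx, jar, ...)
    b"MZ",                 # Windows executable (a text file could start with "MZ" too)
    b"\x7fELF",            # ELF executable
)
TEXT_EXTENSIONS = {".bat", ".cmd", ".txt"}  # hypothetical allow-list

def quick_guess(filename: str, head: bytes):
    """Return 'binary', 'text', or None when the cheap checks are inconclusive."""
    if head.startswith(BINARY_SIGNATURES):
        return "binary"
    ext = os.path.splitext(filename)[1].lower()
    if ext in TEXT_EXTENSIONS:
        return "text"
    return None
```

Only when `quick_guess` returns `None` would the more expensive byte-level heuristic need to run.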

Moron
This is the reason I asked the question :)
dr. evil
Reading metadata is too much, though: you need a signature database etc., and for my task that would be totally over-engineering it.
dr. evil
@dr. evil. A file extension check would not be reasonable? I consider that file metadata. Anyway, I guess you have enough info to get on with your work :-)
Moron
As you said, I think I've got enough info to start. It's a shame there is no easy-to-use .NET library for this purpose.
dr. evil
A: 

I think the best way of doing this is to take at most the first X bytes from the file (X could be 256, 512, etc.) and count the number of characters that are not used by ASCII files (the permitted ASCII codes are: 10, 13, 32-126). If you know for sure that the script is written in English, then no character can be outside of the mentioned set. If you are not sure about the language, then you may permit at most Y characters to be outside of the set (if X is 512, I would choose Y to be 8 or 10).
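A minimal Python sketch of this counting approach, using the X and Y values suggested above as defaults; the function name `is_probably_text` and its parameter names are invented for the example.

```python
def is_probably_text(data: bytes, x: int = 512, y: int = 8) -> bool:
    """Allow at most y bytes outside {10, 13, 32-126} in the first x bytes."""
    sample = data[:x]
    outside = sum(1 for b in sample if not (b in (10, 13) or 32 <= b <= 126))
    return outside <= y
```

Setting `y=0` gives the strict "English only" behaviour, while a small positive `y` tolerates the occasional accented character.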

If this is not good enough, you may add more constraints. For example, depending on the syntax of the files, certain keywords should be present (e.g. for your batch files, there should be some echo, for, if, goto, call, exit, etc.).
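The keyword constraint might look like the following Python sketch. The keyword list is taken from the examples above; the `min_hits` threshold and the names `BATCH_KEYWORDS` and `has_batch_keywords` are assumptions.

```python
import re

# Batch keywords from the examples above, matched as whole words, case-insensitively.
BATCH_KEYWORDS = re.compile(rb"\b(echo|for|if|goto|call|exit)\b", re.IGNORECASE)

def has_batch_keywords(data: bytes, min_hits: int = 1) -> bool:
    """Return True if at least min_hits keyword occurrences appear in data."""
    return len(BATCH_KEYWORDS.findall(data)) >= min_hits
```

Short words such as `if` can appear by chance in binary data, so this probably works best combined with the byte-count check rather than on its own.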

botismarius