How can I determine if a file is binary or text in c#?

+6 Q:

How can I determine if a file is binary or text in c#?

I need to determine, with roughly 80% accuracy, whether a file is binary or text. Is there any way to do it in C#, even a quick and dirty/ugly one?

+1 A:

Quick and dirty is to use the file extension and look for common text extensions such as .txt. For this, you can use the Path.GetExtension call. Anything else would not really be classed as "quick", though it may well be dirty.
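A minimal sketch of the extension approach, for illustration only. The class name, method name, and the list of extensions are assumptions made up for this example; only Path.GetExtension comes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ExtensionCheck
{
    // Hypothetical, deliberately short list; extend it to suit your own files.
    private static readonly HashSet<string> TextExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".txt", ".csv", ".xml", ".html", ".cs", ".log"
        };

    // Returns true when the file name ends in one of the extensions listed above.
    public static bool LooksLikeTextByExtension(string path)
    {
        // Path.GetExtension returns the extension with its leading dot,
        // or an empty string when the name has no extension.
        string extension = Path.GetExtension(path);
        return !string.IsNullOrEmpty(extension) && TextExtensions.Contains(extension);
    }
}
```

The comparison ignores case, so "report.TXT" and "report.txt" are treated alike. A file with no extension, or an unlisted one, is reported as not text.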

Jeff Yates 2009-05-26 14:10:20

Sometimes guys like me can change the extension of a binary file to .txt

Kirtan 2009-05-26 14:11:48

Obviously, but he asked for cheap and dirty - there's no foolproof way but to ask a person to read it.

Jeff Yates 2009-05-26 14:19:27

That's good, but unfortunately I'm not dealing with common extensions. I'm writing a kind of list of all files and need to categorize them as binary or text. Most people do it by hand, but as I am lazy I prefer to write code.

pablito 2009-05-26 14:31:31

+3 A:

I would probably look for an abundance of control characters, which would typically be present in a binary file but rarely in a text file. Binary files tend to use 0 enough that just testing for many 0 bytes would probably be sufficient to catch most files. If you care about localization you'd need to test multi-byte patterns as well.

As stated though, you can always be unlucky and get a binary file that looks like text or vice versa.
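A rough sketch of the control-character idea. The 8192-byte sample size and the 5% threshold are arbitrary assumptions, not values from the answer, and the class and method names are made up for this example.

```csharp
using System;
using System.IO;

public static class ControlCharCheck
{
    // Returns true when the share of control bytes (below 0x20, excluding
    // tab, line feed and carriage return) in the first `length` bytes of
    // `sample` exceeds `threshold`.
    public static bool SampleLooksBinary(byte[] sample, int length, double threshold = 0.05)
    {
        if (length == 0) return false; // assumption: treat an empty file as text

        int suspicious = 0;
        for (int i = 0; i < length; i++)
        {
            byte b = sample[i];
            bool isCommonWhitespace = b == 9 || b == 10 || b == 13;
            if (b < 32 && !isCommonWhitespace) suspicious++;
        }
        return (double)suspicious / length > threshold;
    }

    // Reads up to `sampleSize` bytes from the start of the file and tests them.
    public static bool FileLooksBinary(string path, int sampleSize = 8192, double threshold = 0.05)
    {
        byte[] buffer = new byte[sampleSize];
        using (FileStream stream = File.OpenRead(path))
        {
            // Read may return fewer bytes than requested; only those are examined.
            int read = stream.Read(buffer, 0, buffer.Length);
            return SampleLooksBinary(buffer, read, threshold);
        }
    }
}
```

Note that UTF-16 text made up mostly of ASCII characters is about half zero bytes, so this check would flag it as binary. That is the multi-byte caveat mentioned above.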

Ron Warholic 2009-05-26 14:16:45

Thanks, I looked for 4 consecutive nulls ("\0\0\0\0"). Binary files seem to have a lot of them, so I tested it on 50 random files and it works.
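For reference, a check for a run of nulls could look roughly like the following. This is a sketch, not the code actually used above; the names are hypothetical, and the run length of 4 is taken from the comment.

```csharp
using System;

public static class NullRunCheck
{
    // Returns true when `data` contains at least `runLength` consecutive zero bytes.
    public static bool HasNullRun(byte[] data, int runLength = 4)
    {
        int current = 0;
        foreach (byte b in data)
        {
            // Extend the current run on a zero byte, otherwise reset it.
            current = (b == 0) ? current + 1 : 0;
            if (current >= runLength) return true;
        }
        return false;
    }
}
```

It could be called as NullRunCheck.HasNullRun(File.ReadAllBytes(path)), or on just the first few kilobytes of a large file.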

pablito 2009-05-26 14:42:46

+8 A:

There's a method called Markov chains. Scan a few model files of both kinds and, for each byte value from 0 to 255, gather stats (basically the probability) of each subsequent value. This will give you a profile you can compare your runtime files against.

This is how browsers' Auto-Detect Encoding feature works.
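A very small sketch of what such a profile might look like, assuming a first-order model (each byte depends only on the previous one) and add-one smoothing for transitions never seen in training. The class names, the smoothing, and the scoring rule are choices made for this example; real detectors are considerably more elaborate.

```csharp
using System;

// First-order byte model: estimates P(next byte | current byte) from samples.
public sealed class ByteMarkovModel
{
    private readonly int[,] counts = new int[256, 256];
    private readonly int[] rowTotals = new int[256];

    // Adds every adjacent byte pair in `sample` to the transition counts.
    public void Train(byte[] sample)
    {
        for (int i = 0; i + 1 < sample.Length; i++)
        {
            counts[sample[i], sample[i + 1]]++;
            rowTotals[sample[i]]++;
        }
    }

    // Average log-probability per transition in `data`. Add-one smoothing
    // keeps unseen transitions from producing log(0).
    public double AverageLogLikelihood(byte[] data)
    {
        if (data.Length < 2) return 0.0;

        double sum = 0.0;
        for (int i = 0; i + 1 < data.Length; i++)
        {
            double p = (counts[data[i], data[i + 1]] + 1.0) / (rowTotals[data[i]] + 256.0);
            sum += Math.Log(p);
        }
        return sum / (data.Length - 1);
    }
}

public static class MarkovClassifier
{
    // Calls the data text when the text profile explains it at least as well
    // as the binary profile does.
    public static bool LooksLikeText(ByteMarkovModel textModel, ByteMarkovModel binaryModel, byte[] data)
    {
        return textModel.AverageLogLikelihood(data) >= binaryModel.AverageLogLikelihood(data);
    }
}
```

Train one model on known text files and the other on known binary files, then pass a sample of the unknown file to LooksLikeText. The result is only as good as the training files are representative.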

zvolkov 2009-05-26 14:16:52

Thanks, I did something similar, I looked for a consecutive number of nulls.

pablito 2009-05-26 14:44:02

A really really really dirty way would be to build a regex that takes only standard text, punctuation, symbols, and whitespace characters, load up a portion of the file in a text stream, then run it against the regex. Depending on what qualifies as a pure text file in your problem domain, no successful matches would indicate a binary file.

To account for Unicode, make sure to set the matching encoding on your stream.

This is really suboptimal, but you said quick and dirty.
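One way this could be sketched, as a variation on the idea: the pattern below must match the whole sample, so a failed match suggests binary. The character classes chosen, the 4096-character sample size, and all names are assumptions for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class RegexTextCheck
{
    // Letters, marks, numbers, punctuation, symbols, space separators,
    // plus tab, carriage return and line feed. Nothing else is allowed.
    private static readonly Regex TextOnly =
        new Regex(@"\A[\p{L}\p{M}\p{N}\p{P}\p{S}\p{Zs}\t\r\n]*\z", RegexOptions.CultureInvariant);

    // Returns true when every character in `sample` is in the allowed set.
    public static bool LooksLikeText(string sample)
    {
        // U+FFFD is what decoders substitute for undecodable bytes. It is
        // classed as a symbol, so it is rejected explicitly here.
        return sample.IndexOf('\uFFFD') < 0 && TextOnly.IsMatch(sample);
    }

    // Decodes up to `maxChars` characters with the given encoding and tests them.
    public static bool FileLooksLikeText(string path, Encoding encoding, int maxChars = 4096)
    {
        char[] buffer = new char[maxChars];
        using (var reader = new StreamReader(path, encoding))
        {
            int read = reader.Read(buffer, 0, buffer.Length);
            return LooksLikeText(new string(buffer, 0, read));
        }
    }
}
```

For example, RegexTextCheck.FileLooksLikeText(path, Encoding.UTF8) tests the start of a file decoded as UTF-8. What counts as "text" is entirely defined by the character classes in the pattern, so adjust them to your problem domain.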

Chad Ruppert 2009-05-26 14:24:16

http://codesnipers.com/?q=node/68 describes how to detect UTF-16 vs. UTF-8 using a Byte Order Mark (which may appear in your file). It also suggests looping through some bytes to see if they conform to the UTF-8 multi-byte sequence pattern (below) to determine if your file is a text file.

0xxxxxxx ASCII < 0x80 (128)
110xxxxx 10xxxxxx 2-byte >= 0x80
1110xxxx 10xxxxxx 10xxxxxx 3-byte >= 0x800
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx 4-byte >= 0x10000
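The BOM check and the sequence pattern above can be sketched roughly as follows. This is not the code from the linked article; the names are made up, and the validation is structural only (it checks lead and continuation bytes but does not reject overlong encodings or surrogate code points).

```csharp
using System;

public static class Utf8Check
{
    // Returns the encoding named by a leading byte order mark, or null if none.
    public static string DetectBom(byte[] data)
    {
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) return "UTF-8";
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE) return "UTF-16LE";
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF) return "UTF-16BE";
        return null;
    }

    // Returns true when every byte fits the UTF-8 lead/continuation pattern.
    public static bool IsStructurallyValidUtf8(byte[] data)
    {
        int i = 0;
        while (i < data.Length)
        {
            byte b = data[i];
            int continuation;
            if ((b & 0x80) == 0x00) continuation = 0;      // 0xxxxxxx
            else if ((b & 0xE0) == 0xC0) continuation = 1; // 110xxxxx
            else if ((b & 0xF0) == 0xE0) continuation = 2; // 1110xxxx
            else if ((b & 0xF8) == 0xF0) continuation = 3; // 11110xxx
            else return false; // stray continuation byte or invalid lead byte

            // A sequence cut off by the end of the data counts as invalid.
            if (i + continuation >= data.Length) return false;

            for (int k = 1; k <= continuation; k++)
            {
                if ((data[i + k] & 0xC0) != 0x80) return false; // must be 10xxxxxx
            }
            i += continuation + 1;
        }
        return true;
    }
}
```

Two caveats. If you test only the first N bytes of a file, the sample may end in the middle of a multi-byte sequence and be rejected. And any data made up solely of bytes below 0x80, nulls included, passes this check, so it is best combined with a control-character test.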

foson 2009-05-26 14:46:13

This works if the file is guaranteed to be UTF-8/16 or binary. But what if it is neither? What if it is a text file encoded in neither ASCII nor UTF-8/16? What if it is encoded in the Big5 code page? Or ISO-8859-1? These have no BOM. So... how to cover that case as well?

Cheeso 2009-05-26 14:56:16
