I have a large set of MIME files, which contain multiple parts. Many of the files contain parts labelled with the following headers:

Content-Type: application/octet-stream

Content-Transfer-Encoding: Binary

However, sometimes the contents of these parts are some form of binary code, and sometimes they are plaintext.

Is there a clever way in C++, Bash or Ruby to detect whether the contents of a MIME part labelled as application/octet-stream are binary data or plaintext?

Thanks, Rik

+1  A: 

The -I option makes grep treat a binary file as if it contained no match. Combined with the -q option, grep prints nothing and returns a nonzero exit status if the file is binary.

if grep -qI -e '' -- "$file"
then
        echo "plaintext"   # every line matches the empty pattern
else
        echo "binary"      # no match: binary data (or an empty file)
fi
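
One edge case with this test is that an empty file has no lines to match, so grep reports it the same way as a binary file. A minimal sketch of a wrapper that handles this, assuming a grep that supports -I (GNU and BSD grep do). The function name is_text is hypothetical, and treating an empty file as text is an arbitrary choice:

```shell
# is_text FILE
# Hypothetical helper: exits 0 if FILE looks like plain text, nonzero otherwise.
is_text() {
    # An empty file contains no lines, so grep would find no match and the
    # file would be classified as binary. Treat it as text instead
    # (an assumption made for this sketch).
    [ -s "$1" ] || return 0

    # -I: treat binary files as non-matching; -q: no output, status only.
    # The empty pattern matches every line of a text file.
    grep -qI -e '' -- "$1"
}
```

It could be used as `if is_text "$file"; then echo plaintext; else echo binary; fi`. Note that what grep considers binary can depend on the current locale, since bytes that are not valid in the locale's encoding may also count as binary data.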
Bart Sas
Thanks for the reply.
RikSaunderson
It's not the whole file that is binary, only a portion of it. We know that most of the file is plain text. The MIME files consist of some metadata followed by some content parts. The content parts have the headers listed above, and are sometimes plain text, sometimes binary and sometimes HTTP.
RikSaunderson
A: 

The simplest method is to split each MIME file into separate files, one per component part. We can then run grep and other tools on each part to determine whether it is text or binary.
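
A rough sketch of the splitting step, under several assumptions: the boundary string is already known (taken from the top-level Content-Type header), there are no nested multipart sections, and the awk in use copes with the bytes in the binary parts (gawk is binary-safe; some other implementations may mishandle NUL bytes). The function name split_parts and the part-NNN file names are hypothetical:

```shell
# split_parts FILE BOUNDARY OUTDIR
# Hypothetical helper: writes each MIME part of FILE (its headers and body)
# to OUTDIR/part-001, OUTDIR/part-002, ...
# BOUNDARY is the boundary string without the leading "--".
split_parts() {
    mkdir -p "$3" || return 1
    LC_ALL=C awk -v b="--$2" -v dir="$3" '
        # Compare against a copy with any trailing CR removed, so that
        # CRLF-terminated delimiter lines are recognised as well.
        { line = $0; sub(/\r$/, "", line) }

        # Closing delimiter: everything after it is epilogue, so stop.
        line == (b "--") { exit }

        # Delimiter: close the previous part and start the next one.
        line == b {
            if (out) close(out)
            out = sprintf("%s/part-%03d", dir, ++n)
            next
        }

        # Any other line belongs to the current part (the preamble before
        # the first delimiter is skipped because out is still empty).
        out { print > out }
    ' "$1"
}
```

After `split_parts message.mime "$boundary" parts`, each file in `parts/` can be checked individually, for example with `grep -qI -e '' -- "$part"`. Because each part file still contains its own (textual) headers, a part is only reported as text when its body is text too.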

RikSaunderson