Possible Duplicate:
linux + verify if file is binary or text

Hi all,

How can I verify whether a file is binary or text without opening the file?

Remark: the solution must be 100% reliable.

lidia

+2  A: 

There is no way of being certain without looking inside the file. However, you don't have to open it in an editor and inspect it yourself to get a clue. You may want to look into the file command: http://linux.die.net/man/1/file

René Wolferink
+5  A: 

Schrödinger's cat, I'm afraid.

There is no way to determine the contents of a file without opening it. The filesystem stores no metadata relating to the contents.
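
To illustrate the point about metadata, here is a minimal Python sketch of what stat(2) can tell you without reading any file data. The helper name describe_metadata and the sample file name are my own, purely for illustration. Note that nothing in the result says whether the bytes are text or binary.

```python
import os
import stat
import tempfile


def describe_metadata(path):
    """Return what the filesystem records about `path` (hypothetical helper).

    os.stat() reads the inode only; the file's data is never read.
    """
    info = os.stat(path)
    return {
        "size_bytes": info.st_size,
        "is_regular_file": stat.S_ISREG(info.st_mode),
        "permissions": stat.filemode(info.st_mode),
        "modified": info.st_mtime,
    }


if __name__ == "__main__":
    # A text file with a misleading .jpg name.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file.jpg")
        with open(path, "wb") as handle:
            handle.write(b"This is not a pipe\n")
        print(describe_metadata(path))
```

Size, type of directory entry, permissions and timestamps are all there; the nature of the contents is not.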

If not opening the file is not a hard requirement, then there are a number of solutions available to you.

Edit:

It has been suggested in a number of comments and answers that file(1) is a good way of determining the contents. Indeed it is. However, file(1) opens the file, which was prohibited in the question. See the penultimate line in the following example:

> echo 'This is not a pipe' > file.jpg && strace file file.jpg 2>&1 | grep file.jpg
execve("/usr/bin/file", ["file", "file.jpg"], [/* 56 vars */]) = 0
lstat64("file.jpg", {st_mode=S_IFREG|0644, st_size=19, ...}) = 0
stat64("file.jpg", {st_mode=S_IFREG|0644, st_size=19, ...}) = 0
open("file.jpg", O_RDONLY|O_LARGEFILE)  = 3
write(1, "file.jpg: ASCII text\n", 21file.jpg: ASCII text
Johnsyweb
The Unix file command does a good job of heuristically determining the type.
Joel
@Joel: Yes it does. It also opens the file.
Johnsyweb
The question is too vague to know if "open" means open(2). "Open" has other connotations.
camh
True enough, @camh. I take "open" to mean "examine the contents of" the file. Perhaps @lidia is interested in knowing whether a file on which the user has no read permission is text or binary. file and similar tools would be of no use there.
Johnsyweb
+1  A: 

If you are attempting to do this from a command shell, then the file command will take a guess at what file type it is. If it is text, the description will generally include the word "text".

I am not aware of any 100% reliable method of determining this, but the file command is probably the most accurate.
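
If you want to script this, a rough Python sketch could look like the following. It assumes the file utility is installed and on your PATH; the function names describe and description_says_text are made up for this example. It relies only on the word "text" appearing in the description, so it is exactly as reliable as file's guess, and it does open the file.

```python
import re
import subprocess


def describe(path):
    """Return the description printed by `file --brief` for `path`.

    Assumes the file(1) utility is installed. --brief omits the file name,
    so a name containing "text" cannot confuse the check below.
    """
    result = subprocess.run(
        ["file", "--brief", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def description_says_text(description):
    """True if the description contains the lowercase word 'text'."""
    return re.search(r"\btext\b", description) is not None


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        summary = describe(name)
        print(name, "->", summary, "| text?", description_says_text(summary))
```

The match is deliberately case-sensitive, so a description that merely mentions a capitalised format name containing "Text" is not counted.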

Steve Weet
Of course that opens the file, and won't be 100% certain.
Douglas Leeder
Indeed it does, although I wasn't sure whether the asker was averse to opening the file personally or to having a utility open it. I have stated that there is no 100% certain method of doing this.
Steve Weet
+1  A: 

In Unix, a file is just a sequence of bytes. So, without opening the file, you cannot be 100% sure whether it is ASCII or binary.

You can only use the tools available to you and dig deeper to make the check as foolproof as possible.

  1. file
  2. cat -v
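
As a rough idea of what "digging deeper" can mean, here is a minimal Python sketch of a common heuristic: read the first few kilobytes, treat NUL bytes as a sign of binary data, and otherwise check whether the sample is valid UTF-8. This is not what file does internally; the function name looks_like_text and the 8192-byte sample size are my own assumptions. It opens the file and it can be wrong (UTF-16 text, for example, contains NUL bytes).

```python
import codecs
import os
import tempfile


def looks_like_text(path, sample_size=8192):
    """Guess whether `path` holds text by inspecting its first bytes.

    Heuristic only: an empty file counts as text, and UTF-16 text is
    reported as binary because it contains NUL bytes.
    """
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    if b"\x00" in sample:
        return False
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the
        # end of the sample instead of treating it as invalid data.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    samples = [("notes.txt", b"plain text\n"), ("blob.bin", b"\x00\x01\x02\xff")]
    with tempfile.TemporaryDirectory() as tmp:
        for name, payload in samples:
            path = os.path.join(tmp, name)
            with open(path, "wb") as handle:
                handle.write(payload)
            print(name, looks_like_text(path))
```

Text in a legacy single-byte code page will usually fail the UTF-8 check, so treat a False result as "not obviously text" rather than "definitely binary".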
zengr
+2  A: 

The correct way to determine the type of a file is to use the file(1) command.

You also need to be aware that UTF-8 encoded files are "text" files, but may contain non-ASCII data. Other encodings also have this issue. In the case of text encoded with a code page, it may not be possible to unambiguously determine if a file is text or not.
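
A small Python sketch of why this is ambiguous (the helper decodes_as is just an illustrative name): UTF-8 text contains bytes outside the ASCII range yet still validates as UTF-8, whereas a single-byte encoding such as Latin-1 accepts every possible byte sequence, so "it decodes" proves nothing for code pages.

```python
def decodes_as(data, encoding):
    """Return True if `data` is a valid byte sequence in `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    utf8_text = "caf\u00e9".encode("utf-8")
    # Text, but not ASCII: the accented letter becomes bytes above 0x7F.
    print("ascii:", decodes_as(utf8_text, "ascii"))
    print("utf-8:", decodes_as(utf8_text, "utf-8"))

    # Every one of the 256 byte values is a valid Latin-1 character,
    # so arbitrary binary data "passes" as Latin-1 text.
    every_byte = bytes(range(256))
    print("latin-1:", decodes_as(every_byte, "latin-1"))

    # The same byte means different characters in different code pages.
    print(b"\xe9".decode("latin-1"), b"\xe9".decode("cp437"))
```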

The file(1) command will look at the structure of a file to try and determine what it contains - from the file(1) man page:

The type printed will usually contain one of the words text (the file contains only printing characters and a few common control characters and is probably safe to read on an ASCII terminal), executable (the file contains the result of compiling a program in a form understandable to some UNIX kernel or another), or data meaning anything else (data is usually ‘binary’ or non-printable).

With regard to different character encodings, the file(1) man page has this to say:

If a file does not match any of the entries in the magic file, it is examined to see if it seems to be a text file. ASCII, ISO-8859-x, non- ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set. If a file passes any of these tests, its character set is reported. ASCII, ISO-8859-x, UTF-8, and extended-ASCII files are identified as ‘text’ because they will be mostly readable on nearly any terminal; UTF-16 and EBCDIC are only ‘character data’ because, while they contain text, it is text that will require translation before it can be read.

So, some text will be identified as text, but some may be identified as character data. You will need to determine yourself if this matters to your application and take appropriate action.
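
As one possible "appropriate action", here is a hedged Python sketch that handles the UTF-16 case by looking for a byte order mark and translating the data into a string. The function name decode_if_utf16 is hypothetical, and this is only a sketch: UTF-16 files without a byte order mark are not detected, and a UTF-32 little-endian file starts with the same two bytes.

```python
import codecs


def decode_if_utf16(raw):
    """Return decoded text if `raw` starts with a UTF-16 byte order mark.

    Returns None when no mark is present. Malformed UTF-16 after the
    mark raises UnicodeDecodeError.
    """
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec consumes the mark and picks the byte order.
        return raw.decode("utf-16")
    return None


if __name__ == "__main__":
    print(decode_if_utf16("hello".encode("utf-16")))
    print(decode_if_utf16(b"hello"))
```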

camh