How can I identify whether a file's content is ASCII or binary using C++?
Well, this depends on your definition of ASCII. You can either check for values with ASCII code <128 or for some charset you define (e.g. 'a'-'z','A'-'Z','0'-'9'...) and treat the file as binary if it contains some other characters.
You could also check for regular linebreaks (0x0A or 0x0D,0x0A) to detect text files.
The contents of every file are binary. So, knowing nothing else, you can't be sure.
ASCII is a matter of interpretation. If you open a binary file in a text editor, you see what I mean.
Most binary files contain a fixed header (per type) you can look for, or you can take the file extension as a hint. You can look for byte order marks if you expect UTF-encoded files, but they are optional as well.
Unless you define your question more closely, there can't be a definitive answer.
You iterate through it using a normal loop with stream.get(), and check whether the byte values you read are <= 127. One of many ways to do it:
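#include <cstdio>   // EOF
#include <fstream>

int main() {
    std::ifstream a("file.txt", std::ios::binary); // binary mode: no newline translation
    if (!a) return 1; // could not open the file
    int c;
    // Stop at end of file or at the first byte outside 0-127.
    while ((c = a.get()) != EOF && c <= 127)
        ;
    if (c == EOF) {
        /* file is all ASCII */
    }
}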
However, as someone mentioned, all files are binary files after all. Additionally, it's not clear what you mean by "ascii". If you mean the character code, then indeed this is the way to go. But if you mean only alphanumeric values, you would need a different approach.
Have a look at how the file command works; it has three strategies to determine the type of a file:
- filesystem tests
- magic number tests
- and language tests
Depending on your platform, and the possible files you're interested in, you can look at its implementation, or even invoke it.
My text editor decides on the presence of null bytes. In practice, that works really well: a binary file with no null bytes is extremely rare.
To check, you must open the file as binary. You can't open the file as text. ASCII is effectively a subset of binary. After that, you must check the byte values. ASCII has byte values 0-127, but 0-31 are control characters. TAB, CR and LF are the only common control characters. You can't (portably) use 'A' and 'Z'; there's no guarantee that the compiler's execution character set is ASCII (!). If you need them, you'll have to define them yourself:
const unsigned char ASCII_A = 0x41; // NOT 'A'
const unsigned char ASCII_Z = ASCII_A + 25;
If the question is genuinely how to detect just ASCII, then litb's answer is spot on. However if san was after knowing how to determine whether the file contains text or not, then the issue becomes way more complex. ASCII is just one - increasingly unpopular - way of representing text. Unicode encodings - UTF-16, UTF-32 and UTF-8 - have grown in popularity. In theory, they can be easily tested for by checking whether the file starts with the Unicode byte order mark (BOM), U+FEFF: in UTF-16 that is the two bytes 0xFE 0xFF (or 0xFF 0xFE if the byte order is reversed), and in UTF-8 it is the three bytes 0xEF 0xBB 0xBF. However, as a BOM screws up many file formats on Linux systems, it cannot be guaranteed to be there. Further, a binary file might happen to start with the same bytes.
Looking for 0x00's (or other control characters) won't help either if the file is Unicode. If the file is UTF-16, say, and contains English text, then every other byte will be 0x00.
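For what it's worth, a BOM check could be sketched like this. The enum and the function name detect_bom are made up for the example; the byte sequences are the standard encodings of U+FEFF. It takes the first few bytes of the file (read in binary mode) as a std::string:

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// `head` holds the first bytes of the file; four bytes are enough.
Bom detect_bom(const std::string& head) {
    auto starts_with = [&head](const char* sig, std::size_t n) {
        return head.size() >= n && head.compare(0, n, sig, n) == 0;
    };
    // UTF-32 LE is tested before UTF-16 LE because both begin with FF FE.
    if (starts_with("\x00\x00\xFE\xFF", 4)) return Bom::Utf32BE;
    if (starts_with("\xFF\xFE\x00\x00", 4)) return Bom::Utf32LE;
    if (starts_with("\xEF\xBB\xBF", 3))     return Bom::Utf8;
    if (starts_with("\xFE\xFF", 2))         return Bom::Utf16BE;
    if (starts_with("\xFF\xFE", 2))         return Bom::Utf16LE;
    return Bom::None;  // no BOM: says nothing about whether the file is text
}
```

A result of Bom::None does not mean the file is binary, and a match is only a hint: a UTF-16 LE file whose first character is U+0000 is indistinguishable from UTF-32 LE here, and a binary file can start with any of these bytes by coincidence.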
If you know the language that the text file will be written in, then it would be possible to analyse the bytes and statistically determine if it contains text or not. For example, the most common letter in English is E followed by T. So if the file contains lots more E's and T's than Z's and X's, it's likely text. Of course it would be necessary to test this as ASCII and as each of the Unicode encodings to make sure.
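As a very rough illustration of the E/T versus Z/X idea, here is a sketch that only handles the single-byte ASCII case. The function name looks_like_english and the factor of 4 are arbitrary assumptions on my part; a real detector would use full letter-frequency tables and would need tuning against real data:

```cpp
#include <cstddef>
#include <string>

// Crude heuristic: in English text, E and T should far outnumber Z and X.
// Assumes `bytes` is to be interpreted as single-byte ASCII.
bool looks_like_english(const std::string& bytes) {
    std::size_t common = 0;  // count of E/e and T/t
    std::size_t rare = 0;    // count of Z/z and X/x
    for (unsigned char b : bytes) {
        if (b >= 0x41 && b <= 0x5A) b += 0x20;       // fold ASCII A-Z to a-z
        if (b == 0x65 || b == 0x74) ++common;        // 'e' or 't'
        else if (b == 0x7A || b == 0x78) ++rare;     // 'z' or 'x'
    }
    // The threshold of 4 is an illustrative guess, not a measured value.
    return common > 0 && common > 4 * rare;
}
```

Random binary data tends to have all four letters at about the same frequency, so it fails the ratio test, but short inputs will give unreliable answers either way.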
If the file isn't written in English - or you want to support multiple languages - then the only two options left are to look at the file extension on Windows and to check the first four bytes against a database of "magic file" codes to determine the file's type and thus whether it contains text or not.
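The "magic file" lookup can be sketched as below. The table is a tiny hand-picked sample of well-known signatures, and the names Magic, kMagicTable and identify_magic are hypothetical; a real database such as the one behind the file command is far larger and also matches at offsets other than zero:

```cpp
#include <cstddef>
#include <string>

struct Magic {
    const char* bytes;   // signature expected at the start of the file
    std::size_t length;  // number of signature bytes
    const char* type;    // human-readable description
};

// Literals are used for readability, which assumes an ASCII-compatible
// execution character set.
const Magic kMagicTable[] = {
    {"\x89PNG",      4, "PNG image"},
    {"%PDF",         4, "PDF document"},
    {"PK\x03\x04",   4, "ZIP archive"},
    {"\x7F" "ELF",   4, "ELF executable"},
};

// Returns the matching description, or nullptr if nothing matches.
// `head` holds the first bytes of the file, read in binary mode.
const char* identify_magic(const std::string& head) {
    for (const Magic& m : kMagicTable) {
        if (head.size() >= m.length &&
            head.compare(0, m.length, m.bytes, m.length) == 0) {
            return m.type;
        }
    }
    return nullptr;
}
```

A nullptr result only means the type is unknown to the table, so you would still fall back on one of the other heuristics to guess whether the content is text.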
This question really has no right or wrong answer to it, just complex solutions that will not work for all possible text files.
Here is a link to an article on The Old New Thing about how Notepad detects the encoding of a text file. It's not perfect, but it's interesting to see how Microsoft handles it.