views:

69

answers:

7

Hello all.

I wrote the following method to see whether particular file contains ASCII text characters only or control characters in addition to that. Could you glance at this code, suggest improvements and point out oversights?

The logic is as follows: "If first 500 bytes of a file contain 5 or more Control characters - report it as binary file"

thank you.

public boolean isAsciiText(String fileName) throws IOException {

    InputStream in = new FileInputStream(fileName);
    byte[] bytes = new byte[500];

    in.read(bytes, 0, bytes.length);
    int x = 0;
    short bin = 0;

    for (byte thisByte : bytes) {
        char it = (char) thisByte;
        if (!Character.isWhitespace(it) && Character.isISOControl(it)) {

            bin++;
        }
        if (bin >= 5) {
            return false;
        }
        x++;
    }
    in.close();
    return true;
}
+2  A: 

x doesn't appear to do anything.

What if the file is less than 500 bytes?

Some binary files have a situation where you can have a header for the first N bytes of the file which contains some data that is useful for an application but that the library the binary is for doesn't care about. You could easily have 500+ bytes of ASCII in a preamble like this followed by binary data in the following gigabyte.

Should handle exception if the file can't be opened or read, etc.

Dave
+1  A: 

The first thing I noticed - unrelated to your actual question, but you should be closing your input stream in a finally block to ensure it's always done. Usually this merely handles exceptions, but in your case you won't even close the streams of files when returning false.

Asides from that, why the comparison to ISO control characters? That's not a "binary" file, that's a "file that contains 5 or more control characters". A better way to approach the situation in my opinion, would be to invert the check - write an isAsciiText function instead which asserts that all the characters in the file (or in the first 500 bytes if you so wish) are in a set of bytes that are known good.

Theoretically, only checking the first few hundred bytes of a file could get you into trouble if it was a composite file of sorts (e.g. text with embedded pictures), but in practice I suspect every such file will have binary header data at the start so you're probably OK.

Andrzej Doyle
A: 
  1. You ignore what read() returns, what if the files is shorter than 500 bytes?
  2. When you return false, you don't close the file.
  3. When converting byte to char, you assume your file is 7-bit ASCII.
unbeli
+1  A: 

Since you call this class "isASCIIText", you know exactly what you're looking for. In other words, it's not "isTextInCurrentLocaleEncoding". Thus you can be more accurate with:

if (thisByte < 32 || thisByte > 127) bin++;
Pointy
A: 

This would not work with the jdk install packages for linux or solaris. they have a shell-script start and then a bi data blob.

why not check the mime type using some library like jMimeMagic (http://http://sourceforge.net/projects/jmimemagic/) and deside based on the mimetype how to handle the file.

Nikolaus Gradwohl
A: 
  1. Fails badly if file size is less than 500 bytes

  2. The line char it = (char) thisByte; is conceptually dubious, it mixes bytes and chars concepts, ie. assumes implicitly that the encoding is one-byte=one character (them, it excludes unicode encodings). In particular, it fails if the file is UTF-16 encoded.

  3. The return inside the loop (slightly bad practice IMO) forgets to close the file.

leonbloy
A: 

Why do you need to know this?

If you're not sure what a file contains, process it as bytes, with InputStreams and OutputStreams. If you know by construction with 100% certainty that the file must only contain valid characters, process it as characters with Readers and Writers.

EJP