views:

310

answers:

3

I have a file watcher that is grabbing content from a growing file encoded with utf-16LE. The first bit of data written to it has the BOM available -- I was using this to identify the encoding against UTF-8 (which MOST of my files coming in are encoded in). I catch the BOM and re-encode to UTF-8 so my parser doesn't freak out. The problem is that since it's a growing file not every bit of data has the BOM in it.

Here's my question -- without prepending the BOM bytes to each set of data I have (because I don't have control on the source) can I can just look for null bytes that are inherent in UTF-16 \000, and then use that as my identifier instead of the BOM? Will this cause me headaches down the road?

My architecture involves a ruby web application logging the received data to a temporary file when my parser written in java picks it up.

Write now my identification/re-encoding code looks like this:

  // guess encoding if utf-16 then
  // convert to UTF-8 first
  try {
    FileInputStream fis = new FileInputStream(args[args.length-1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);

    if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
      String asString = new String(contents, "UTF-16");
      byte[] newBytes = asString.getBytes("UTF8");
      FileOutputStream fos = new FileOutputStream(args[args.length-1]);
      fos.write(newBytes);
      fos.close();
    }

    fis.close();
    } catch(Exception e) {
      e.printStackTrace();
  }

UPDATE

I want to support stuff like euros, em-dashes, and other characters as such. I modified the above code to look like this and it seems to pass all my tests for those characters:

  // guess encoding if utf-16 then
  // convert to UTF-8 first
  try {
    FileInputStream fis = new FileInputStream(args[args.length-1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);
    byte[] real = null;

    int found = 0;

    // if found a BOM then skip out of here... we just need to convert it
    if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
      found = 3;
      real = contents;

    // no BOM detected but still could be UTF-16
    } else {

      for(int cnt=0; cnt<10; cnt++) {
        if(contents[cnt] == (byte)0x00) { found++; };

        real = new byte[contents.length+2];
        real[0] = (byte)0xFF;
        real[1] = (byte)0xFE;

        // tack on BOM and copy over new array
        for(int ib=2; ib < real.length; ib++) {
          real[ib] = contents[ib-2];
        }
      }

    }

    if(found >= 2) {
      String asString = new String(real, "UTF-16");
      byte[] newBytes = asString.getBytes("UTF8");
      FileOutputStream fos = new FileOutputStream(args[args.length-1]);
      fos.write(newBytes);
      fos.close();
    }

    fis.close();
    } catch(Exception e) {
      e.printStackTrace();
  }

What do you all think?

+2  A: 

In general, you cannot identify the character encoding of a data stream with 100% accuracy. The best you can do is try to decode using a limited set of expected encodings, and then apply some heuristics to the decoded result to see if it "looks like" text in the expected language. (But any heuristic will give false positives and false negatives for certain data streams.) Alternatively, put a human in the loop to decide which decoding makes the most sense.

A better solution is to to redesign your protocol so that whatever is supplying the data has to also supply the encoding scheme used for the data. (And if you cannot, blame whoever is responsible for designing / implementing the system that cannot give you an encoding scheme!).

EDIT: from your comments on the question, the data files are being delivered via HTTP. In this case, you should arrange that your HTTP server snarfs the "content-type" header of the POST requests delivering the data, extract the character set / encoding from the header, and save it in a way / place that your file parser can deal with.

Stephen C
A: 

This will cause you headaches down the road, no doubt about it. You can check for alternating zero bytes for the simplistic cases (ASCII only, UTF-16, either byte order) but the minute you start getting a stream of characters above the 0x7f code point, that method becomes useless.

If you have the file handle, the best bet is to save the current file pointer, seek to the start, read the BOM then seek back to the original position.

Either that or remember the BOM somehow.

Relying on the data contents is a bad idea unless you're absolutely certain the character range will be restricted for all inputs.

paxdiablo
Relying on the BOM is a worse idea unless you're absolutely certain that the file will have one.
dan04
"The first bit of data written to it has the BOM available" was in the question so I _was_ absolutely certain :-)
paxdiablo
A: 

This question contains a few options for character detection which don't appear to require a BOM.

My project is currently using jCharDet but I might need to look at some of the other options listed there as jCharDet is not 100% reliable.

jwaddell
@jwaddell: No character detection scheme is going to be 100% reliable.
Stephen C