I have a file watcher that is grabbing content from a growing file encoded with utf-16LE. The first bit of data written to it has the BOM available -- I was using this to identify the encoding against UTF-8 (which MOST of my files coming in are encoded in). I catch the BOM and re-encode to UTF-8 so my parser doesn't freak out. The problem is that since it's a growing file not every bit of data has the BOM in it.
Here's my question -- without prepending the BOM bytes to each set of data I have (because I don't have control on the source) can I can just look for null bytes that are inherent in UTF-16 \000, and then use that as my identifier instead of the BOM? Will this cause me headaches down the road?
My architecture involves a ruby web application logging the received data to a temporary file when my parser written in java picks it up.
Write now my identification/re-encoding code looks like this:
// guess encoding if utf-16 then
// convert to UTF-8 first
try {
FileInputStream fis = new FileInputStream(args[args.length-1]);
byte[] contents = new byte[fis.available()];
fis.read(contents, 0, contents.length);
if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
String asString = new String(contents, "UTF-16");
byte[] newBytes = asString.getBytes("UTF8");
FileOutputStream fos = new FileOutputStream(args[args.length-1]);
fos.write(newBytes);
fos.close();
}
fis.close();
} catch(Exception e) {
e.printStackTrace();
}
UPDATE
I want to support stuff like euros, em-dashes, and other characters as such. I modified the above code to look like this and it seems to pass all my tests for those characters:
// guess encoding if utf-16 then
// convert to UTF-8 first
try {
FileInputStream fis = new FileInputStream(args[args.length-1]);
byte[] contents = new byte[fis.available()];
fis.read(contents, 0, contents.length);
byte[] real = null;
int found = 0;
// if found a BOM then skip out of here... we just need to convert it
if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
found = 3;
real = contents;
// no BOM detected but still could be UTF-16
} else {
for(int cnt=0; cnt<10; cnt++) {
if(contents[cnt] == (byte)0x00) { found++; };
real = new byte[contents.length+2];
real[0] = (byte)0xFF;
real[1] = (byte)0xFE;
// tack on BOM and copy over new array
for(int ib=2; ib < real.length; ib++) {
real[ib] = contents[ib-2];
}
}
}
if(found >= 2) {
String asString = new String(real, "UTF-16");
byte[] newBytes = asString.getBytes("UTF8");
FileOutputStream fos = new FileOutputStream(args[args.length-1]);
fos.write(newBytes);
fos.close();
}
fis.close();
} catch(Exception e) {
e.printStackTrace();
}
What do you all think?