(Disclaimer: I looked at a number of posts on here before asking, I found this one particularly helpful, I was just looking for a bit of a sanity check from you folks if possible)

Hi All,

I have an internal Java product that I have built for processing data files and loading them into a database (AKA an ETL tool). I have pre-rolled stages for XSLT transformation and for things like pattern replacement within the original file. The input files can be of any format: they may be flat data files or XML data files, and you configure the stages you require for the particular data feed being loaded.

Up until now I have ignored the issue of file encoding (a mistake, I know), because everything was working fine, in the main. However, I am now coming up against file-encoding issues. To cut a long story short: because of the way stages can be chained together, I need to detect the encoding of the input file and create a Java Reader object with the appropriate arguments. I just wanted a quick sanity check with you folks before I dive into something I can't claim to fully comprehend:

  1. Adopt a standard file encoding of UTF-16 (I'm not ruling out loading double-byte characters in the future) for all files that are output from every stage within my toolkit
  2. Use JUniversalChardet or jchardet to sniff the input file encoding (a rough sketch of what I have in mind follows this list)
  3. Use the Apache Commons IO library to create a standard reader and writer for all stages (am I right in thinking this doesn't have a similar encoding-sniffing API?)
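
Roughly, what I have in mind for steps 2 and 3 is something like the sketch below (untested, class name made up; falling back to windows-1252 is just my assumption, since that is what the runtime currently defaults to for us):

    import java.io.*;
    import java.nio.charset.Charset;
    import org.mozilla.universalchardet.UniversalDetector;

    public class SniffingReaderFactory {
        // Untested sketch: sniff the encoding with juniversalchardet and fall back
        // to windows-1252 (my current de-facto default) when nothing is detected.
        public static Reader open(File file) throws IOException {
            UniversalDetector detector = new UniversalDetector(null);
            byte[] buf = new byte[4096];
            try (InputStream in = new FileInputStream(file)) {
                int read;
                while ((read = in.read(buf)) > 0 && !detector.isDone()) {
                    detector.handleData(buf, 0, read);
                }
            }
            detector.dataEnd();
            String detected = detector.getDetectedCharset(); // may be null
            Charset charset = (detected != null)
                    ? Charset.forName(detected)
                    : Charset.forName("windows-1252"); // assumed fallback
            return new BufferedReader(
                    new InputStreamReader(new FileInputStream(file), charset));
        }
    }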

Do you see any pitfalls/have any extra wisdom to offer in my outlined approach?

Is there any way I can be confident of backwards compatibility with any data already loaded using my existing approach of letting the Java runtime fall back to its default encoding (windows-1252 on our systems)?

Thanks in advance,

-James

+1  A: 

Option 1 strikes me as breaking backwards compatibility (certainly in the long run), although it is the "right way" to go (the right way generally does break backwards compatibility). It may be worth giving some additional thought to whether UTF-8 would be a good choice instead.

Sniffing the encoding strikes me as reasonable if you have a limited, known set of encodings, and you have tested that your sniffer correctly distinguishes and identifies them.

Another option is to use some form of meta-data (a file-naming convention, if nothing more robust is an option) that lets your code know the data was provided as UTF-16 and behave accordingly; otherwise, convert it to UTF-16 before moving forward.
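
As a rough illustration (the suffix convention here is entirely made up - you would pick your own), the file name itself could carry the encoding:

    import java.nio.charset.Charset;

    // Sketch: a made-up naming convention such as "orders.utf-16.xml" or
    // "orders.win1252.csv" tells the code how the file was encoded; anything
    // without a recognised suffix falls back to a configured default.
    static Charset charsetFromFileName(String fileName, Charset fallback) {
        String lower = fileName.toLowerCase();
        if (lower.contains(".utf-16")) {
            return Charset.forName("UTF-16");
        }
        if (lower.contains(".utf-8")) {
            return Charset.forName("UTF-8");
        }
        if (lower.contains(".win1252")) {
            return Charset.forName("windows-1252");
        }
        return fallback;
    }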

Yishai
Thanks, that's just the sort of sanity checking I am after. As far as backwards compatibility goes when using UTF-16 for all output files, I'm thinking I may have to bite the bullet and do a full regression test; shame on me for not addressing this in the first place. As an aside, I thought about using some meta-data/config, but I'm trying to keep the toolkit as config-light as possible (we've all drowned in config files at some point in our lives!)
James B
+2  A: 

With flat character data files, any encoding detection will need to rely on statistics and heuristics (like the presence of a BOM, or character/pattern frequency) because there are byte sequences that will be legal in more than one encoding, but map to different characters.
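
The BOM case, at least, is mechanical; newer versions of Commons IO ship a BOMInputStream you can point at the marks you care about (a sketch, not tested, with a made-up helper name):

    import java.io.*;
    import java.nio.charset.Charset;
    import org.apache.commons.io.ByteOrderMark;
    import org.apache.commons.io.input.BOMInputStream;

    // Sketch: detect and strip a UTF-8/UTF-16 byte order mark with Commons IO,
    // falling back to a caller-supplied charset when no BOM is present.
    static Reader openWithBomDetection(InputStream raw, Charset fallback) throws IOException {
        BOMInputStream in = new BOMInputStream(raw,
                ByteOrderMark.UTF_8, ByteOrderMark.UTF_16LE, ByteOrderMark.UTF_16BE);
        Charset charset = in.hasBOM()
                ? Charset.forName(in.getBOMCharsetName())
                : fallback;
        return new InputStreamReader(in, charset);
    }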

XML encoding detection should be more straightforward, but it is certainly possible to create ambiguously encoded XML (e.g. by leaving out the encoding in the header).
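
For the XML case you can usually just ask the parser what it decided, e.g. with StAX (again only a sketch, method name made up):

    import java.io.InputStream;
    import javax.xml.stream.*;

    // Sketch: let a StAX parser apply the XML encoding-detection rules (BOM,
    // first bytes, encoding pseudo-attribute) and report what it settled on.
    static String detectXmlEncoding(InputStream in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        String declared = reader.getCharacterEncodingScheme(); // from <?xml ... encoding="..."?>, may be null
        String detected = reader.getEncoding();                // what the parser auto-detected, may be null
        return declared != null ? declared : detected;
    }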

It may make more sense to use encoding detection APIs to indicate the probability of error to the user rather than rely on them as decision makers.

When you transform data from bytes to chars in Java, you are transcoding from encoding X to UTF-16(BE). What gets sent to your database depends on your database, its JDBC driver and how you've configured the column. That probably involves transcoding from UTF-16 to something else.

Assuming you're not altering the database, existing character data should be safe; you might run into issues if you intend to parse BLOBs. If you've already parsed files written in disparate encodings, but treated them as another encoding, the corruption has already taken place - there are no silver bullets to fix that. If you need to alter the character set of a database from "ANSI" to Unicode, that might get painful.
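
To make the corruption point concrete (a trivial demo; the character and class name are arbitrary):

    import java.nio.charset.Charset;

    public class MojibakeDemo {
        public static void main(String[] args) {
            // "\u00e9" is é: two bytes (0xC3 0xA9) in UTF-8, one byte (0xE9) in windows-1252.
            byte[] utf8Bytes = "\u00e9".getBytes(Charset.forName("UTF-8"));
            // Decoded with the right charset you get "é" back; with the wrong one
            // you silently get "Ã©", and once that is stored, no later re-encoding fixes it.
            System.out.println(new String(utf8Bytes, Charset.forName("UTF-8")));
            System.out.println(new String(utf8Bytes, Charset.forName("windows-1252")));
        }
    }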

Adoption of Unicode wherever possible is a good idea. It may not be possible, but prefer file formats where you can make encoding unambiguous - things like XML (which makes it easy) or JSON (which mandates UTF-8).

McDowell
Thanks, that's another good reply. Having slept on it, I reached a similar conclusion. I think I may be able to get away with (sort of) breaking backwards compatibility by adopting a standard intermediate file encoding, as long as the database load stays the same and loads the data in the same format (this will preserve any bugs in the existing data, but that constraint has been imposed from above, understandably, for business reasons). The main bulk of the code I'm going to have to write is regression-testing code, which I find is becoming more and more the case with the trend towards using more frameworks/libraries.
James B