Hi Everyone!

I get files in different formats coming from different systems that I need to import into our database. Part of the import process is to check the line length to make sure the format is correct. We seem to be having issues with files coming from UNIX systems, where one character is added. I suspect this is due to the carriage return being encoded differently on UNIX and Windows platforms.

Is there a way to detect on which file system a file was created, other than checking the last character on the line? Or maybe a way of reading the files as text and not binary which I suspect is the issue?

Thanks Guys !

+3  A: 

Unix systems use \n line endings, while Windows uses \r\n and Mac uses \r. You cannot detect the originating system, because the line ending is not tied to it: I can use \n on Windows if my editor supports it, for example. It's just the convention on those OSes, not a requirement.

The proper way, assuming you don't have a function which tokenizes correctly no matter what line ending the file uses, is to search for a \n OR a \r, end the current line there, and strip every leading \r or \n from the remaining data before you begin the next line. However, this swallows blank lines, which is a problem if you need to keep them. In that case you have to look at line breaks more carefully:

  • when reading a \n, end the current line and start the next line
  • when reading a \r, end the current line and, if the next char is \n, skip it, and start the next line, otherwise start the new line immediately.
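The two rules above can be sketched in Java as follows. This is only an illustration, not code from the answer: `LineSplitter` and `splitLines` are hypothetical names, and the sketch assumes the whole input is already in memory as a `String`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of any library): splits text on \n, \r or \r\n
// and keeps blank lines.
final class LineSplitter {
    static List<String> splitLines(String text) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n') {
                // \n ends the current line.
                lines.add(current.toString());
                current.setLength(0);
            } else if (c == '\r') {
                // \r ends the current line as well.
                lines.add(current.toString());
                current.setLength(0);
                // Treat \r\n as a single terminator: skip the \n that follows.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
            } else {
                current.append(c);
            }
        }
        // Keep trailing text that has no terminator after it.
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }
}
```

For example, `LineSplitter.splitLines("a\r\nb\n\nc\rd")` yields the five lines `a`, `b`, an empty line, `c` and `d`, so the blank line survives.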
ThiefMaster
Classic Mac OS used \r for its line terminator. Current versions of Mac OS (basically anything released in the last 10 years) use \n.
Rulmeq
Thanks for your answer. I thought this might be the only way ...
rafrafUk
+1  A: 

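Most of the time Java will handle differing types of line endings automatically: BufferedReader.readLine() silently accepts \n (Unix), \r\n (Windows) and \r (classic Mac) without bothering you, as long as you're reading through a character stream. See the docs for java.io.BufferedReader, java.io.FileReader and friends. A character stream also decodes the bytes into characters for you, but only for one charset at a time: FileReader uses the default charset, so wrap a FileInputStream in an InputStreamReader with an explicit charset if the files may be encoded differently.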

If you want to see the line separators explicitly, you'll need to read the file one character at a time with Reader.read(), or read it as a byte stream. See the docs for java.io.FileInputStream and friends.
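As a rough sketch of looking at the separators yourself, the class below reads one character at a time and counts how often each terminator style occurs. `LineEndingCounter` and its fields are hypothetical names made up for this example, and the counts are only a hint about where a file came from, not proof.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper: counts \n, \r and \r\n terminators seen in a character stream.
final class LineEndingCounter {
    int lf;    // lone \n (Unix style)
    int cr;    // lone \r (classic Mac style)
    int crlf;  // \r\n pairs (Windows/DOS style)

    static LineEndingCounter count(Reader in) throws IOException {
        LineEndingCounter result = new LineEndingCounter();
        boolean pendingCr = false; // true when the previous character was \r
        int c;
        while ((c = in.read()) != -1) {
            if (pendingCr) {
                pendingCr = false;
                if (c == '\n') {
                    result.crlf++; // the \r and this \n form one terminator
                    continue;
                }
                result.cr++;       // the previous \r stood alone
            }
            if (c == '\n') {
                result.lf++;
            } else if (c == '\r') {
                pendingCr = true;  // decide once the next character is known
            }
        }
        if (pendingCr) {
            result.cr++;           // a \r at the very end of the input
        }
        return result;
    }
}
```

Calling `read()` per character on an unbuffered reader is slow for real files, so in practice you would probably pass in a `BufferedReader`.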

Craig Trader
A: 

Is there a way to detect on which file system a file was created, other than checking the last character on the line?

No. And even checking the line termination sequence is only a hint. We can easily create files with DOS line termination on UNIX, and vice versa.

Or maybe a way of reading the files as text and not binary which I suspect is the issue?

Yes. Open the file using a file reader, wrap it in a buffered reader, and use the readLine() method to read the file a line at a time. This method recognizes a "\n", "\r" or "\r\n" as a line separator, and hence works for DOS, UNIX and Mac files.

Here's some typical code:

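    import java.io.BufferedReader;
    import java.io.FileReader;

    // FileReader decodes the bytes using the default charset
    BufferedReader br = new BufferedReader(new FileReader("somefile"));
    try {
        String line;
        // readLine() returns each line without its terminator, or null at end of file
        while ((line = br.readLine()) != null) {
            // process line
        }
    } finally {
        br.close(); // also closes the underlying FileReader
    }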
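
Applied to the line-length check from the question, readLine() might be used roughly like this. It is a sketch under assumptions: `LineLengthCheck`, `badLines` and the `expectedLength` parameter are hypothetical names, and the real import presumably has its own rules about what counts as a valid line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports lines whose length differs from the expected one.
final class LineLengthCheck {
    // Returns the 1-based numbers of all lines that are not expectedLength characters long.
    static List<Integer> badLines(Reader source, int expectedLength) throws IOException {
        List<Integer> bad = new ArrayList<>();
        BufferedReader br = new BufferedReader(source);
        String line;
        int lineNumber = 0;
        // readLine() strips \n, \r and \r\n, so the terminator never counts toward the length.
        while ((line = br.readLine()) != null) {
            lineNumber++;
            if (line.length() != expectedLength) {
                bad.add(lineNumber);
            }
        }
        return bad;
    }
}
```

Because the terminator is stripped before the length is measured, a file with \r\n endings and a file with \n endings give the same result. For a real file, the reader could come from `Files.newBufferedReader(path, charset)` with whatever charset the sending system uses.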
Stephen C