views:

617

answers:

6

I'm working with Quickbook's IIF file format and I need to write a parser to read and write IIF files and I'm running into some issues reading the files.

The files are simple, they're tab deliminated. Every line is either a table definition or a row. Definitions begin with'!' and the table name, and rows begin with just the table name. Here's the problem I'm running into: some of the fields allow line breaks.

When I first encountered this, I thought, okay just parse it tab by tab instead of line by line, but to do that I had to replace the line breaks with tabs, and wound up with more values than there were columns, but I wound up with the values with line breaks spread out across too many columns.

How would you parse such a file?

Edit: An example

!CUST   NAME REFNUM TIMESTAMP BADDR1 BADDR2 BADDR3 BADDR4 BADDR5 SADDR1 SADDR2 SADDR3 SADDR4 SADDR5 PHONE1 PHONE2 FAXNUM CONT1 CONT2 CTYPE TERMS TAXABLE LIMIT RESALENUM REP TAXITEM NOTEPAD SALUTATION COMPANYNAME FIRSTNAME MIDINIT LASTNAME CUSTFLD1 CUSTFLD2 CUSTFLD3 CUSTFLD4 CUSTFLD5 CUSTFLD6 CUSTFLD7 CUSTFLD8 CUSTFLD9 CUSTFLD10 CUSTFLD11 CUSTFLD12 CUSTFLD13 CUSTFLD14 CUSTFLD15 JOBDESC JOBTYPE JOBSTATUS JOBSTART JOBPROJEND JOBEND HIDDEN DELCOUNT
CUST    St. Mark 359 1176670332 Saint Mark Catholic Church 609 W Main St City, State Zip 
!CLASS  NAME REFNUM TIMESTAMP HIDDEN DELCOUNT
!INVITEM    NAME REFNUM TIMESTAMP INVITEMTYPE DESC PURCHASEDESC ACCNT ASSETACCNT COGSACCNT QNTY QNTY PRICE COST TAXABLE PAYMETH TAXVEND TAXDIST PREFVEND REORDERPOINT EXTRA CUSTFLD1 CUSTFLD2 CUSTFLD3 CUSTFLD4 CUSTFLD5 DEP_TYPE ISPASSEDTHRU HIDDEN DELCOUNT USEID
INVITEM Labor 1 1119915308 SERV Labor                        0
!TIMEACT    DATE JOB EMP ITEM PITEM DURATION PROJ NOTE XFERTOPAYROLL BILLINGSTATUS
TIMEACT 3/8/08 876 Development Jane Doe {Consultant} Labor  00:15  Renewing all domain name for 876 Development.
REIMBURSEMENT: 44.72 for one year renewal on all domain names.  N 1
TIMEACT 3/17/08 Greg:Bridge Jane Doe {Consultant} Labor  01:00  Preparing Studio N 1
TIMEACT 3/17/08 John Doe and Associates Jane Doe {Consultant} Labor  00:06  REIMBURSEMENT: Toner cartridge on ebay & Fuser from FastPrinters- ask wendell before invoicing to see if this fixed the problem
49.99 (include tax) toner
$175.18 (include tax) fuser
    N 1
TIMEACT 3/17/08 John Doe II Jane Doe {Consultant} Labor  01:00  Fixing Kandis's computer - replaced entire computer with similar system N 1
A: 

Why not replace line breaks with spaces instead of tabs?

Dmitri Nesteruk
because I want to retain the line breaks
Malfist
A: 

Maybe this can help you.

Or perhaps a quick grammar from ANTLR.

duffymo
Both are java and cannot be easily integrated into a .NET program unless I'm mistaken
Malfist
Correct, but the original post never mentioned .NET.
duffymo
But now I see that there WAS a C# tag that I missed. My apologies. I'll look closer next time.
duffymo
+1  A: 

It has been a while since I have done IIF but unless they have fixed it QuickBooks will barf on those line breaks anyway. It seems these folks have the same issue and they handled it with spaces.

Personally I would lean toward pipes or something that will clearly delineate the line break when it comes into QuickBooks. If you absolutely positively must have the line breaks, join the Intuit Developer Network and use the SDK to send these values to QB once your program imports them.

nshaw
These line breaks come from Quickbooks itself, and it will not barf on them. Using the qbXML isn't an option for this program, I wish it was :(
Malfist
Oh OK. So you are exporting from QB? Sorry. Everyone is usually trying to import so I assumed you were doing the same. In that case I would probably parse by token - just capture everything between one TIMEACT and the next.
nshaw
well I'm doing both exporting and importing. That snippit above came from a file that quickbooks has already imported.
Malfist
Complete with line breaks.
Malfist
A: 

I run into this sort of thing all the time. The key is to handling special cases like this when you're doing parsing is to replace the special case with something that's extremely unlikely to occur in the text, and then replace it again when you're done.

For example, you have line breaks in the output which can be readily detected with a Regex. Use Regex.Replace to convert them to something like LINEBREAK. Make it something that stands out in an editor for debugging. Then do the rest of your parsing as normal, and as the last step, replace the special token with the original value (or something new).

Wade Hatler
+4  A: 

I did it:

 public DataSet parseIIF(Stream file) {
            iifSet = new DataSet();
            String fileText;

            using (StreamReader sr = new StreamReader(file)) {
                fileText = sr.ReadToEnd();
            }
            //replace line breaks with tabs
            //fileText.Replace('\n', '\t');
            fileText.Replace("\r\n", "\n");
            fileText.Replace('\r', '\n');

            //split along tabs
            string[] lines = fileText.Split('\n');

            this.createTables(lines, iifSet);
            this.fillSet(lines, iifSet);

            return iifSet;
        }

        /// <summary>
        /// Reads an array of lines and parses them into tables for the dataset
        /// </summary>
        /// <param name="lines">String Array of lines from the iif file</param>
        /// <param name="iifSet">DataSet to be manipulated</param>
        private void fillSet(string[] lines, DataSet set) {
            //CODING HORROR
            //WARNING: I will monkey with the for loop index, be prepared!
            for (int i = 0; i < lines.Length; i++) {
                if (this.isTableHeader(lines[i])) {
                    //ignore this line, tables are alread defined
                    continue;
                }
                if (lines[i] == "" || lines[i] == "\r" || lines[i] == "\n\r" || lines[i] == "\n") {
                    //ignore lines that are empty if not inside a record
                    //probably the end of the file, it always ends with a blank line break
                    continue;
                }

                if (lines[i].IndexOf(";__IMPORTED__") != -1) {
                    continue;
                    //just signifying that it's been imported by quickbook's timer before, don't need it
                }

                string line = lines[i];
                while (!isFullLine(line, set)){
                    i++;            //<--------------------------- MONKEYING done here!
                    line += lines[i];       
                }
                //now, the line should be complete, we can parse it by tabs now
                this.parseRecord(line, set);
            }
        }

        private void parseRecord(string line, DataSet set) {
            if (isTableHeader(line)) {
                //we don't want to deal with headers here
                return;
            }

            String tablename = line.Split('\t')[0];
            //this just removes the first value and the line break from the last value
            String[] parameters = this.createDataRowParams(line);

            //add it to the dataset
            set.Tables[tablename].Rows.Add(parameters);
        }

        private bool isFullLine(string line, DataSet set) {
            if (isTableHeader(line)) {
                return true;    //assumes table headers won't have line breaks
            }
            int values = line.Split('\t').Length;
            string tableName = line.Split('\t')[0];
            int columns = set.Tables[tableName].Columns.Count;

            if (values < columns) {
                return false;
            } else {
                return true;
            }
        }

        private void createTables(string[] lines, DataSet set) {
            for (int index = 0; index < lines.Length; index++) {
                if (this.isTableHeader(lines[index])) {
                    set.Tables.Add(createTable(lines[index]));
                }
            }
        }

        private bool isTableHeader(string tab) {
            if (tab.StartsWith("!"))
                return true;
            else
                return false;
        }

        private bool isNewLine(string p) {
            if (p.StartsWith("!"))
                return true;
            if (iifSet.Tables[p.Split('\t')[0]] != null)    //that little mess there grabs the first record in the line, sorry about the mess
                return true;
            return false;
        }

    private DataTable createTable(string line) {
        String[] values = line.Split('\t');

        //first value is always the title
        //remove the ! from it
        values[0] = values[0].Substring(1);     //remove the first character
        DataTable dt = new DataTable(values[0]);
        values[0] = null;   //hide first title so it doesn't get used, cheaper than resetting array
        foreach (String name in values) {
            if (name == null || name == "")
                continue;
            DataColumn dc = new DataColumn(name, typeof(String));
            try {
                dt.Columns.Add(dc);
            } catch (DuplicateNameException) {
                //odd
                dc = new DataColumn(name + "_duplicateCol" + dt.Columns.Count.ToString());
                dt.Columns.Add(dc);
                //if there is a triple, just throw it
            }
        }

        return dt;
    }

   private string getTableName(string line) {
        String[] values = line.Split('\t');

        //first value is always the title
        if(values[0].StartsWith("!")){
            //remove the ! from it
            values[0] = values[0].Substring(1);     //remove the first character
        }
        return values[0];
    }

    private string[] createDataRowParams(string line) {
        string[] raw = line.Split('\t');
        string[] values = new string[raw.Length - 1];

        //copy all values except the first one
        for (int i = 0; i < values.Length; i++) {
            values[i] = raw[i + 1];
        }

        //remove last line break from the record
        if (values[values.Length - 1].EndsWith("\n")) {
            values[values.Length - 1] = values[values.Length - 1].Substring(0, values[values.Length - 1].LastIndexOf('\n'));
        } else if (values[values.Length - 1].EndsWith("\n\r")) {
            values[values.Length - 1] = values[values.Length - 1].Substring(0, values[values.Length - 1].LastIndexOf("\n\r"));
        } else if (values[values.Length - 1].EndsWith("\r")) {
            values[values.Length - 1] = values[values.Length - 1].Substring(0, values[values.Length - 1].LastIndexOf('\r'));
        }


        return values;
    }

    private string[] createDataRowParams(string line, int max) {
        string[] raw = line.Split('\t');

        int length = raw.Length - 1;
        if (length > max) {
            length = max;
        }

        string[] values = new string[length];
        for (int i = 0; i < length; i++) {
            values[i] = raw[i + 1];
        }

        if (values[values.Length - 1].EndsWith("\n")) {
            values[values.Length - 1] = values[values.Length - 1].Substring(0, values[values.Length - 1].LastIndexOf('\n'));
        } else if (values[values.Length - 1].EndsWith("\n\r")) {
            values[values.Length - 1] = values[values.Length - 1].Substring(0, values[values.Length - 1].LastIndexOf("\n\r"));
        } else if (values[values.Length - 1].EndsWith("\r")) {
            values[values.Length - 1] = values[values.Length - 1].Substring(0, values[values.Length - 1].LastIndexOf('\r'));
        }

        return values;
    }
Malfist
A: 

Ideas:

  1. Preprocess your file, replacing line break (assuming it's a single CR or LF) with some high-ascii character. Then parse by tab and finally replace said high-ascii with the line breaker afterwards.

  2. Rather than process line by line, process character by character. Mind you, that still only works if the embedded line breaks are somehow different from the standard CRLF at the end of a record.

boost