views:

66

answers:

2

I'm reading an Excel document via the DocumentFormat.OpenXml library. Is there a good way to find out how many columns it has?

The current code, which I've just come across while investigating a bug, does this:

public string getMaxColumnName(SheetData aSheetData)
{
    string lLastCellReference = aSheetData.Descendants<Cell>().Last().CellReference.InnerText;
    char[] lRowNumberIndex = lLastCellReference.IndexOfAny(new char[] { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' });
    return lLastCellReference.Substring(0, lRowNumberIndex);   
}

In English: find the last cell in the sheet, get its cell reference (like "CB99"), and retrieve everything before the first digit. The problem is that the last cell in the sheet is not necessarily in the rightmost column.

I have a test sheet that is a neat, rectangular table. It has 1000 rows filling columns A through M, so the function is supposed to return the string "M". But because there is an extraneous space character in cell C1522, that's counted as the last cell, so the function reports the max column as "C".

My initial impulse was to just replace that Last() call with some kind of Max(columnNumber). However, Cell apparently does not expose an actual column number, only this composite CellReference string. I don't think I want to be doing string-splitting inside a predicate there.

Is there a way to find the sheet's rightmost column, without having to parse the CellReference of every single cell?

+1  A: 

As I understand the format, there are various cases:

  1. If the file is not generated by Excel and the worksheet contains data in a way that there are no blank rows and there are no blank column within a row, but not necesarily every row has the same number of columns (which may be the case):

    You are pretty much screwed. The format allow for rows and cells references to be ignored in this case. You have to count all cell references in each row to get the maximum.

  2. If the file is not generated by Excel, but cells are populated sparse (which apparently is not the case):

    The last cell of each row holds the reference of the column it must be in the "r" attribute. You will have to convert the reference, though.

  3. If the file is generated by Excel:

    Usually, and I haven't found an Excel-generated file that doesn't, the worksheet part has a child named dimension, which has a "ref" attribute with the cell reference used by the worksheet i.e. "A1:M1001". It is only a case of using this to know the columns. Of course, it works only if the extraneous character does not come in a column after the table.

    Alternatively, every row usually, and every Excel-generated file I have seen has it, has an attribute called "spans" that has the columns that row uses. The "spans" attribute format is numeric, so in your example it would have a value "1:13" for every row in the table. Maybe you only have to check the first row this way.

Wilhelm
A: 

I have concluded that this is the wrong thing to do in the first place. The consuming code is never actually looking for the rightmost cell in the whole sheet. Generally, what it wants is the number of cells in a particular row-- either row 1, or a known table header location.

In fact, with the possible exception of rendering or printing, I can't come up with any situation where getting the whole sheet's max cell is necessary.

Therefore, I need to refactor slightly. I'm changing the function so it takes a sheet and a row index, and returns the column of the rightmost cell in that row. That is, it will now look like:

public string getMaxColumnIndex(SheetData aSheetData, int aRowIndex);

For the implementation of that, I can check the Row.Spans property when it exists, or else parse the cell reference of Row.ChildElements.Last().

Auraseer