views:

370

answers:

1

I'm working on a mysterious bug in the usually very good open source project Excel Data Reader. It's skipping values reading from my particular OpenXML .xlsx spreadsheet.

The problem is occurring in the ReadSheetRow method (demonstration code below). The source XML is saved by Excel and contains no whitespace which is when the strange behaviour occurs. However XML that has been reformatted with whitespace (e.g. in Visual Studio go to Edit, Advanced, Format Document) works completely fine!

Test data with whitespace:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"&gt;
    <sheetData>
        <row r="5" spans="1:73" s="7" customFormat="1">
            <c r="B5" s="12">
                <v>39844</v>
            </c>
            <c r="C5" s="8"/>
            <c r="D5" s="8"/>
            <c r="E5" s="8"/>
            <c r="F5" s="8"/>
            <c r="G5" s="8"/>
            <c r="H5" s="12">
                <v>39872</v>
            </c>
            <c r="I5" s="8"/>
            <c r="J5" s="8"/>
            <c r="K5" s="8"/>
            <c r="L5" s="8"/>
            <c r="M5" s="8"/>
            <c r="N5" s="12">
                <v>39903</v>
            </c>
        </row>
    </sheetData>
</worksheet>

Test data without whitespace:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"&gt;&lt;sheetData&gt;&lt;row r="5" spans="1:73" s="7" customFormat="1"><c r="B5" s="12"><v>39844</v></c><c r="C5" s="8"/><c r="D5" s="8"/><c r="E5" s="8"/><c r="F5" s="8"/><c r="G5" s="8"/><c r="H5" s="12"><v>39872</v></c><c r="I5" s="8"/><c r="J5" s="8"/><c r="K5" s="8"/><c r="L5" s="8"/><c r="M5" s="8"/><c r="N5" s="12"><v>39903</v></c></row></sheetData></worksheet>

Example code that demonstrates the problem:

Note that A is output after _xmlReader.Read(), B after ReadToDescendant, and C after ReadElementContentAsObject.

while (reader.Read())
{
    if (reader.NodeType != XmlNodeType.Whitespace) outStream.WriteLine(String.Format("*A* NodeType: {0}, Name: '{1}', Empty: {2}, Value: '{3}'", reader.NodeType, reader.Name, reader.IsEmptyElement, reader.Value));

    if (reader.NodeType == XmlNodeType.Element && reader.Name == "c")
    {
        string a_s = reader.GetAttribute("s");
        string a_t = reader.GetAttribute("t");
        string a_r = reader.GetAttribute("r");

        bool matchingDescendantFound = reader.ReadToDescendant("v");
        if (reader.NodeType != XmlNodeType.Whitespace) outStream.WriteLine(String.Format("*B* NodeType: {0}, Name: '{1}', Empty: {2}, Value: '{3}'", reader.NodeType, reader.Name, reader.IsEmptyElement, reader.Value));
        object o = reader.ReadElementContentAsObject();
        if (reader.NodeType != XmlNodeType.Whitespace) outStream.WriteLine(String.Format("*C* NodeType: {0}, Name: '{1}', Empty: {2}, Value: '{3}'", reader.NodeType, reader.Name, reader.IsEmptyElement, reader.Value));
    }
}

Test results for XML with whitespace:

*A* NodeType: XmlDeclaration, Name: 'xml', Empty: False, Value: 'version="1.0" encoding="UTF-8" standalone="yes"'
*A* NodeType: Element, Name: 'worksheet', Empty: False, Value: ''
*A* NodeType: Element, Name: 'sheetData', Empty: False, Value: ''
*A* NodeType: Element, Name: 'row', Empty: False, Value: ''
*A* NodeType: Element, Name: 'c', Empty: False, Value: ''
*B* NodeType: Element, Name: 'v', Empty: False, Value: ''
*A* NodeType: EndElement, Name: 'c', Empty: False, Value: ''
*A* NodeType: Element, Name: 'c', Empty: True, Value: ''
*B* NodeType: Element, Name: 'c', Empty: True, Value: ''
...

Test results for XML without whitespace:

*A* NodeType: XmlDeclaration, Name: 'xml', Empty: False, Value: 'version="1.0" encoding="UTF-8" standalone="yes"'
*A* NodeType: Element, Name: 'worksheet', Empty: False, Value: ''
*A* NodeType: Element, Name: 'sheetData', Empty: False, Value: ''
*A* NodeType: Element, Name: 'row', Empty: False, Value: ''
*A* NodeType: Element, Name: 'c', Empty: False, Value: ''
*B* NodeType: Element, Name: 'v', Empty: False, Value: ''
*C* NodeType: EndElement, Name: 'c', Empty: False, Value: ''
*A* NodeType: Element, Name: 'c', Empty: True, Value: ''
*B* NodeType: Element, Name: 'c', Empty: True, Value: ''
...

The pattern changes indicate an issue in ReadElementContentAsObject or possibly the location that ReadToDescendant moves the XmlReader to.

Does anyone know what might be happening here?

+1  A: 

It's fairly simple. As you can see from the output, the first time you're on the "B" line, you're positioned at the first 'v' Element. Then, you call ReadElementContentAsObject. That returns the text content of v, and "moves the reader past the end element tag." (of v). You are now pointing to a whitespace node if there is whitespace, or an EndElement node (of c) if there is not. Of course, your output doesn't print if it's whitespace. Either way, you then do a Read() and move on to the next element. In the case of the non-whitespace, you have lost the EndElement.

The problem is much worse in other situtations. When you do a ReadElementContentAsObject of a c (call it c1), you then move on the next c (c2). Then you do a Read, moving to c3, and lose c2 for good.

I'm not going to try to fix the real code. But it's clear what you need to worry about, moving the stream forward in more than one place. This is a common source of looping errors in general.

Matthew Flaschen