Which approach would you take to parse a large XML file (2 MB to 20 MB or more) that has no schema? I cannot infer one with XSD.exe because the file structure is odd; see the snippet below.

Options

1) XML deserialization (but, as noted, I don't have a schema and the XSD tool complains about the file contents), 2) LINQ to XML, 3) loading into an XmlDocument, 4) manual parsing with XmlReader and related classes.

This is a snippet of the XML file:

<?xml version="1.0" encoding="utf-8"?>
<xmlData date="29.04.2010 12:09:13">
 <Table>
  <ident>079186</ident>
  <stock>0</stock>
  <pricewotax>33.94000000</pricewotax>
  <discountpercent>0.00000000</discountpercent>
 </Table>
 <Table>
  <ident>079190</ident>
  <stock>1</stock>
  <pricewotax>10.50000000</pricewotax>
  <discountpercent>0.00000000</discountpercent>
  <pricebyquantity>
   <Table>
    <quantity>5</quantity>
    <pricewotax>10.00000000</pricewotax>
    <discountpercent>0.00000000</discountpercent>
   </Table>
   <Table>
    <quantity>8</quantity>
    <pricewotax>9.00000000</pricewotax>
    <discountpercent>0.00000000</discountpercent>
   </Table>
  </pricebyquantity>
 </Table>
</xmlData>
A: 

I would load it into an XmlDocument and then use XPath to process it accordingly. LINQ may be the best bet here, but I am not very familiar with it so I can't say.
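
As a rough illustration of the XmlDocument plus XPath route, here is a minimal sketch. The class name XPathSketch and the method CountInStock are made up for this example, and it assumes every top-level Table has a stock element, as in the snippet from the question.

```csharp
using System.Globalization;
using System.Xml;

// Hypothetical helper, not part of the original answer.
public static class XPathSketch
{
    // Counts the top-level <Table> items whose <stock> is greater than zero.
    public static int CountInStock(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml); // for a file on disk, use doc.Load(path) instead

        // "/xmlData/Table" selects only the top-level items, not the
        // <Table> elements nested under <pricebyquantity>.
        XmlNodeList items = doc.SelectNodes("/xmlData/Table");

        int count = 0;
        foreach (XmlNode item in items)
        {
            // Assumes <stock> is always present (true for the sample data).
            string text = item.SelectSingleNode("stock").InnerText;
            if (int.Parse(text, CultureInfo.InvariantCulture) > 0)
            {
                count++;
            }
        }
        return count;
    }
}
```

Because the whole document is held in memory, this is simplest when the file comfortably fits in RAM.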

Josh Stodola
I read somewhere that loading into XmlDocument could result in high memory consumption but I am not sure about it.
mare
Yes, it will have to load the entire file into memory. But 2-20MB should not be a major concern in this case.
Josh Stodola
A: 

Here's the XSD:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="xmlData">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" name="Table">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="ident" type="xs:int" />
              <xs:element name="stock" type="xs:int" />
              <xs:element name="pricewotax" type="xs:double" />
              <xs:element name="discountpercent" type="xs:double" />
              <xs:element minOccurs="0" name="pricebyquantity">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element maxOccurs="unbounded" name="Table">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element name="quantity" type="xs:int" />
                          <xs:element name="pricewotax" type="xs:double" />
                          <xs:element name="discountpercent" type="xs:double" />
                        </xs:sequence>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="date" type="xs:string" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>

Here's the serializable class:

//------------------------------------------------------------------------------
// <auto-generated>
//     This code was generated by a tool.
//     Runtime Version:2.0.50727.3603
//
//     Changes to this file may cause incorrect behavior and will be lost if
//     the code is regenerated.
// </auto-generated>
//------------------------------------------------------------------------------

// 
// This source code was auto-generated by xsd, Version=2.0.50727.1432.
// 
namespace StockInfo {
    using System.Xml.Serialization;


    /// <remarks/>
    [System.CodeDom.Compiler.GeneratedCodeAttribute("xsd", "2.0.50727.1432")]
    [System.SerializableAttribute()]
    [System.Diagnostics.DebuggerStepThroughAttribute()]
    [System.ComponentModel.DesignerCategoryAttribute("code")]
    [System.Xml.Serialization.XmlTypeAttribute(AnonymousType=true)]
    [System.Xml.Serialization.XmlRootAttribute(Namespace="", IsNullable=false)]
    public partial class xmlData {

        private xmlDataTable[] tableField;

        private string dateField;

        /// <remarks/>
        [System.Xml.Serialization.XmlElementAttribute("Table")]
        public xmlDataTable[] Table {
            get {
                return this.tableField;
            }
            set {
                this.tableField = value;
            }
        }

        /// <remarks/>
        [System.Xml.Serialization.XmlAttributeAttribute()]
        public string date {
            get {
                return this.dateField;
            }
            set {
                this.dateField = value;
            }
        }
    }

    /// <remarks/>
    [System.CodeDom.Compiler.GeneratedCodeAttribute("xsd", "2.0.50727.1432")]
    [System.SerializableAttribute()]
    [System.Diagnostics.DebuggerStepThroughAttribute()]
    [System.ComponentModel.DesignerCategoryAttribute("code")]
    [System.Xml.Serialization.XmlTypeAttribute(AnonymousType=true)]
    public partial class xmlDataTable {

        private int identField;

        private int stockField;

        private double pricewotaxField;

        private double discountpercentField;

        private xmlDataTableTable[] pricebyquantityField;

        /// <remarks/>
        public int ident {
            get {
                return this.identField;
            }
            set {
                this.identField = value;
            }
        }

        /// <remarks/>
        public int stock {
            get {
                return this.stockField;
            }
            set {
                this.stockField = value;
            }
        }

        /// <remarks/>
        public double pricewotax {
            get {
                return this.pricewotaxField;
            }
            set {
                this.pricewotaxField = value;
            }
        }

        /// <remarks/>
        public double discountpercent {
            get {
                return this.discountpercentField;
            }
            set {
                this.discountpercentField = value;
            }
        }

        /// <remarks/>
        [System.Xml.Serialization.XmlArrayItemAttribute("Table", IsNullable=false)]
        public xmlDataTableTable[] pricebyquantity {
            get {
                return this.pricebyquantityField;
            }
            set {
                this.pricebyquantityField = value;
            }
        }
    }

    /// <remarks/>
    [System.CodeDom.Compiler.GeneratedCodeAttribute("xsd", "2.0.50727.1432")]
    [System.SerializableAttribute()]
    [System.Diagnostics.DebuggerStepThroughAttribute()]
    [System.ComponentModel.DesignerCategoryAttribute("code")]
    [System.Xml.Serialization.XmlTypeAttribute(AnonymousType=true)]
    public partial class xmlDataTableTable {

        private int quantityField;

        private double pricewotaxField;

        private double discountpercentField;

        /// <remarks/>
        public int quantity {
            get {
                return this.quantityField;
            }
            set {
                this.quantityField = value;
            }
        }

        /// <remarks/>
        public double pricewotax {
            get {
                return this.pricewotaxField;
            }
            set {
                this.pricewotaxField = value;
            }
        }

        /// <remarks/>
        public double discountpercent {
            get {
                return this.discountpercentField;
            }
            set {
                this.discountpercentField = value;
            }
        }
    }
}
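
For completeness, deserializing with XmlSerializer could look roughly like the sketch below. To keep it self-contained it uses a trimmed, hand-written stand-in (StockFile and StockRow, both hypothetical names) rather than the generated class above, and it maps only a few elements. Note that ident is read as a string here as an assumption of this sketch, since parsing "079186" as an int drops the leading zero.

```csharp
using System.IO;
using System.Xml.Serialization;

// Hypothetical, trimmed model of the file; only some elements are mapped.
[XmlRoot("xmlData")]
public class StockFile
{
    [XmlAttribute("date")]
    public string Date;

    [XmlElement("Table")]
    public StockRow[] Rows;
}

public class StockRow
{
    [XmlElement("ident")]
    public string Ident; // string keeps leading zeros such as "079186"

    [XmlElement("stock")]
    public int Stock;
}

public static class DeserializeSketch
{
    // Deserializes the whole document into the trimmed model above.
    // Elements that are not mapped (pricewotax, pricebyquantity, ...) are
    // skipped by XmlSerializer's default handling of unknown elements.
    public static StockFile Load(TextReader input)
    {
        var serializer = new XmlSerializer(typeof(StockFile));
        return (StockFile)serializer.Deserialize(input);
    }
}
```

With the generated class, the same call shape should apply with typeof(StockInfo.xmlData) in place of typeof(StockFile).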

One caveat: deserializing may not be the most performant way to parse a 20MB file. XmlReader is likely the fastest way to do it, but that means doing things manually.
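
To give an idea of what "doing things manually" with XmlReader involves, here is a minimal streaming sketch. XmlReaderSketch and ReadIdents are hypothetical names, and the code relies on the observation that ident appears only on top-level items in the sample.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the original answer.
public static class XmlReaderSketch
{
    // Streams through the input and collects the text of every <ident>
    // element without building an in-memory tree of the document.
    public static List<string> ReadIdents(TextReader input)
    {
        var idents = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            // ReadToFollowing advances to the next <ident> start tag, or
            // returns false once the end of the document is reached.
            while (reader.ReadToFollowing("ident"))
            {
                // Reads the element's text and moves past its end tag.
                idents.Add(reader.ReadElementContentAsString());
            }
        }
        return idents;
    }
}
```

Reading more fields per item means tracking element names and nesting depth yourself, which is the manual work the caveat refers to.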

code4life
BTW, I generated the xsd using the XmlSchemaInference class.
code4life
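
For anyone wanting to reproduce the inference step mentioned in the comment above, a minimal sketch follows. SchemaInferenceSketch and InferXsd are hypothetical names; the exact code used to produce the XSD in the answer is not shown in the thread, so this is only one way it could be done.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Schema;

// Hypothetical helper, not the commenter's actual code.
public static class SchemaInferenceSketch
{
    // Infers an XSD from a sample document and returns it as text.
    public static string InferXsd(string xml)
    {
        var inference = new XmlSchemaInference();
        XmlSchemaSet schemas;
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            schemas = inference.InferSchema(reader);
        }

        // A document without namespaces normally yields a single schema,
        // but the set is enumerated to cover the general case.
        var output = new StringWriter();
        foreach (XmlSchema schema in schemas.Schemas())
        {
            schema.Write(output);
        }
        return output.ToString();
    }
}
```

The inferred types depend on the sample values, which is presumably why ident came out as xs:int in the schema above even though the values carry leading zeros.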
Thanks, though I decided to go with LINQ to XML to parse this, so I'm not relying on serialization.
mare
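
The thread does not show the LINQ to XML code that was eventually used, so the following is only a sketch of what that approach might look like for this file. StockItem, its members, and LinqToXmlSketch are hypothetical names, and reading the price as decimal rather than double is a choice made for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical result type for this sketch.
public sealed class StockItem
{
    public string Ident;            // kept as text so leading zeros survive
    public int Stock;
    public decimal PriceWithoutTax;
    public int QuantityTierCount;   // number of nested pricebyquantity rows
}

public static class LinqToXmlSketch
{
    public static List<StockItem> Parse(string xml)
    {
        XDocument doc = XDocument.Parse(xml); // XDocument.Load(path) for a file

        // Root.Elements("Table") returns only direct children of <xmlData>,
        // so the nested <Table> rows are not mistaken for top-level items.
        return doc.Root.Elements("Table")
            .Select(t => new StockItem
            {
                // The explicit XElement conversions parse culture-invariantly.
                Ident = (string)t.Element("ident"),
                Stock = (int)t.Element("stock"),
                PriceWithoutTax = (decimal)t.Element("pricewotax"),
                // <pricebyquantity> is optional; a missing element gives 0.
                QuantityTierCount = t.Elements("pricebyquantity")
                                     .Elements("Table")
                                     .Count()
            })
            .ToList();
    }
}
```

Like XmlDocument, XDocument loads the whole file into memory, which should be acceptable for the 2 MB to 20 MB range discussed above.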