Our C++ application reads configuration data from XML files that look something like this:

<data>
 <value id="FOO1" name="foo1" size="10" description="the foo" ... />
 <value id="FOO2" name="foo2" size="10" description="the other foo" ... />
 ...
 <value id="FOO300" name="foo300" size="10" description="the last foo" ... />
</data>

The complete application configuration consists of ~2500 of these XML files (which translates into more than 1.5 million key/value attribute pairs). The XML files come from many different sources/teams and are validated against a schema. However, sometimes the nodes look like this:

or this:

To make this process fast, we are using Expat to parse the XML documents. Expat exposes the attributes as an array - like this:

void ExpatParser::StartElement(const XML_Char* name, const XML_Char** atts)
{
 // The attributes are stored in an array of XML_Char* where:
 //  the nth element is the 'key'
 //  the n+1 element is the value
 //  the final element is NULL
 for (int i = 0; atts[i]; i += 2) 
 {
  std::string key = atts[i];
  std::string value = atts[i + 1];
  ProcessAttribute (key, value);
 }
}

This puts all the responsibility onto our ProcessAttribute() function to read the 'key' and decide what to do with the value. Profiling the app has shown that ~40% of the total XML Parsing time is dealing with these attributes by name/string.

The overall process could be sped up dramatically if I could guarantee/enforce the order of the attributes (for starters, no string comparisons in ProcessAttribute()). For example, if the 'id' attribute were always the first attribute, we could deal with it directly:

void ExpatParser::StartElement(const XML_Char* name, const XML_Char** atts)
{
 // The attributes are stored in an array of XML_Char* where:
 //  the nth element is the 'key'
 //  the n+1 element is the value
 //  the final element is NULL
 ProcessID (atts[1]);
 ProcessName (atts[3]);
 //etc.
}

According to the W3C schema specs, I can use <xs:sequence> in an XML schema to enforce the order of elements - but it doesn't seem to work for attributes - or perhaps I'm using it incorrectly:

<xs:element name="data">
 <xs:complexType>
  <xs:sequence>
   <xs:element name="value" type="value_type" minOccurs="1" maxOccurs="unbounded" />
  </xs:sequence>
 </xs:complexType>
</xs:element>

<xs:complexType name="value_type">
 <!-- This doesn't work -->
 <xs:sequence>
  <xs:attribute name="id" type="xs:string" />
  <xs:attribute name="name" type="xs:string" />
  <xs:attribute name="description" type="xs:string" />
 </xs:sequence>
</xs:complexType>

Is there a way to enforce attribute order in an XML document? If the answer is "no" - could anyone perhaps suggest an alternative that wouldn't carry a huge runtime performance penalty?

+1  A: 

I don't think XML Schema supports that - attributes are just defined and restricted by name, e.g. they have to match a particular name - but I don't see how you could define an order for those attributes in XSD.

I don't know of any other way to make sure attributes on an XML node come in a particular order - not sure if any of the other XML schema mechanisms like Schematron or Relax NG would support that...

marc_s
It's not a restriction of XML schema but of XML itself. See st.stoqnov's comment.
Porges
A: 

Just a guess, but can you try adding use="required" to each of your attribute specifications?

<xs:complexType name="value_type">
 <!-- This doesn't work -->
 <xs:sequence>
  <xs:attribute name="id" type="xs:string" use="required" />
  <xs:attribute name="name" type="xs:string" use="required" />
  <xs:attribute name="description" type="xs:string" use="required" />
 </xs:sequence>
</xs:complexType>

I'm wondering if the parser is being slowed down by allowing optional attributes, when it appears your attributes will always be there.

Again, just a guess.

EDIT: XML 1.0 spec says that attribute order is not significant. http://www.w3.org/TR/REC-xml/#sec-starttags

Therefore, XSD won't enforce any order. But that doesn't mean that parsers can't be fooled into working quickly, so I'm keeping the above answer published in case it actually works.

Tenner
A: 

I'm pretty sure there's no way to enforce attribute order in an XML document. I'm going to assume that you can insist on it via a business process or other human factors, such as a contract or other document.

What if you just assumed that the first attribute was "id", and tested the name to be sure? If it matches, use the value; if not, you can try to get the attribute by name or throw out the document.

While not as efficient as calling out the attribute by its ordinal, some non-zero number of times you'll be able to guess that your data providers have delivered XML to spec. The rest of the time, you can take other action.
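
A minimal sketch of that check-then-fall-back idea, assuming Expat is built with the default XML_Char = char. The names kExpected and CollectAttributes are made up for illustration, and the expected order is only a guess at what the data providers would agree to:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical agreed-upon attribute order (an assumption for this sketch).
static const char* const kExpected[] = {"id", "name", "size", "description"};
static const int kExpectedCount = 4;

// atts is an Expat-style array: key, value, key, value, ..., NULL.
// Fills out[s] with the value of attribute kExpected[s], or nullptr if absent.
// Returns true only if every attribute sat at its expected position.
bool CollectAttributes(const char** atts, const char* out[kExpectedCount])
{
    assert(atts != nullptr);
    for (int s = 0; s < kExpectedCount; ++s) out[s] = nullptr;

    bool inOrder = true;
    for (int i = 0; atts[i]; i += 2) {
        const int slot = i / 2;
        // Fast path: a single comparison against the name expected here.
        if (slot < kExpectedCount && std::strcmp(atts[i], kExpected[slot]) == 0) {
            out[slot] = atts[i + 1];
            continue;
        }
        // Slow path: the provider did not follow the agreed order, so search by name.
        inOrder = false;
        for (int s = 0; s < kExpectedCount; ++s) {
            if (std::strcmp(atts[i], kExpected[s]) == 0) {
                out[s] = atts[i + 1];
                break;
            }
        }
    }
    return inOrder;
}
```

When the function returns false you could log the file name, so the team that produced it can be asked to fix the ordering. Unknown attribute names are silently ignored here; rejecting the document instead would be an equally reasonable choice.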

Chris McCall
+1  A: 

The answer is no, alas. I'm shocked by your 40% figure. I find it hard to believe that turning "foo" into ProcessFoo takes that long. Are you sure the 40% doesn't include the time taken to execute ProcessFoo?

Is it possible to access the attributes by name using this Expat thing? That's the more traditional way to access attributes. I'm not saying it's going to be faster, but it might be worth a try.

Gary McGill
'Expat thing' is one of the fastest parsers around. Don't be shocked, you've just been sold XML by MSFT and IBM and it doesn't scale :-)
rama-jka toti
Gary, you're correct. I didn't elaborate on exactly what the ProcessAttribute() function does because I thought it was off-topic to the original question... We are parsing these XML documents on application startup and dumping the element data into an sqlite database for later processing. The sqlite API allows binding of parameters by index - so if I could be confident that the XML attributes were always in the same order as the parameters in the Insert statement, things would go much (much) faster.
Kassini
+5  A: 

According to the XML specification,

the order of attribute specifications in a start-tag or empty-element tag is not significant

You can check it at section 3.1

st.stoqnov
A: 

From what I recall, Expat is a non-validating parser, and better for it, so you can probably scrap that XSD idea. Depending on order is not a good idea in many XML approaches either (XSD got criticised on element order a heck of a lot back in the day, for example by pro- and anti-sellers of XML Web Services at MSFT).

Do your own custom encoding: either extend your logic for more efficient lookup or dig into the parser source. It is trivial to write tooling around an efficient replacement encoding while shielding the software agents and users from it. You want to do this so that it is easily migrated while preserving backward compatibility and reversibility. Also, go for fixed-size constraints and attribute-name translation.

[ Consider yourself lucky with Expat :) and its raw speed. Imagine how CLR devs love XML scaling facilities; they routinely send 200MB on the wire in the process of 'just querying the database'. ]

rama-jka toti
A: 

Why not either put the attribute names/values into a map or just sort them, then process them?
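
Here is a rough sketch of the map variant, assuming C++17 and XML_Char = char. SlotTable, ToSlots and the slot numbers are hypothetical; the slots could, for instance, follow the parameter order of the insert statement the asker mentions. Whether hashing each name beats a chain of string comparisons is something only profiling can tell:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <unordered_map>

constexpr std::size_t kSlotCount = 4;

// Name -> slot table, built once on first use. The slot numbers are
// hypothetical and stand for whatever fixed order the consumer needs.
const std::unordered_map<std::string_view, std::size_t>& SlotTable()
{
    static const std::unordered_map<std::string_view, std::size_t> table = {
        {"id", 0}, {"name", 1}, {"size", 2}, {"description", 3}};
    return table;
}

// Reorders an Expat-style attribute array (key, value, ..., NULL) into
// fixed slots. Missing attributes stay nullptr; unknown names are skipped.
std::array<const char*, kSlotCount> ToSlots(const char** atts)
{
    assert(atts != nullptr);
    std::array<const char*, kSlotCount> slots{};  // value-initialized to nullptr
    const auto& table = SlotTable();
    for (std::size_t i = 0; atts[i]; i += 2) {
        const auto it = table.find(atts[i]);  // no std::string copy is made
        if (it != table.end()) slots[it->second] = atts[i + 1];
    }
    return slots;
}
```

After the call, slots[0] always holds the 'id' value regardless of where it appeared in the document, so the rest of the code can work by index.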

jon hanson
+2  A: 

XML attributes don't have an order, therefore there is no order to enforce.

If you want something ordered, you need XML elements. Or something different from XML. JSON, YAML and bEncode, for example, have both maps (which are unordered) and sequences (which are ordered).

Jörg W Mittag
A: 

As others have pointed out, no, you can't rely on attribute ordering.

If I had any process at all involving 2,500 XML files and 1.5 million key/value pairs, I would get that data out of XML and into a more usable form as soon as I possibly could. A database, a binary serialization format, whatever. You're not getting any advantage out of using XML (other than schema validation). I'd update my store every time I got a new XML file, and take parsing 1.5 million XML elements out of the main flow of my process.

Robert Rossney