I understand that there's no universal answer to the attribute vs. element debate (and I read through the other questions I saw on this), but any insight into this particular circumstance would be greatly appreciated.

In our case we're going to be receiving very large amounts of master and transactional data from a system of record, to be merged into our own database (upwards of a gigabyte, nightly). The information we receive maps essentially one-to-one onto the records in our tables, so, for example, a list of customers would be (in our old version):

<Custs>
  <Cust ID="101" LongName="Large customer" ShortName="LgCust" Loc="SE"/>
  <Cust ID="102" LongName="Small customer" ShortName="SmCust" Loc="NE"/>
  ....
</Custs>

However, we've been discussing the merits of moving to a structure that's more element-based, for example:

<Custs>
  <Cust ID="101">
    <LongName>Large Customer</LongName>
    <ShortName>LgCust</ShortName>
    <Loc>SE</Loc>
  </Cust>
  <Cust ID="102">
    <LongName>Small Customer</LongName>
    <ShortName>SmCust</ShortName>
    <Loc>NE</Loc>
  </Cust>
  ....
</Custs>

Because the files are so large I don't think we'll be using a DOM parser to try to load these into memory, nor do we have any need of locating particular items in the files. So my question is: in this case, is one form (elements or attributes) generally preferred over the other when you've got large amounts of data and performance demands to consider?

+1  A: 

If performance is the only requirement, I think you have to go with attributes, simply because they take up less space. I don't see any advantage to elements.
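To put a rough number on the space argument, here is a minimal sketch that serializes one customer record from the question in both shapes and compares the byte counts. The helper names `as_attributes` and `as_elements` are made up for this illustration, and the figures only reflect this one sample record, not real data.

```python
from xml.sax.saxutils import escape, quoteattr

# Field names taken from the sample in the question.
FIELDS = ("LongName", "ShortName", "Loc")

def as_attributes(cust_id, record):
    """Serialize one record in the attribute-based shape (hypothetical helper)."""
    attrs = " ".join(f"{name}={quoteattr(record[name])}" for name in FIELDS)
    return f"<Cust ID={quoteattr(cust_id)} {attrs}/>"

def as_elements(cust_id, record):
    """Serialize one record in the element-based shape (hypothetical helper)."""
    children = "".join(
        f"<{name}>{escape(record[name])}</{name}>" for name in FIELDS
    )
    return f"<Cust ID={quoteattr(cust_id)}>{children}</Cust>"

record = {"LongName": "Large customer", "ShortName": "LgCust", "Loc": "SE"}
attr_form = as_attributes("101", record)
elem_form = as_elements("101", record)

# Compare the encoded sizes of the two shapes for the same record.
print(len(attr_form.encode("utf-8")), len(elem_form.encode("utf-8")))
```

Every field name appears twice in the element form (open and close tag) and once in the attribute form, so the gap grows with the length of the names rather than the values.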

David Norman
We experimented with different formats and measured their performance. Ultimately we decided to stick with attributes. Thanks for your input!
inyourcorner
+1  A: 

I have used both approaches on very large files, with DOM and with a line-by-line reader. You certainly need a line-by-line reader to get good performance on very large files. Beyond that, my gut feeling is that attributes are more efficient, but I have no hard data to back that opinion up!
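As a rough illustration of the non-DOM approach, here is a sketch in Python that uses `xml.etree.ElementTree.iterparse` as a stand-in for whatever streaming reader you have available (that choice is my assumption, not something from the question). The generator name `iter_customers` is made up; it accepts either shape from the question and discards each record once it has been handed over.

```python
import io
import xml.etree.ElementTree as ET

def iter_customers(source):
    """Yield one dict per <Cust>, merging attributes and child elements."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "Cust":
            row = dict(elem.attrib)
            row.update({child.tag: child.text for child in elem})
            yield row
            root.clear()  # drop finished records so memory use stays flat

# Tiny in-memory sample; a real run would pass a file path or file object.
sample = io.BytesIO(
    b'<Custs><Cust ID="101" LongName="Large customer" '
    b'ShortName="LgCust" Loc="SE"/></Custs>'
)
for customer in iter_customers(sample):
    print(customer)
```

Because the consumer sees the same dictionary for both shapes, this also makes it easy to time the two formats against each other on your own data.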

Werg38
+1  A: 

If someone's providing you with 1 GB of data at a time and you care about performance at all, you should really re-examine the decision to use XML as your transmission format. You're not parsing the data into a DOM, so you're not really able to make use of the benefits that XML gives you over (say) CSV -- ensuring well-formedness, schema validation, transformation, querying, etc.

And now you're considering moving to a format where half of the data that you're going to be processing is markup. What kind of sense does that make?

I come from the when-the-only-tool-you-have-is-a-hammer-you-tend-to-perceive-all-problems-as-nails school of XML, and even I wouldn't use XML for this.

Robert Rossney
As it turns out, we're considering suggesting (to the client) that we wrap the actual CSV data in XML headers so that we can specify a schema (and keep it extensible to a point) while taking advantage of what CSV gives you, which is raw, lean data. Thanks for the answer, Robert.
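One possible shape for that idea, purely as a sketch: the envelope below (a `columns` attribute on `<Custs>` and a CDATA body) is a made-up layout for illustration, not a format we have agreed on, and `read_wrapped_csv` is a hypothetical helper name.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical envelope: column names in an attribute, rows as CSV in CDATA.
SAMPLE = """<Custs columns="ID,LongName,ShortName,Loc"><![CDATA[
101,Large customer,LgCust,SE
102,Small customer,SmCust,NE
]]></Custs>"""

def read_wrapped_csv(document):
    """Parse the XML envelope, then hand the CSV body to the csv module."""
    root = ET.fromstring(document)
    columns = root.get("columns").split(",")
    body = io.StringIO(root.text.strip())
    return [dict(zip(columns, row)) for row in csv.reader(body)]

rows = read_wrapped_csv(SAMPLE)
print(rows[0])
```

This version reads the whole payload into memory, so it only demonstrates the layout; a nightly file of the size mentioned in the question would need the body streamed instead.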
inyourcorner
+1  A: 

The "attribute way" is preferable if you plan to validate your XML prior to processing by means of a plain old DTD. A DTD has no way to constrain the text content of an element, but it can apply some basic rules to attribute values (such as a list of allowed values).

If you plan to use XSD or no validation at all then I would choose the most readable form, which IMHO is the "element way".

No matter where the XML comes from, validation should be the first step in processing it. It makes your application safer and your code smaller, since many checks are made before your code even touches the XML data. XSD should be the preferred choice, since its syntax allows you to check data types as well (e.g. float or date fields inside element or attribute content). The downside is that it is much more complex than a plain DTD file.

Fernando Miguélez
+1  A: 

Exchanging the data in XML format isn't necessarily bad just because it is a large data set.

However, if you are exchanging really big XML files you might want to consider compressing them before transmission using zip, GZIP, etc. in order to save time and bandwidth.
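As a quick way to see what compression buys, here is a minimal sketch using Python's standard `gzip` module on a synthetic file in the attribute-based shape from the question. The generated records are invented and far more repetitive than real customer data, so treat the ratio as an upper bound rather than a prediction.

```python
import gzip

# Synthetic record template; real data will be less repetitive than this.
record = '  <Cust ID="{0}" LongName="Customer {0}" ShortName="C{0}" Loc="SE"/>\n'
body = "".join(record.format(i) for i in range(10000))
payload = ("<Custs>\n" + body + "</Custs>\n").encode("utf-8")

# Compress in memory and compare sizes before and after.
compressed = gzip.compress(payload)
print(len(payload), len(compressed))
```

Since tag and attribute names repeat on every record, much of the markup overhead compresses away, which narrows the size difference between the attribute and element forms on the wire.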

If you are exchanging database info, consider formatting the information as SQL statements (and even compressing those SQL files before sending), especially if that is what you wind up converting the XML into anyway.

Mads Hansen