At work we are being asked to create XML files to pass data to another offline application, which will then create a second XML file to pass back so that we can update some of our data. During the process we have been discussing the structure of the XML file with the other application's team.

The sample I came up with is essentially something like:

<INVENTORY>
   <ITEM serialNumber="something" location="something" barcode="something">
      <TYPE modelNumber="something" vendor="something"/> 
   </ITEM>
</INVENTORY>

The other team said that this was not industry standard and that attributes should only be used for meta data. They suggested:

<INVENTORY>
   <ITEM>
      <SERIALNUMBER>something</SERIALNUMBER>
      <LOCATION>something</LOCATION>
      <BARCODE>something</BARCODE>
      <TYPE>
         <MODELNUMBER>something</MODELNUMBER>
         <VENDOR>something</VENDOR>
      </TYPE>
   </ITEM>
</INVENTORY>

The reason I suggested the first is that the resulting file is much smaller. There will be roughly 80000 items in the file during transfer. Their suggestion turns out in practice to be three times larger than the one I suggested. I searched for the mysterious "Industry Standard" that was mentioned, but the closest I could find was that XML attributes should only be used for meta data, along with a note that the debate is over what actually counts as meta data.

After the long winded explanation (sorry) how do you determine what is meta data, and when designing the structure of an XML document how should you decide when to use an attribute or an element?

+2  A: 

Both methods for storing an object's properties are perfectly valid. You should start from pragmatic considerations. Try answering the following questions:

  1. Which representation leads to faster data parsing/generation?
  2. Which representation leads to faster data transfer?
  3. Does readability matter?

    ...

aku
+3  A: 

the million dollar question!

first off, don't worry too much about performance now. you will be amazed at how quickly an optimized xml parser will rip through your xml. more importantly, what is your design for the future: as the XML evolves, how will you maintain loose coupling and interoperability?

more concretely, you can make the content model of an element more complex but it's harder to extend an attribute.

Adam
+3  A: 

It is arguable either way, but your colleagues are right in the sense that the XML should be used for "markup" or meta-data around the actual data. For your part, you are right in that it's sometimes hard to decide where the line between meta-data and data is when modeling your domain in XML. In practice, what I do is pretend that anything in the markup is hidden, and only the data outside the markup is readable. Does the document make some sense in that way?

XML is notoriously bulky. For transport and storage, compression is highly recommended if you can afford the processing power. XML compresses well, sometimes phenomenally well, because of its repetitiveness. I've had large files compress to less than 5% of their original size.
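
To put rough numbers on the size argument, here is a minimal sketch that builds both layouts from the question and compares raw and gzip-compressed sizes. The field values are made up for illustration, and the ratios you get will depend on your real data, so treat it as a way to measure rather than as a result.

```python
import gzip
import xml.etree.ElementTree as ET

# Field names taken from the question; the values generated below are invented.
FIELDS = ("serialNumber", "location", "barcode")


def build_attribute_style(count):
    """Build an INVENTORY document with item fields stored as attributes."""
    root = ET.Element("INVENTORY")
    for i in range(count):
        item = ET.SubElement(root, "ITEM", {name: f"{name}-{i}" for name in FIELDS})
        ET.SubElement(item, "TYPE", modelNumber=f"model-{i % 50}", vendor=f"vendor-{i % 10}")
    return ET.tostring(root, encoding="utf-8")


def build_element_style(count):
    """Build the same data with every field stored as a child element."""
    root = ET.Element("INVENTORY")
    for i in range(count):
        item = ET.SubElement(root, "ITEM")
        for name in FIELDS:
            ET.SubElement(item, name.upper()).text = f"{name}-{i}"
        type_el = ET.SubElement(item, "TYPE")
        ET.SubElement(type_el, "MODELNUMBER").text = f"model-{i % 50}"
        ET.SubElement(type_el, "VENDOR").text = f"vendor-{i % 10}"
    return ET.tostring(root, encoding="utf-8")


def size_report(count=1000):
    """Return raw and gzip-compressed byte counts for both layouts."""
    payloads = {
        "attributes": build_attribute_style(count),
        "elements": build_element_style(count),
    }
    return {
        label: {"raw": len(data), "gzip": len(gzip.compress(data))}
        for label, data in payloads.items()
    }


if __name__ == "__main__":
    for label, sizes in size_report().items():
        print(label, sizes)
```

Calling `size_report(80000)` approximates the item count mentioned in the question.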

Another point to bolster your position is that while the other team is arguing about style (in that most XML tools will handle an all-attribute document just as easily as an all-#PCDATA document) you are arguing practicalities. While style can't be totally ignored, technical merits should carry more weight.

erickson
+4  A: 

When in doubt, KISS -- why mix attributes and elements when you don't have a clear reason to use attributes? If you later decide to define an XSD, that will end up being cleaner as well. Then if you even later decide to generate a class structure from your XSD, that will be simpler as well.

Luke
+34  A: 

I use this rule of thumb:

  1. An Attribute is something that is self-contained, i.e., a color, an ID, a name.
  2. An Element is something that does or could have attributes of its own or contain other elements.

So yours is close. I would have done something like:

EDIT: Updated the original example based on feedback below.

   <ITEM serialNumber="something">
      <barcode encoding="Code39">something</barcode>
      <LOCATION>XYX</LOCATION>
      <TYPE modelNumber="something">
         <VENDOR>YYZ</VENDOR>
      </TYPE>
   </ITEM>
Chuck
This is a good rule of thumb ;) I'm using it myself and I think many people do.
ivan_ivanovich_ivanoff
John Ballinger
Good point, John!
Chuck
Really late to the party, but the special ASCII char argument is wrong -- that's what escaping is for, both for attributes and text data.
micahtan
@micahtan: If you must consider escaping, it will be more expensive to serialize/deserialize. If you know it is never going to happen, you can just skip that extra overhead, and it will be much faster to execute.
awe
@awe: I was unaware that it was more expensive to deserialize w/escaping. Do you have a source reference for that? I deal primarily with .NET, and I haven't seen anything that mentions it. As far as "never going to happen", I've been burned quite a few times by that. If your XML contains numbers or codes, it may be a safe assumption. Proper names or user input text has a nasty habit of introducing those characters, particularly the ampersand and both single and double quotes.
micahtan
@micahtan: Escaping isn't enough. The rules for attributes are different. John Ballinger's note is correct. In particular, the character '<' can't be in an attribute regardless of escaping. See http://www.w3.org/TR/xml/#CleanAttrVals
Don Roby
@donroby - Sorry, that would be my mistake in communicating. By escaping, I mean XML encoding. '<' = &lt; etc. It seems odd to me to decide between an attribute or element based on the characters that make up the content instead of the meaning of the content.
micahtan
@micahtan - No, I think actually the XML encoded version is not allowed. But I'm basing this just on reading the spec - perhaps I'll write a JUnit/XMLUnit test to check instead of continuing to trust my understanding of the spec, which is indeed quite difficult to decipher. But in deciding between attribute or value you really do have to consider what they in fact allow, and attributes seem to allow less than elements.
Don Roby
I have written a JUnit/XMLUnit test to check my understanding of the spec as noted above, and it seems that Java's SAX implementation quite happily accepts encoded '<' in attribute values. I still suspect it's not a good idea, but I can't back it up with code...
Don Roby
@donroby: it's incorrect. The replacement text of `&lt;` is `&#60;`, which is a character reference, not an entity reference. `&lt;` is OK in attributes. See: http://www.w3.org/TR/REC-xml/#sec-predefined-ent
Porges
@John: if this is a problem then there's something in your toolchain which isn't producing valid XML. I don't think this is a reason to choose between attributes or elements. (Furthermore, you can't "just add CDATA tags" around user-input because it might contain `]]>`!)
Porges
+1  A: 

It's largely a matter of preference. I use Elements for grouping and attributes for data where possible as I see this as more compact than the alternative.

For example I prefer.....

<?xml version="1.0" encoding="utf-8"?>
<data>
    <people>
        <person name="Rory" surname="Becker" age="30" />
        <person name="Travis" surname="Illig" age="32" />
        <person name="Scott" surname="Hanselman" age="34" />
    </people>
</data>

...Instead of....

<?xml version="1.0" encoding="utf-8"?>
<data>
    <people>
        <person>
            <name>Rory</name>
            <surname>Becker</surname>
            <age>30</age>
        </person>
        <person>
            <name>Travis</name>
            <surname>Illig</surname>
            <age>32</age>
        </person>
        <person>
            <name>Scott</name>
            <surname>Hanselman</surname>
            <age>34</age>
        </person>
    </people>
</data>

However, if I have data that is not easily represented in, say, 20-30 characters, or that contains many quotes or other characters that need escaping, then I'd say it's time to break out the elements... possibly with CDATA blocks.

<?xml version="1.0" encoding="utf-8"?>
<data>
    <people>
        <person name="Rory" surname="Becker" age="30" >
            <comment>A programmer whose interested in all sorts of misc stuff. His Blog can be found at http://rorybecker.blogspot.com and he's on twitter as @RoryBecker</comment>
        </person>
        <person name="Travis" surname="Illig" age="32" >
            <comment>A cool guy for who has helped me out with all sorts of SVn information</comment>
        </person>
        <person name="Scott" surname="Hanselman" age="34" >
            <comment>Scott works for MS and has a great podcast available at http://www.hanselminutes.com </comment>
        </person>
    </people>
</data>
Rory Becker
This is flat wrong I'm afraid - you should follow W3C guidelines: http://www.w3schools.com/DTD/dtd_el_vs_attr.asp - XML should not be formed on readability or on making it "compact" - but rather using elements or attributes correctly for the purpose which they were designed for.
Vidar
I'm sorry, but this is misleading. The W3schools page is not W3C guidelines. The W3C XML recommendation (in which I was a participant) allows elements and attributes to be used according to the needs and styles of the users.
peter.murray.rust
+6  A: 

It may depend on your usage. XML that represents structured data generated from a database may work well with field values ultimately placed in attributes.

However, XML used as a message transport would often be better off using more elements.

For example, let's say we had this XML, as proposed in the answer:

<INVENTORY>
   <ITEM serialNumber="something" barcode="something">
      <LOCATION>XYX</LOCATION>
      <TYPE modelNumber="something">
         <VENDOR>YYZ</VENDOR>
      </TYPE>
   </ITEM>
</INVENTORY>

Now we want to send the ITEM element to a device to print the barcode, but there is a choice of encoding types. How do we represent the encoding type required? Suddenly we realise, somewhat belatedly, that the barcode wasn't a single atomic value; it may need to be qualified with the encoding required when printed.

   <ITEM serialNumber="something">
      <barcode encoding="Code39">something</barcode>
      <LOCATION>XYX</LOCATION>
      <TYPE modelNumber="something">
         <VENDOR>YYZ</VENDOR>
      </TYPE>
   </ITEM>

The point is that unless you are building some kind of XSD or DTD along with a namespace to fix the structure in stone, you may be best served by leaving your options open.

IMO XML is at its most useful when it can be flexed without breaking existing code using it.
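
As a minimal sketch of what "flexed without breaking existing code" can look like: once the barcode is an element, a reader written before the `encoding` attribute existed keeps working after it is added. The sample values and the `DEFAULT_ENCODING` fallback are hypothetical, not part of the question.

```python
import xml.etree.ElementTree as ET

# Document as it looked before anyone needed an encoding (sample values invented).
OLD_ITEM = """<ITEM serialNumber="SN-1">
  <barcode>12345</barcode>
</ITEM>"""

# Same item after the format was extended with an attribute on <barcode>.
NEW_ITEM = """<ITEM serialNumber="SN-1">
  <barcode encoding="Code39">12345</barcode>
</ITEM>"""

# Hypothetical fallback used when the attribute is absent.
DEFAULT_ENCODING = "unspecified"


def read_barcode(item_xml):
    """Return (value, encoding) for an ITEM, tolerating a missing encoding."""
    item = ET.fromstring(item_xml)
    barcode = item.find("barcode")
    return barcode.text, barcode.get("encoding", DEFAULT_ENCODING)
```

Had the barcode stayed an attribute on ITEM, the encoding would have needed a second, loosely related attribute or a breaking change to the reader.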

AnthonyWJones
Good point on the "barcode", I rushed my example and would have definitely broken that out into its own element. Also good point on the XSD/DTD.
Chuck
+12  A: 

Some of the problems with attributes are:

* attributes cannot contain multiple values (child elements can)
* attributes are not easily expandable (for future changes)
* attributes cannot describe structures (child elements can)
* attributes are more difficult to manipulate by program code
* attribute values are not easy to test against a DTD

If you use attributes as containers for data, you end up with documents that are difficult to read and maintain. Try to use elements to describe data. Use attributes only to provide information that is not relevant to the data.

Don't end up like this (this is not how XML should be used):

<note day="12" month="11" year="2002" to="Tove" from="Jani" heading="Reminder"  body="Don't forget me this weekend!"> </note>

Source: http://www.w3schools.com/DTD/dtd_el_vs_attr.asp

First point is incorrect, see: http://www.w3.org/TR/xmlschema-2/#derivation-by-list
Porges
I'd say that the first point is correct and `list` is a partial workaround for this problem. There can't be multiple attributes with the same name. With `list`, the attribute still has only one value, which is a whitespace-separated list of some datatype. The separator characters are fixed, so you cannot have multiple values if a single value of the wanted datatype can contain whitespace. This rules out having, for example, multiple addresses in one "address" attribute.
jasso
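
A small sketch of the limitation described in the comment above, assuming the consumer simply splits the attribute value on whitespace (the attribute names and values are invented for illustration):

```python
import xml.etree.ElementTree as ET

# "sizes" holds three whitespace-free values; "addresses" is meant to hold two
# addresses, but each address itself contains spaces.
SAMPLE = '<ITEM sizes="10 20 30" addresses="1 Main St 2 Oak Ave"/>'


def list_attribute(element, name):
    """Split a whitespace-separated list attribute into its items."""
    return element.get(name, "").split()


item = ET.fromstring(SAMPLE)
sizes = list_attribute(item, "sizes")          # three clean items
addresses = list_attribute(item, "addresses")  # six fragments, not two addresses
```

Child elements avoid the problem because each value gets its own delimiters.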
+2  A: 

Use elements for data and attributes for meta data (data about the element's data).

If an element is showing up as a predicate in your select strings, you have a good sign that it should be an attribute. Likewise if an attribute never is used as a predicate, then maybe it is not useful meta data.
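
To make "showing up as a predicate" concrete, here is a sketch using the limited XPath support in Python's `xml.etree.ElementTree`. The two sample documents and their values are invented; the first filters on an attribute, the second on a child element's text.

```python
import xml.etree.ElementTree as ET

ATTRIBUTE_DOC = """<INVENTORY>
  <ITEM serialNumber="SN-1"><LOCATION>A1</LOCATION></ITEM>
  <ITEM serialNumber="SN-2"><LOCATION>B7</LOCATION></ITEM>
</INVENTORY>"""

ELEMENT_DOC = """<INVENTORY>
  <ITEM><SERIALNUMBER>SN-1</SERIALNUMBER><LOCATION>A1</LOCATION></ITEM>
  <ITEM><SERIALNUMBER>SN-2</SERIALNUMBER><LOCATION>B7</LOCATION></ITEM>
</INVENTORY>"""


def location_by_attribute(doc, serial):
    """Select the ITEM whose serialNumber attribute matches, return its location."""
    # Assumes serial contains no quote characters; this is not a safe way to
    # build queries from untrusted input.
    item = ET.fromstring(doc).find(f"./ITEM[@serialNumber='{serial}']")
    return item.findtext("LOCATION") if item is not None else None


def location_by_element(doc, serial):
    """Select the ITEM whose SERIALNUMBER child text matches, return its location."""
    item = ET.fromstring(doc).find(f"./ITEM[SERIALNUMBER='{serial}']")
    return item.findtext("LOCATION") if item is not None else None
```

Both forms are selectable, so the predicate test is a hint about which values act as keys rather than a technical constraint.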

Remember that XML is supposed to be machine readable not human readable and for large documents XML compresses very well.

Michael J
+1  A: 

I agree with feenster. Stay away from attributes if you can. Elements are evolution friendly and more interoperable between web service toolkits. You'd never find these toolkits serializing your request/response messages using attributes. This also makes sense since our messages are data (not metadata) for a web service toolkit.

bagheera
+2  A: 

There is no universal answer to this question (I was heavily involved in the creation of the W3C spec). XML can be used for many purposes - text-like documents, data and declarative code are three of the most common. I also use it a lot as a data model. There are aspects of these applications where attributes are more common and others where child elements are more natural. There are also features of various tools that make it easier or harder to use them.

XHTML is one area where attributes have a natural use (e.g. in class='foo'). Attributes have no order and this may make it easier for some people to develop tools. OTOH attributes are harder to type without a schema. I also find namespaced attributes (foo:bar="zork") are often harder to manage in various toolsets. But have a look at some of the W3C languages to see the mixture that is common. SVG, XSLT, XSD, MathML are some examples of well-known languages and all have a rich supply of attributes and elements. Some languages even allow more-than-one-way to do it, e.g.

<foo title="bar"/>

or

<foo>
  <title>bar</title>
</foo>

(Note that these are NOT equivalent syntactically and require explicit support in processing tools.)
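
A minimal sketch of what that explicit support might look like in consuming code, using Python's `xml.etree.ElementTree`. The `read_title` helper is hypothetical, and preferring the attribute when both forms are present is an arbitrary choice made here.

```python
import xml.etree.ElementTree as ET

ATTRIBUTE_FORM = '<foo title="bar"/>'
ELEMENT_FORM = "<foo><title>bar</title></foo>"


def read_title(xml_text):
    """Return the title whether it is an attribute or a child element."""
    foo = ET.fromstring(xml_text)
    title = foo.get("title")
    if title is None:
        # Fall back to the child-element form; None if neither is present.
        title = foo.findtext("title")
    return title
```

Without the fallback, `foo.get("title")` on the element form returns `None`, which is the sense in which the two spellings are not interchangeable.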

My advice would be to have a look at common practice in the area closest to your application and also consider what toolsets you may wish to apply.

Finally make sure that you differentiate namespaces from attributes. Some XML systems (e.g. Linq) represent namespaces as attributes in the API. IMO this is ugly and potentially confusing.
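
For contrast, here is a sketch of an API that keeps the two apart. Python's `xml.etree.ElementTree` does not report the `xmlns:foo` declaration as an attribute and instead expands the prefix into the attribute's name. The namespace URI is a made-up example.

```python
import xml.etree.ElementTree as ET

# One namespace declaration, one namespaced attribute, one plain attribute.
SAMPLE = '<doc xmlns:foo="urn:example:foo" foo:bar="zork" plain="1"/>'


def attribute_names(xml_text):
    """Return the attribute names ElementTree reports for the root element."""
    return sorted(ET.fromstring(xml_text).attrib)
```

The names come back as `plain` and `{urn:example:foo}bar`; the declaration itself is not in the mapping.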

peter.murray.rust
A: 

Just a couple of corrections to some bad info:

@John Ballinger: Attributes can contain any character data. < > & " ' need to be escaped to &lt; &gt; &amp; &quot; and &apos; , respectively. If you use an XML library, it will take care of that for you.
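
As a quick check of the "the library will take care of that" claim, this sketch round-trips a value containing all five special characters through Python's `xml.etree.ElementTree`, once as an attribute and once as element text. The test string is invented.

```python
import xml.etree.ElementTree as ET

# A value containing &, <, >, double quotes and a single quote.
TRICKY = 'AT&T <"5 o\'clock"> special'


def round_trip(value):
    """Serialize value as both an attribute and element text, then parse it back."""
    item = ET.Element("ITEM", location=value)
    ET.SubElement(item, "LOCATION").text = value
    serialized = ET.tostring(item, encoding="unicode")
    parsed = ET.fromstring(serialized)
    return serialized, parsed.get("location"), parsed.findtext("LOCATION")


serialized, from_attribute, from_text = round_trip(TRICKY)
```

Both values come back unchanged, and the serialized form contains the escaped entities rather than the raw characters.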

Hell, an attribute can contain binary data such as an image, if you really want, just by base64-encoding it and making it a data: URL.

@feenster: Attributes can contain space-separated multiple items in the case of IDS or NAMES, which would include numbers. Nitpicky, but this can end up saving space.

brianary
Not just ids or names. They can contain space-separated lists of just about anything.
John Saunders
+2  A: 

Others have covered how to differentiate attributes from elements, but from a more general perspective, putting everything in attributes because it makes the resulting XML smaller is wrong.

XML is not designed to be compact but to be portable and human readable. If you want to decrease the size of the data in transit then use something else (such as Google's protocol buffers).

Patrick
A: 

"XML" stands for "eXtensible Markup Language". A markup language implies that the data is text, marked up with metadata about structure or formatting.

XHTML is an example of XML used the way it was intended:

<p><span lang="es">El Jefe</span> insists that you
    <em class="urgent">MUST</em> complete your project by Friday.</p>

Here, the distinction between elements and attributes is clear. Text elements are displayed in the browser, and attributes are instructions about how to display them (although there are a few tags that don't work that way).

Confusion arises when XML is used not as a markup language, but as a data serialization language, in which the distinction between "data" and "metadata" is more vague. So the choice between elements and attributes is more-or-less arbitrary except for things that can't be represented with attributes (see feenster's answer).

dan04
A: 

I found this really good resource:

http://www.ibm.com/developerworks/xml/library/x-eleatt.html

Laurens Holst