As an amateur software developer (I'm still in academia) I've written a few schemas for XML documents. I routinely run into design flubs that cause ugly-looking XML documents because I'm not entirely certain what the semantics of XML exactly are.

My assumptions:

<property> value </property>

property = value

<property attribute="attval"> value </property>

A property with a special descriptor, the attribute.

<parent>
  <child> value </child>
</parent>

The parent has a characteristic "child" which has the value "value."

<tag />

"Tag" is a flag or it directly translates to text. I'm not sure on this one.

<parent>
  <child />
</parent>

"child" describes "parent." "child" is a flag or boolean. I'm not sure on this one, either.

Ambiguity arises if you want to do something like representing cartesian coordinates:

<coordinate x="0" y="1" />

<coordinate> 0,1 </coordinate>

<coordinate> <x> 0 </x> <y> 1 </y> </coordinate>

Which one of those is most correct? I would lean towards the third based upon my current conception of XML schema design, but I really don't know.

What are some resources that succinctly describe how to design XML schemas effectively?

A: 

The best approach I've found is to look at the relationships in the data you are trying to represent.

HTH

cheers,

Rob

Rob Wells
My question is not which relationships to map to which other relationships. My question is which relationships should map to each syntactic unit of XML, such as tags and attributes.
evizaer
But my response is based on the fact that you should know the relationships between the entities in your data. For example, one email address per user name.
Rob Wells
+2  A: 

XML is somewhat subjective in terms of design - I don't think there are exact guidelines for how the elements and attributes should be laid out, but I tend to go with using elements to represent 'things' and attributes to represent singular attributes/properties of them.

In terms of the coordinates example either would be perfectly acceptable, but my inclination would be to go with <coordinate x="" y=""/> because it is somewhat more terse, and makes the document a little more readable if you have many of them.

The most important thing, though, is the namespace of the schema. Make sure that (a) you have one, and (b) you have a version in there so you can change things in the future and issue a new version. Versions may be either dates or numbers, e.g.

http://company.com/2008/12/something/somethingelse/
urn:company-com:2008-12:something:somethingelse

http://company.com/v1/something/somethingelse/
urn:company-com:v1:something:somethingelse
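
A consumer can use the versioned namespace to decide how to read a document. The following is a minimal sketch in Python, assuming URL-style namespaces like the ones above; the `order` element, the v2 namespace, and the `SUPPORTED_VERSIONS` table are hypothetical names introduced purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from namespace URI to a schema version label.
SUPPORTED_VERSIONS = {
    "http://company.com/v1/something/somethingelse/": "v1",
    "http://company.com/v2/something/somethingelse/": "v2",
}


def schema_version(xml_text):
    """Return the version label for the namespace of the root element."""
    root = ET.fromstring(xml_text)
    # ElementTree reports a namespaced tag as "{namespace-uri}localname".
    if not root.tag.startswith("{"):
        raise ValueError("root element has no namespace")
    namespace = root.tag[1:].split("}", 1)[0]
    if namespace not in SUPPORTED_VERSIONS:
        raise ValueError("unsupported schema namespace: " + namespace)
    return SUPPORTED_VERSIONS[namespace]


# Hypothetical document written against the v1 namespace.
document = (
    '<order xmlns="http://company.com/v1/something/somethingelse/">'
    "<item>widget</item>"
    "</order>"
)
print(schema_version(document))
```

Because the version is part of the namespace, a document written against an older schema can still be recognised and routed to the matching reader.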
Greg Beech
A: 

I often find myself struggling with the same issue, but I find that in practice it doesn't really matter; XML is just data.

That said, I usually prefer the "if it says something about the node it's an attribute, otherwise it's a child node" approach.

In your example I'd go for:

<coordinate>
    <x>0</x>
    <y>1</y>
</coordinate>

because the x and y are properties of a coordinate, not actually saying anything about the xml, but about the object represented by it.

Kris
A: 

Two rules for XML: 1) Don't use XML. 2) When you think you might have come across a situation where it would be a good idea to use XML, see rule 1.

I know I'll probably get downvoted for this, but I have designed lots of interfaces to systems that used XML, and never once have I found that using the XML format made it any easier, better or faster than it could have been done in another format and it usually makes it harder.

Kevin
Yay let's go back to the good old days of SWIFT and EDI! Who needs all these human readable, easily parseable, or self describing formats? Or things like standardised description, validation, transformation and query languages?
Greg Beech
It's true, XML is hard if you have no clue! OTOH, Kevin has a valid point if you take it with a grain of salt. XML is often not the best way to do a lot of things that unfortunately are already XML.
Kris
How is this:
<xml><tran><tranid>1</tranid><amount>50.00</amount><trancode>42</trancode></tran></xml>
any more human readable, easily parsable or self describing than:
tranid,amount,trancode
1,50.00,42
Kevin
This is an example of a useless, harmful answer. Imagine you're asking what the best practices for driving are and the answer you get is: "Don't drive cars!". Such answers should not only be downvoted, but be selected as the most useless or worst answers on SO. Surely he doesn't know how to drive himself! :)
Dimitre Novatchev
To use your car analogy: so if someone asked you, "My car is really old, the brakes don't work, the gas tank leaks and the tailpipe drags causing sparks, how can I drive it safely?" you would say "Make sure to use your turn signal"?
Kevin
Unless you are using good libraries or stuff like XSDObjectGen, it's true that XML is a PITA. But the crucial point of XML is that it's well supported in every language and there are lovely APIs to read, write, and access it.
dr. evil
Late to the party, but to the readability comment, XML does a superior job of nesting and hierarchies than a delimited format. Tabular data is one thing, complex/hierarchical data is another.
micahtan
+1  A: 

When designing an XML-based format, it's often good to think about what you're representing. Try mocking some XML data that fits the purpose you intend. Once you've got something you're satisfied with that meets your requirements, develop the schema to validate it.

When designing a format, I tend to use elements for holding data content and attributes for applying characteristics to the data like an id, a name, a type, or some other metadata about the data an element contains.

In that regard, an XML representation for coordinates might be:

<coordinate type="cartesian">
  <ordinate name="x">0</ordinate>
  <ordinate name="y">1</ordinate>
</coordinate>

This caters for different coordinate systems. If you knew they'd always be cartesian, a better implementation might be:

<coordinate>
  <x>0</x>
  <y>1</y>
</coordinate>

Of course, the latter could lead to a more verbose schema as each element type would need declaring (though I'd hope a complex type was defined to actually do the hard work for these elements).

Just as in programming, there are often multiple ways of achieving the same ends, but there is no right and wrong in many situations, just better and worse. The important thing is to remain consistent and try to be intuitive so that when others look at your schema, they can understand what you were trying to achieve.

You should always version your schemas and ensure that XML written against your schema indicates it as such. If you don't properly version the XML then making addendums to the schema while supporting XML written to the older schema will be much more difficult.

Jeff Yates
A: 

I guess it depends on how complex or simple the structure is.
I would make x and y attributes, unless x and y have details of their own.

You can look at HTML or any other form of markup that is used to define things (XAML in the case of WPF, MXML in the case of Flash) to understand why something is chosen as an attribute rather than a child node.

If x and y are not going to be repeated, they can be attributes.

Let's say a coordinate has multiple x and y values. XML doesn't allow multiple attributes with the same name on one element, so in that case you will have to use child nodes.
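
The duplicate-attribute restriction is easy to check with a parser. This is a small sketch using Python's standard `xml.etree.ElementTree`; the sample documents and the `is_well_formed` helper are made up for illustration. The first document repeats the `x` attribute and should be rejected as not well-formed, while the second repeats `x` as a child element and should parse.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Two attributes named "x" on one element: not well-formed XML.
duplicate_attributes = '<coordinate x="0" x="5" y="1" />'

# The same data as repeated child elements: perfectly legal.
repeated_children = "<coordinate><x>0</x><x>5</x><y>1</y></coordinate>"

print(is_well_formed(duplicate_attributes))
print(is_well_formed(repeated_children))

# Repeated children are read back as a list, in document order.
x_values = [x.text for x in ET.fromstring(repeated_children).findall("x")]
print(x_values)
```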

shahkalpesh
A: 

There's nothing inherently wrong with using an element or sub-element for every value you'd like to represent.

The main consideration is that sometimes it's cleaner to use an attribute. Since an element can only have one attribute of a given name, you're stuck with a 1:1 cardinality. If you're representing the data as a child element, you can use whatever cardinality you'd like (or be open to extending it later).

Rob Wells' response above is right: it depends on the relationships you're trying to model.

Any time there's clearly never going to be anything but a 1:1 relationship, an attribute may be cleaner.

BQ
+2  A: 

I do not know any good learning resource about how to design XML document models (schemas are just a formal way of specifying document models).

In my opinion, one crucial insight to XML is that it is not a language: it is a syntax. And each document model is a separate language.

Different cultures will each use XML in their own special way. Even within W3C specifications you can smell Lisp in dash-separated-names of XSLT, and Java in the camelCaseNames of XML Schema. Similarly, different application domains will call for different XML idioms.

Narrative document models such as HTML or DocBook tend to put printable text in text nodes and metadata in element names and attributes.

More object-oriented document models such as SVG make little or no use of text nodes and instead only use elements and attributes.

My personal rules of thumb for document model design go something like this:

  • If it is the sort of free-form tag soup that requires mixed content, use HTML and DocBook as sources of inspiration. The other rules are only relevant otherwise.
  • If a value is going to be composite or hierarchical, use elements. XML data should require no further parsing, except for established idioms such as IDREFS which are simple space-separated sequences.
  • If a value may need to occur more than once, use elements.
  • If a value may need to be refined further, or enriched later, use elements.
  • If a value is clearly atomic (boolean, number, date, identifier, simple label), and may occur at most once, then use an attribute.

Another way to say it could be:

  • If it's narrative, it's not object oriented.
  • If it's object oriented, model objects as elements and atomic attributes as attributes.
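
As a rough illustration of the object-oriented case, here is a sketch in Python that builds a document following these rules of thumb. The `polygon` and `point` element names and the `build_polygon` helper are hypothetical: atomic values that occur at most once (`id`, `closed`, `x`, `y`) become attributes, while the repeatable, composite value (a point) becomes a child element.

```python
import xml.etree.ElementTree as ET


def build_polygon(polygon_id, closed, points):
    """Build a hypothetical <polygon> element from an id, a flag and points."""
    # Atomic values that occur at most once are stored as attributes.
    polygon = ET.Element(
        "polygon",
        {"id": polygon_id, "closed": "true" if closed else "false"},
    )
    for x, y in points:
        # A point may occur many times and is composite, so it is an element;
        # its own x and y are atomic and occur once, so they are attributes.
        ET.SubElement(polygon, "point", {"x": str(x), "y": str(y)})
    return polygon


polygon = build_polygon("p1", True, [(0, 1), (2, 3)])
print(ET.tostring(polygon, encoding="unicode"))
```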

EDIT: Some people seem to like to forgo attributes entirely. There's nothing wrong with that, but I dislike it as it bloats documents and makes them unnecessarily hard to read and write by hand.

ddaa
A: 

One of my biggest general recommendations is to never store multiple logical pieces of data inside a single node (whether it be a text node or an attribute node). Otherwise, you end up needing your own parsing logic on top of the XML parsing logic you normally get for free from your framework.

So in your coordinate example, <coordinate x="0" y="1" /> and <coordinate> <x> 0 </x> <y> 1 </y> </coordinate> are both reasonable to me.

But <coordinate> 0,1 </coordinate> is not very good because it's storing two logical pieces of data (the X-coordinate and the Y-coordinate) in a single XML node forcing the consumer to parse the data outside of their XML parser. And while splitting a string by a comma is pretty simple, there are still some ambiguities like what happens if there's an extra comma at the end.
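
To make the difference concrete, here is a sketch in Python (standard library only; the three function names are made up for this example) that reads each of the three forms. The first two need nothing beyond the XML API, whereas the packed form needs hand-written splitting and a decision about malformed input such as a trailing comma, which this sketch chooses to reject.

```python
import xml.etree.ElementTree as ET


def from_attributes(xml_text):
    """Read <coordinate x="..." y="..." />."""
    element = ET.fromstring(xml_text)
    return float(element.get("x")), float(element.get("y"))


def from_children(xml_text):
    """Read <coordinate><x>...</x><y>...</y></coordinate>."""
    element = ET.fromstring(xml_text)
    return float(element.findtext("x")), float(element.findtext("y"))


def from_packed_text(xml_text):
    """Read <coordinate>x,y</coordinate>, which needs a second parsing step."""
    element = ET.fromstring(xml_text)
    parts = element.text.split(",")
    # The XML parser cannot help here: the format of the text is our problem.
    if len(parts) != 2:
        raise ValueError("expected 'x,y' but got %r" % element.text)
    return float(parts[0]), float(parts[1])


print(from_attributes('<coordinate x="0" y="1" />'))
print(from_children("<coordinate> <x> 0 </x> <y> 1 </y> </coordinate>"))
print(from_packed_text("<coordinate> 0,1 </coordinate>"))
```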

C. Dragon 76
+3  A: 
6eorge Jetson
+1  A: 

In our Java projects we often use JAXB to automatically parse XML and transform it into an object structure. I guess other languages have something similar. A suitable generator can automatically create the object structure in your chosen programming language. This often makes processing XML much easier, while still giving you a portable XML representation for communication between systems.

If you do use such an automatic mapping, you will find that it constrains the schema considerably: <coordinate> <x> 0 </x> <y> 1 </y> </coordinate> is the way to go unless you want to do special magic in the translation. You will end up with a class Coordinate with two attributes x and y, each with the appropriate type as declared in the schema.
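
The idea can be sketched without JAXB. The following is not JAXB and not generated code; it is a rough Python analogue in which the `Coordinate` dataclass stands in for the generated class and the hypothetical `bind` helper maps each declared field to the child element of the same name, converting the text with the field's declared type.

```python
from dataclasses import dataclass, fields
import xml.etree.ElementTree as ET


@dataclass
class Coordinate:
    """Stand-in for the class a binding generator would produce."""
    x: float
    y: float


def bind(cls, xml_text):
    """Populate a dataclass from child elements named after its fields."""
    root = ET.fromstring(xml_text)
    values = {}
    for field in fields(cls):
        text = root.findtext(field.name)
        if text is None:
            raise ValueError("missing <%s> element" % field.name)
        # field.type is the annotated type (float here); use it to convert.
        values[field.name] = field.type(text)
    return cls(**values)


coordinate = bind(Coordinate, "<coordinate> <x> 0 </x> <y> 1 </y> </coordinate>")
print(coordinate)
```

With one child element per field, the mapping stays a simple loop; the attribute form or the packed "0,1" form would each need special handling in `bind`.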

hstoerr
+3  A: 

See the tutorial:

     "XML Schemas: Best Practices" by Roger Costello.

I also recommend:

     Priscilla Walmsley's book "Definitive XML Schema".

     Jeni Tennison's XML Schema pages

Hope this helped.

Cheers,

Dimitre Novatchev

Dimitre Novatchev
A: 

http://www.xmlpatterns.com has a great list of methods for designing an XML grammar.

As stated above it is a subjective practice, but this site gives some useful directions, such as “use this pattern to solve problem X”…or “advantages and disadvantages are…”.