When designing an XML feed for structured data, what is good practice, and what anti-patterns are there?

I'd like answers that cover XML structure and content, and/or transport mechanisms.

Transport Mechanisms

With current technologies, is FTP/SFTP a good choice? Are there cases where it is the best fit as a solution?

Generally I prefer HTTP pull feeds, but what weaknesses does using HTTP have?

What other feed mechanisms should be considered with their pros and cons?

XML Structure and Content

When no suitable DTD/schema already exists, what practices can be followed to come up with a good XML design?

Two anti-patterns for this I have already given in my answer below.

But what should I be doing when designing a feed? I'd like to hear about tags vs attributes, how relational data (esp. many-to-many relationships) should be conveyed in XML, etc.

Note: I have completely rewritten the question, as even with the bounty offered it wasn't getting a lot of love. (The old version is in the edit history if you want to see it. This version should be pertinent to the answers already given)

+2  A: 

Without a DTD/schema you have no way of knowing if a feed is valid until your code encounters a problem. So for me schemas are very important, both as an XML consumer and a producer.

Even a simple schema is useful, defining the elements, how many times they occur, etc. A detailed schema, with restrictions or enumerations as needed, is even nicer. When I have those I can minimise the number of errors in the XML I produce, or I can validate the whole file if it's sent to me and reject it as non-compliant if necessary. It's just a neat, standard way of performing input validation.

blowdart
+4  A: 

A good feed has

1) A schema, because that way you can check it programmatically and you know when it's been changed - saves lots of arguments

2) Tells you when it's down

3) Works consistently

4) Will handle stops, starts, pause, rewind gracefully

5) Has a test service that fully exercises all the existing feed features

6) Has a new-features service for sandbox development

Realistically I've only worked with feeds that deliver 1 and sometimes 2, but we can dream.

MrTelly
A: 

One personal bugbear of mine at the moment is timestamps without timezone information. If you are dealing with feeds from all over the world, a time without a timezone is meaningless.
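To illustrate the point, here is a minimal sketch, assuming the feed uses ISO 8601 timestamps. The function name and the sample value are made up for illustration. A consumer can refuse a timestamp that has no offset rather than guess which zone was meant:

```python
from datetime import datetime, timezone


def parse_feed_timestamp(text: str) -> datetime:
    """Parse an ISO 8601 timestamp and return it normalised to UTC.

    Raises ValueError if the timestamp carries no UTC offset, because
    such a value cannot be placed on a global timeline.
    """
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no timezone offset: {text!r}")
    return parsed.astimezone(timezone.utc)


# Hypothetical sample value, not taken from any real feed.
with_offset = parse_feed_timestamp("2009-03-14T15:30:00+01:00")
print(with_offset.isoformat())  # 2009-03-14T14:30:00+00:00
```

Passing "2009-03-14T15:30:00" (no offset) to the same function raises ValueError instead of silently assuming a zone.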

Edit: And feeds which don't include an encoding attribute, or include one, but then don't respect it!

DanSingerman
Common recommendation for timestamps: always use UTC.
Bill Karwin
+1  A: 

It's a good question, but I don't know how much further it goes than schema good, !schema bad.

I've had to consume feeds which failed to provide or provided broken schemas and realistically all you can do is transform those into namespace-less clones, which is workable but risky as hell.

I18N and especially number formats and datestamps are a massive problem. Best practice is of course declaring your format in the doc, and preferably defaulting to UTC time.

I guess the only other good practice I can suggest applies when you consume multiple feeds which need to interact: don't try to deal with them on their own terms. Instead, the first thing you should do is deserialise them to a standard object or transform them to a standard internal schema.
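A rough sketch of that last idea, using two invented supplier formats and a hypothetical internal `Score` class (none of these names come from a real feed). Each feed gets its own small adapter, and everything downstream only sees the internal type:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Score:
    """Internal representation shared by every feed (hypothetical)."""
    player: str
    points: int


# Two made-up supplier formats carrying the same information.
FEED_A = "<scores><score player='Smith' points='15'/></scores>"
FEED_B = "<results><result><name>Jones</name><pts>30</pts></result></results>"


def from_feed_a(xml_text: str) -> List[Score]:
    """Adapter for the attribute-based format."""
    root = ET.fromstring(xml_text)
    return [Score(e.get("player"), int(e.get("points"))) for e in root.iter("score")]


def from_feed_b(xml_text: str) -> List[Score]:
    """Adapter for the element-based format."""
    root = ET.fromstring(xml_text)
    return [Score(e.findtext("name"), int(e.findtext("pts"))) for e in root.iter("result")]


# Downstream code only ever sees Score objects.
all_scores = from_feed_a(FEED_A) + from_feed_b(FEED_B)
print(all_scores)
```

When a supplier changes its format, only the matching adapter needs to change.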

annakata
+1  A: 

Without knowing your real requirements, it is difficult to make recommendations for transport mechanisms or styles. For instance, if you're doing pull-based syndication, HTTP can offer features that assist with caching. If you're doing push-based syndication or publish/subscribe, a protocol like XMPP could be used.
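As one example of the HTTP caching features, a polling client can send a conditional GET so that the server may answer 304 Not Modified instead of resending the whole feed. This is only a sketch: the helper name, the URL and the validator values are assumptions, and it builds the request without sending it.

```python
import urllib.request
from typing import Optional


def build_conditional_request(
    url: str,
    etag: Optional[str] = None,
    last_modified: Optional[str] = None,
) -> urllib.request.Request:
    """Build a GET request carrying the validators from the previous poll."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


# Hypothetical URL and validator values saved from an earlier response.
req = build_conditional_request(
    "http://example.com/feed.xml",
    etag='"abc123"',
    last_modified="Sat, 14 Mar 2009 14:30:00 GMT",
)
print(req.header_items())
```

If this request is sent with `urllib.request.urlopen`, an unchanged feed is reported as an `HTTPError` whose `code` is 304, which the client can treat as "nothing new".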

For your feed itself, I'd recommend sticking to a public specification such as Atom (or maybe an RSS variant if you want). Atom incorporates some of the items you mentioned such as encoding content and date formats (using UTC is easiest in most cases, then convert to a user's local time for display). By sticking to standard formats, you also allow use of feed parsers that support that spec.

Atom and RSS are flexible enough to allow you to define your own XML namespaces to add whatever elements and attributes you need. If your data produced doesn't map onto the feed/entry data model, then maybe they aren't the best fit for you.

If you are using XML, parent/child relationships (where the child has only one parent) can easily be modeled as nested parent/child elements. If the child has multiple parents, you can give elements an ID attribute and use reference attributes to link them.
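A small sketch of the reference approach for a many-to-many case. The element and attribute names (`player`, `match`, `playerRef`, `ref`) are invented for illustration and do not come from any standard. Each player is defined once, and matches point at players by id:

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: players and matches are many-to-many, so each
# match refers to players by id instead of nesting them.
FEED = """
<feed>
  <players>
    <player id="p1">Smith</player>
    <player id="p2">Jones</player>
    <player id="p3">Brown</player>
  </players>
  <matches>
    <match id="m1"><playerRef ref="p1"/><playerRef ref="p2"/></match>
    <match id="m2"><playerRef ref="p1"/><playerRef ref="p3"/></match>
  </matches>
</feed>
"""

root = ET.fromstring(FEED)

# Index the players once, then resolve every reference through the index.
players = {p.get("id"): p.text for p in root.iter("player")}
matches = {
    m.get("id"): [players[r.get("ref")] for r in m.iter("playerRef")]
    for m in root.iter("match")
}
print(matches)  # {'m1': ['Smith', 'Jones'], 'm2': ['Smith', 'Brown']}
```

A dangling reference shows up as a `KeyError` here. A schema can express the same constraint declaratively.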

David Schlosnagle
Generally, though, we have to supply structured data where there is no obvious agreed standard. So unless I am missing something, I don't think Atom/RSS would suffice, e.g. for tennis scores. There is a SportsML DTD for such things, but it seems massively verbose.
DanSingerman
Who is consuming your data? Are they other computerized services or humans via something like Google Reader? If you are the supplier, can't you define the standard used to supply the data? It seems like you need to choose to lead (define your own) or follow (use SportsML).
David Schlosnagle
A: 

I think MediaRSS is a pretty good feed schema. I like it because:

  • It is flexible enough to contain almost any type of content.
  • It lets you define groups of media within the feed (useful, e.g., when you have multiple resolutions of an image, or multiple formats).
  • It defines pretty much all the basic metadata common to all types of media, but doesn't require all of them. I haven't run into any media I wanted to put into a feed it couldn't represent.

One thing I would like it to have that it doesn't is a tag for arbitrary parameters that should be passed to the player of a given piece of media, but I don't think that really makes sense since the feed shouldn't have to know anything about the player. But sometimes I just have to pass params to the Flash player.

jeffamaphone
A: 

Well, quite honestly, "best practices" are not universal, so any answer will only be applicable for the particular problem that is being solved.

However, in my experience, here is a list of general XML and protocol design elements.

  • Avoid FTP/SFTP whenever possible, because of reliability problems and because implementations, especially of SFTP, are not universally available. Also, most firewalls will allow port 80, but you can run into blocked ports for FTP/SFTP.
  • Implement a schema with a namespace that has a version or date in it. For example, http://yourcompany.com/xml/myfeed/2009/03. That conveys information about when the schema was revised and also indicates a version number, which is useful for clients.
  • If your feed is publicly exposed, consider implementing various RDF tags for your data. Your data will then become part of the Semantic Web.
  • If your content supports it, use RSS or Atom, because there are tons of clients out there that understand those formats already, so it dramatically increases your usability.
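For the versioned-namespace point, a consumer might check the root element's namespace before processing anything else. This is a sketch only: the namespace is the example URL from the list above, while the element names, the `SUPPORTED` set and the helper are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example above; element names are hypothetical.
SUPPORTED = {"http://yourcompany.com/xml/myfeed/2009/03"}
FEED = '<feed xmlns="http://yourcompany.com/xml/myfeed/2009/03"><item>1</item></feed>'


def feed_namespace(xml_text: str) -> str:
    """Return the namespace URI of the root element, or '' if it has none."""
    tag = ET.fromstring(xml_text).tag  # ElementTree spells it '{uri}localname'
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


namespace = feed_namespace(FEED)
print(namespace in SUPPORTED)  # True
```

A client that sees an unknown namespace can fail early with a clear "unsupported feed version" message rather than misreading the new structure.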