When I receive XML data (via a Twitter API call, in this instance), I imagine it's best practice to somehow validate it before I begin working with it? My app has had a lot of intractable issues lately, and I want to rule out bad XML data.

Does XML ever go "bad" somehow? Would an overloaded server like Twitter's ever spit out just half of what should come my way?

My real question is twofold: should I validate XML data before I work with it, and how would I go about doing that? (I already know the supposed structure of the XML data)

Thanks!

One last clarification before I select an answer (and thanks for your efforts): If I only need 5 predictable fields out of the static-length XML file, does something like this leave loopholes that creating an XSD overcomes?

if(!isset($xml->id, $xml->text, $xml->created_at, $xml->sender, $xml->recipient)) throw...
A: 

To answer your question:

Input validation is one of the main parts of error handling. You should always assume that you can get bad data, and then guard against it as best you can.

To validate XML, you validate it against a schema (usually kept in an XSD file).

You can infer a schema from an XML file. Microsoft has a free tool that can do this, XSD.exe (it comes with Visual Studio), or you can use a third-party tool. The downside is that you will need to update the schema if Twitter ever changes its format. Without a schema, you ensure that the XML is well-formed (usually by attempting to parse it), assume that any of the data you expect may be missing, and code defensively around that.
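As a rough sketch of the schema-less approach in PHP: parse the response to confirm it is well-formed, then check that each expected field is present before using it. The helper names `parseWellFormed` and `hasFields` and the `direct_message` root element are made up for illustration; the five field names come from the question.

```php
<?php
// Hypothetical helper: returns a SimpleXMLElement, or null if the payload
// is not well-formed XML (for example, a truncated response).
function parseWellFormed(string $raw): ?SimpleXMLElement
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
    $xml = simplexml_load_string($raw);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $xml === false ? null : $xml;
}

// Hypothetical helper: true only if every expected child element exists
// and contains either text or nested elements.
function hasFields(SimpleXMLElement $xml, array $names): bool
{
    foreach ($names as $name) {
        if (!isset($xml->$name)) {
            return false;
        }
        $element = $xml->$name;
        if ($element->count() === 0 && trim((string) $element) === '') {
            return false;
        }
    }
    return true;
}

// Sample data (assumed shape, not Twitter's documented format).
$fields = ['id', 'text', 'created_at', 'sender', 'recipient'];
$good = '<direct_message><id>1</id><text>hi</text><created_at>now</created_at>'
      . '<sender>a</sender><recipient>b</recipient></direct_message>';
$truncated = substr($good, 0, 40); // simulates receiving only part of the response

var_dump(parseWellFormed($truncated) === null);       // truncated input fails to parse
var_dump(hasFields(parseWellFormed($good), $fields)); // complete input has all fields
```

This catches truncated responses and missing or empty fields, but it says nothing about field types or unexpected extra elements; that is what a schema adds.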

Alan
any chance you could find the XSDs used for Twitter XML data?
Jweede
They have examples here: http://apiwiki.twitter.com/Return-Values that show what it should look like. Can I generate an XSD file somehow?
Alex Mcp
+2  A: 

The most obvious method of validating your XML would be:

  1. Attempt to load the XML into your favourite DOM container or parse it using some other mechanism (I'm not completely familiar with XML processing in PHP). This lets you check whether the XML is 'well formed'. If it is not (e.g. you only got half the XML response back), you'd catch the problem at this point and deal with it.

  2. Once you've successfully loaded/parsed the XML the next thing is to validate it against an XML schema. Unfortunately Twitter don't publish XML schemas for their XML so you'd need to roll these yourself.
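The two steps above might look roughly like this in PHP, using `DOMDocument`. The schema is hand-written for illustration only: the `direct_message` root element and the element types are assumptions, not Twitter's actual format, and `checkMessage` is a hypothetical helper name.

```php
<?php
// Hand-rolled schema. Element names follow the five fields in the question;
// the root element name and the types are assumptions.
const MESSAGE_XSD = <<<'XSD'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="direct_message">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="id" type="xs:positiveInteger"/>
        <xs:element name="text" type="xs:string"/>
        <xs:element name="created_at" type="xs:string"/>
        <xs:element name="sender" type="xs:string"/>
        <xs:element name="recipient" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
XSD;

// Hypothetical helper: returns 'ok', 'malformed' (step 1 failed),
// or 'invalid' (step 2 failed).
function checkMessage(string $raw, string $xsd): string
{
    if ($raw === '') {
        return 'malformed'; // loadXML() rejects empty input
    }
    $previous = libxml_use_internal_errors(true); // suppress PHP warnings
    $doc = new DOMDocument();
    if (!$doc->loadXML($raw)) {
        $result = 'malformed';          // step 1: not well-formed
    } elseif (!$doc->schemaValidateSource($xsd)) {
        $result = 'invalid';            // step 2: well-formed but fails the schema
    } else {
        $result = 'ok';
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $result;
}

// Sample data (assumed shape).
$good = '<direct_message><id>1</id><text>hi</text><created_at>now</created_at>'
      . '<sender>a</sender><recipient>b</recipient></direct_message>';

echo checkMessage($good, MESSAGE_XSD), "\n";
echo checkMessage(substr($good, 0, 40), MESSAGE_XSD), "\n";
echo checkMessage('<direct_message><id>abc</id></direct_message>', MESSAGE_XSD), "\n";
```

A real Twitter response would carry more elements than this, so the schema would need to describe (or explicitly allow) them before it could be used as-is.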

You can create your own XML schemas by hand. Here's a link that will help you get started:

XML Schema Tutorial (W3 Schools)

You can also get tools such as Altova XMLSpy that can 'infer' a schema from your XML, i.e. make a best guess at how to define the schema; you may have to tweak it after generation. There are free tools out there too, but I've only ever used XMLSpy. As Alan says, if Twitter ever change the format of their XML you would need to update your schemas to take account of these changes.

Creating XML Schemas can be daunting at first but once you get the hang of it you'll find it quite easy. I found this book to be excellent when I first started out:

XML Schema - The W3C's Object-Oriented Descriptions for XML (O'Reilly Press)

Kev
Can you elaborate on rolling your own schemas? I'm not sure how to start something like this...
Alex Mcp
A: 

It's unfortunate that Twitter is publishing an XML API but not publishing schemas.

The advantage of writing your own schema is that you can code your program to process only messages that are valid according to your schema. Then, if Twitter changes their API, or an undocumented feature emits a message format you're not expecting, or you've misunderstood their documentation, you'll get a validation error straight away instead of having to dig around in your program to find out why it's malfunctioning. You won't necessarily know why the message is in a form you weren't expecting, but at least you'll know that's what the problem is.
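To illustrate the "validation error straight away" point, here is a minimal PHP sketch that collects libxml's messages when a message does not match the schema. The two-element schema, the `message` root element, and the renamed `body` element are invented for the example, and `validationErrors` is a hypothetical helper name.

```php
<?php
// Tiny illustrative schema: a <message> containing <id> then <text>.
const TINY_XSD = <<<'XSD'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="message">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="id" type="xs:positiveInteger"/>
        <xs:element name="text" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
XSD;

// Hypothetical helper: returns a list of human-readable problems,
// or an empty array if the message parses and validates.
function validationErrors(string $raw, string $xsd): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();
    $doc = new DOMDocument();
    $ok = $raw !== '' && $doc->loadXML($raw) && $doc->schemaValidateSource($xsd);
    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$ok && $messages === []) {
        $messages[] = 'empty or unreadable response';
    }
    return $messages;
}

// Simulates the provider renaming <text> to <body> without notice.
$changed = '<message><id>1</id><body>hi</body></message>';
foreach (validationErrors($changed, TINY_XSD) as $line) {
    echo $line, "\n"; // log it, then skip or retry the message
}
```

Logging these messages alongside the raw response gives you a concrete starting point when the format shifts, rather than a failure somewhere deep in the application.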

Robert Rossney