In the years that I've been at my place of employment, I've noticed a distinct trend towards something that I consider an anti-pattern: Maintaining internal data as big strings of XML. I've seen this done a number of different ways, though the two worst offenders were quite similar.

The Webservice

The first application, a web service, provides access to a potentially high volume of data in a SQL database. At startup it pulls more or less all of that data out of the database and stores it in memory as XML. (Three times.) The owners of this application call it a cache. I call it slow, because every perf problem we've run into while working against this service has been directly traceable to that cache. (It being a corporate environment, there should be no surprise that the client gets blamed for the perf failures, not the service.) This application does use the XML DOM.

The Importer

The second application reads an XML file that was generated as the result of an export from a third-party database. The goal is to import this data into a proprietary system (owned by us). The application that does this reads the entire XML file in and maintains at least two, and sometimes as many as four, copies of it throughout the entire import sequence. Note that the data can be manipulated and transformed, and configuration can occur, before the import takes place, so the importer holds this data in an XML format for its entire lifetime. Unsurprisingly, this importer then explodes when a moderately sized XML file is provided. This application only uses the XML DOM for one of its copies; the rest are all raw XML strings.

Common sense suggests to me that XML is not a good format for holding data in memory; rather, data should be translated into XML when it is being output or transferred, and translated into internal data structures when it is being read in and imported. The thing is, I keep running into production code that completely ignores the scalability issues and goes through a ton of extra effort to keep the data as XML. (The sheer volume of string parsing in these applications is frightening.)
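For what it's worth, here is a minimal C# sketch of the kind of boundary I mean: parse the inbound XML exactly once into plain objects and then forget the XML. The class and element names (ImportPayload, Customer) are invented for illustration.

    using System.Collections.Generic;
    using System.IO;
    using System.Xml.Serialization;

    // Hypothetical shape of one imported record.
    public class Customer
    {
        [XmlAttribute("id")]
        public int Id { get; set; }

        [XmlAttribute("name")]
        public string Name { get; set; }
    }

    [XmlRoot("Customers")]
    public class ImportPayload
    {
        [XmlElement("Customer")]
        public List<Customer> Customers { get; set; }
    }

    public static class Importer
    {
        // The XML lives only for the duration of this call; everything
        // afterwards works with ordinary objects (or pushes them into a
        // database) rather than raw XML strings.
        public static ImportPayload Load(string path)
        {
            var serializer = new XmlSerializer(typeof(ImportPayload));
            using (var stream = File.OpenRead(path))
            {
                return (ImportPayload)serializer.Deserialize(stream);
            }
        }
    }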

Is this a common failure to apply the right tool for the job that other people run into as well? Or is it just bad luck on my part? Or am I missing some blindingly obvious, good situations where it's right and OK to store high volumes of data in-memory as XML?

+2  A: 

No, I agree. For your first example, the database should handle almost all the caching, so storing all the data in program memory is wrong. This applies whether it's stored in-memory as XML or otherwise.

For the second, you should convert the XML into a useful representation as soon as possible, probably a database, then work with it that way. Only if it's a small amount of data would it be appropriate to do all the work in-memory as an XmlDocument (e.g. using XPath). String parsing should be used very sparingly.
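A minimal illustration of that small-data XmlDocument/XPath case (the document shape here is made up):

    using System;
    using System.Xml;

    class SmallDocDemo
    {
        static void Main()
        {
            var doc = new XmlDocument();
            doc.LoadXml("<orders><order id='1' total='9.99'/><order id='2' total='20.00'/></orders>");

            // XPath keeps the lookup declarative; no hand-rolled string parsing.
            foreach (XmlNode order in doc.SelectNodes("/orders/order[@total > 10]"))
                Console.WriteLine(order.Attributes["id"].Value);
        }
    }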

Matthew Flaschen
+4  A: 

Any data stored in memory should be in classes. The higher the volume of data we're talking about, the more important this becomes. XML is a hugely bloated format that reduces performance; it should be used only for transferring data between applications. IMHO.

Nate Bross
A: 

I agree as well, and I do think there is an element of bad luck.

...but grasping at straws, the only use I could see for data being stored as XML is for automated unit tests, where XML provides an easy way to mock up test data. Definitely not worth it, though.

akf
A: 

I've found that I've had to do it to interact with a legacy COM object. The COM object could take either XML or a class. The interop overhead to fill each member of the class was way too large, and processing XML was a much faster alternative. We could have made a C# class identical to the COM class, but it was really too difficult to do in our timeframe. So XML it was. Not that it would ever be a good design decision, but when dealing with interop for huge data structures, it was the fastest approach available to us.

I do have to say that we are using LINQ to XML on the C# side, which makes it slightly easier to work with.
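A small sketch of that LINQ to XML style, with invented element names:

    using System;
    using System.Linq;
    using System.Xml.Linq;

    class LinqToXmlDemo
    {
        static void Main()
        {
            var doc = XDocument.Parse(
                "<people><person name='Ann' age='34'/><person name='Bo' age='19'/></people>");

            // Query with LINQ instead of walking DOM nodes by hand.
            var adults = from p in doc.Root.Elements("person")
                         where (int)p.Attribute("age") >= 21
                         select (string)p.Attribute("name");

            Console.WriteLine(string.Join(", ", adults));
        }
    }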

Steve
+1  A: 

@Matthew Flaschen makes a great point. I would like to add that when you join any existing project, you are likely to find some design and implementation decisions that you disagree with.

We all learn new things all the time and we all make mistakes. Though I agree that this seems like a "duh" kind of problem, I'm sure the other developers were trying to optimize the code through the concept of a cache.

The point is, sometimes it takes a gentle approach to convince people, especially developers, to change their ways. This isn't a coding problem, but a people problem. You need to find a way to convince these developers that the changes you are suggesting don't imply they are incompetent.

I'd suggest agreeing with them that caching can be a great idea, but that you'd like to work on it with them to speed up the functions. Create a quick demo of how your (way more logical) implementation performs compared with the old way. It's hard to argue with dramatic speed improvements. Just be careful about directly attacking their implementation in conversation. You need these people to work with you.
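One rough way to build such a demo is a stopwatch comparison between looking things up in the XML "cache" via XPath and looking them up in a dictionary of parsed values; the sizes and element names below are arbitrary:

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Text;
    using System.Xml;

    class CacheDemo
    {
        static void Main()
        {
            const int count = 50000;
            var sb = new StringBuilder("<items>");
            for (int i = 0; i < count; i++)
                sb.AppendFormat("<item id='{0}' value='v{0}'/>", i);
            sb.Append("</items>");

            // The "cache as XML" approach: keep everything in a DOM.
            var doc = new XmlDocument();
            doc.LoadXml(sb.ToString());

            // The alternative: parse once into a dictionary of plain values.
            var dict = new Dictionary<int, string>();
            foreach (XmlNode n in doc.SelectNodes("/items/item"))
                dict[int.Parse(n.Attributes["id"].Value)] = n.Attributes["value"].Value;

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 1000; i++)
                doc.SelectSingleNode("/items/item[@id='" + i + "']");
            Console.WriteLine("XPath lookups:      " + sw.Elapsed);

            sw = Stopwatch.StartNew();
            for (int i = 0; i < 1000; i++)
            {
                string unused = dict[i];
            }
            Console.WriteLine("Dictionary lookups: " + sw.Elapsed);
        }
    }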

Good luck!

Michael La Voie
I don't question the intent of the "optimization," only the competence in its implementation.
Greg D
If you really want to cache, wouldn't a simple memcached implementation be better? Just caching whatever output you're sending (XML?)
Osama ALASSIRY
A: 

What about OOP and databases? XML has its uses, but there can be issues (as you are seeing) with using it for everything.

Databases allow for indexing, transactions, etc., which will speed up your data access.

Objects are in most cases easier to work with; they give a better picture of your domain, etc.

I am not against using XML, but it is like design patterns: tools we should know where and when to use, not fall in love with and try to use everywhere...

J.13.L
A: 

Greg,

in several applications I did follow more or less exactly the pattern you describe:

Edit: no, scratch that. I never stored the XML as a string (or multiple strings). I just parsed it into a DOM and worked with that. THAT was helpful.

I've imported XML sources into the DOM (Microsoft Parser) and kept them there for all the required processing. I'm well aware of the memory overhead the DOM causes, but I found the approach quite useful nonetheless; a rough C# sketch of the same workflow follows the list below.

  • Some checks during processing need random access to the data. The selectPath statement works quite well for this purpose.

  • DOM nodes can be handed back and forth in the application as arguments. The alternative is writing classes wrapping every single type of object and updating them as the XML schema evolves. It's a poor man's (VB6/VBA) approach to polymorphism.

  • Applying an XSLT transformation to all or parts of the DOM is a snap

  • File I/O is taken care of by the DOM too (xmldoc.save...)
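Here is a rough C# analogue of that workflow (the original was VB6/VBA against MSXML); "input.xml" and "transform.xslt" are placeholder file names:

    using System;
    using System.Xml;
    using System.Xml.Xsl;

    class DomWorkflow
    {
        static void Main()
        {
            // Load once into the DOM and keep it there for processing.
            var doc = new XmlDocument();
            doc.Load("input.xml");

            // Random access during processing via XPath.
            XmlNode first = doc.SelectSingleNode("//record[1]");
            if (first != null)
                Console.WriteLine(first.OuterXml);

            // Apply a stylesheet to the whole document.
            var xslt = new XslCompiledTransform();
            xslt.Load("transform.xslt");
            using (var writer = XmlWriter.Create("output.xml"))
                xslt.Transform(doc, writer);

            // File I/O handled by the DOM itself.
            doc.Save("processed.xml");
        }
    }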

A linked list of objects would consume a comparable amount of memory and require more code. All the search and I/O functionality I would have to code myself.

What I've perceived as the anti-pattern is actually an older version of the application, where the XML was parsed more or less manually into arrays of structures.

A: 

For high volumes of data the answer is no, there aren't good reasons to store data directly as XML strings in memory.

However, here is an interesting presentation by Alex Brown on how to preserve XML in memory in a more efficient way, as a 'frozen stream'.

There is also a video of this, and other presentations given at XML Prague 2009 here.


pgfearo