ansaurus

Question

Which XML structure allows faster Add/Del/Update?

Answer 1

A:

I doubt very much that you'd see a difference. XML parsing is very fast.

You'd have to test with hundreds of thousands, if not millions of records to measure the difference, which I think would be tiny.

Dave Swersky 2010-08-22 16:11:28

Since when is XML parsing very fast? It is commonly considered very slow (the in-memory representation is four times its size, Unicode support is mandatory, and there is overhead compared to the actual data) in comparison to more traditional means, which basically prevented XML from becoming too widely adopted in its early days. Streaming XML parsing is, however, relatively fast, but still no match for any binary, compressed, optimized, specific format. (Not to say that I don't love XML, because I do ;)

Abel 2010-08-22 16:21:20

I guess "fast" is a relative term. Once loaded into memory, *searching* is quite fast, though not as fast as searching binary formats, to be sure.

Dave Swersky 2010-08-22 16:22:55

Answer 2

+2 A:

It doesn't matter.

You have to read the entire file and parse it into a document structure, do the updates, then write the entire file. Updating the object structure is so little work compared to the file I/O that the structure doesn't matter.
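A minimal sketch of that read, update, write cycle, using Python's standard `xml.etree.ElementTree` as a stand-in for whatever library you use. The `Department` element and its `Id` / `IsVisible` attributes are assumptions based on the question, and `set_visibility` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

def set_visibility(path, dept_id, visible):
    """Load the whole file, update matching departments, write the whole file back."""
    tree = ET.parse(path)  # reads and parses the entire document
    updated = 0
    for dept in tree.getroot().iter("Department"):
        if dept.get("Id") == dept_id:
            dept.set("IsVisible", "true" if visible else "false")
            updated += 1
    # The complete document is serialized again, however small the change was.
    tree.write(path, encoding="utf-8", xml_declaration=True)
    return updated
```

Only the loop in the middle depends on the structure; the parse and the write touch every byte of the file either way.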

Guffa 2010-08-22 16:13:07

+1, good point on mentioning file I/O as the likelier performance bottleneck (which means it does matter: the smaller the structure, the less file I/O, and attributes are, by default, quite a bit smaller than elements)
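To make the size difference concrete, here is a rough sketch that serializes one record both ways and compares byte counts. The element and attribute names are assumptions taken from the question, and the exact numbers will depend on your serializer:

```python
import xml.etree.ElementTree as ET

def attribute_form(dept_id, visible):
    # <Department Id="..." IsVisible="..." />
    return ET.Element("Department", {"Id": dept_id, "IsVisible": visible})

def element_form(dept_id, visible):
    # <Department><Id>...</Id><IsVisible>...</IsVisible></Department>
    dept = ET.Element("Department")
    ET.SubElement(dept, "Id").text = dept_id
    ET.SubElement(dept, "IsVisible").text = visible
    return dept

attr_bytes = len(ET.tostring(attribute_form("1", "true")))
elem_bytes = len(ET.tostring(element_form("1", "true")))

if __name__ == "__main__":
    print("attribute form:", attr_bytes, "bytes")
    print("element form:  ", elem_bytes, "bytes")
```

The element form repeats every name in an opening and a closing tag, which is where the extra bytes come from.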

Abel 2010-08-22 16:16:32

@Abel: Yes, a smaller structure would give you less data to write, which would make it faster if you have a huge file. However, if you have so much data that it's really a performance problem, XML might not be the best solution...

Guffa 2010-08-22 16:20:25

Answer 3

A:

The only way to find out which one is faster is to create some sample queries and run them a bunch of times while profiling and averaging. I doubt you'll find a difference.
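A rough sketch of such a measurement, written in Python with the standard `timeit` and `xml.etree.ElementTree` modules. The two sample layouts, the record count and the helper names are all assumptions for illustration, not the structures from the question verbatim:

```python
import timeit
import xml.etree.ElementTree as ET

N = 1000  # number of sample departments (arbitrary)

ATTR_XML = "<Departments>" + "".join(
    f'<Department Id="{i}" IsVisible="true"/>' for i in range(N)
) + "</Departments>"

ELEM_XML = "<Departments>" + "".join(
    f"<Department><Id>{i}</Id><IsVisible>true</IsVisible></Department>"
    for i in range(N)
) + "</Departments>"

def update_attr(xml_text, dept_id):
    """Parse, hide one department stored as attributes, serialize again."""
    root = ET.fromstring(xml_text)
    for dept in root.iter("Department"):
        if dept.get("Id") == dept_id:
            dept.set("IsVisible", "false")
    return ET.tostring(root)

def update_elem(xml_text, dept_id):
    """Parse, hide one department stored as child elements, serialize again."""
    root = ET.fromstring(xml_text)
    for dept in root.iter("Department"):
        if dept.findtext("Id") == dept_id:
            dept.find("IsVisible").text = "false"
    return ET.tostring(root)

def average_seconds(func, xml_text, runs=20):
    """Average wall-clock time of one parse/update/serialize cycle."""
    total = timeit.timeit(lambda: func(xml_text, "500"), number=runs)
    return total / runs

if __name__ == "__main__":
    print("attributes:", average_seconds(update_attr, ATTR_XML))
    print("elements:  ", average_seconds(update_elem, ELEM_XML))
```

Run it several times and with realistic data sizes before drawing conclusions; a single run on a small document mostly measures noise.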

I would go with whichever schema is more expressive and meets your requirements. To me that's the first one, since I doubt you'd ever want more than one Id or IsVisible type.

Christopher Painter 2010-08-22 16:13:34

Answer 4

A:

It would depend on what you were using to do this adding, updating and deleting. All things being equal, I would expect the first one to be faster, but by a truly negligible amount. I would also not be even slightly amazed if I found that there were some libraries that worked faster with the second (due to differences in in-memory model representations, which are completely implementation-defined).

Assuming there will only be one id and one isVisible on each department, I'd go for the first (with the bug of the unquoted attribute fixed), as it helps to restrict the format in itself and is a clear fit. I wouldn't be upset at having to use the latter though.

Jon Hanna 2010-08-22 16:18:58

Answer 5

A:

In general

In general terms I tend to agree with the other answers here, but I'd like to add a few remarks. Performance is normally most hindered by its slowest factor, which is the network, the database connection, the file system or even the internal memory when I/O is part of the issue. If we take that as a given, a possible conclusion is that the smaller the size, the bigger the performance improvement is.

Other factors

But there's another factor. Attributes and elements are implemented differently. Attributes are implemented something like key/value pairs with a uniqueness constraint and roughly take the size of chars * 2 + sizeof(int). Elements require a much larger structure in-memory and for the sake of brevity, I like to use one simple factor that's some average between several implementations: 3.5 * chars. I use chars here, because whether you store it as UTF8 or as UTF16 makes a storage difference, but not an in-memory difference.

The previous paragraph implies that attributes are faster. But this isn't a simple fact, because attributes are not implemented as normal nodes and searching for their data is generally slower than searching for data in nodes. This is hard to measure in general terms and requires profiling for every particular situation to find out.

LINQ

Then there's LINQ. If you use LINQ, reading and writing is done with streaming XML which is relatively fast. The in-memory representation is usually much smaller and much faster than with XmlDocument parsing.
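This is not LINQ, but the same streaming versus in-memory trade-off can be sketched with Python's standard library, where `iterparse` plays the streaming role and `parse` builds the full tree. The `Department` / `IsVisible` names are assumptions from the question, and both function names are hypothetical:

```python
import io
import xml.etree.ElementTree as ET

def count_visible_streaming(source):
    """Count visible departments, releasing each element once it is processed."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Department":
            if elem.get("IsVisible") == "true":
                count += 1
            elem.clear()  # drop the element's attributes and children
    return count

def count_visible_dom(source):
    """Same count, but with the whole document held in memory first."""
    root = ET.parse(source).getroot()
    return sum(1 for d in root.iter("Department") if d.get("IsVisible") == "true")

if __name__ == "__main__":
    data = b'<Departments><Department Id="1" IsVisible="true"/></Departments>'
    print(count_visible_streaming(io.BytesIO(data)), count_visible_dom(io.BytesIO(data)))
```

Both functions return the same answer; the difference is how much of the document has to live in memory while they compute it.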

Names

The size of the names of the fields, such as elements and attributes, does not matter for the in-memory footprint. Internally they are keyed and given a unique ID. The contents of the elements and the attributes, however, will add to the overall memory footprint.

If the names are very large compared to their content, shortening the names will make your XML less readable, but it will also require less I/O or network bandwidth. As such, in some cases, it may improve performance to use small names.

UTF-8 or UTF-16

Finally, I should add a note on the way you store it. Common sense says to store it as UTF-8. But that requires the parser to read each character and transform it in memory to UTF-16. This costs time. Sometimes a larger file (using UTF-16) can outperform a smaller one (using UTF-8) because the processor overhead of the conversion is too big. Again, measuring your performance in several scenarios can help. Oh, and if you use a lot of (very) high characters, UTF-16 may be the better choice, because UTF-8 may use 3 or 4 bytes per character.
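A quick way to see the storage side of this is to encode a few sample strings both ways and compare byte counts. The sample strings are arbitrary, and this sketch says nothing about the conversion cost, which you would still have to measure:

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for a string, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

SAMPLES = {
    "ascii": "Department",     # 10 ASCII characters
    "cjk": "\u90e8\u9580",     # 2 characters from the Basic Multilingual Plane
    "emoji": "\U0001F600",     # 1 character outside the Basic Multilingual Plane
}

if __name__ == "__main__":
    for name, text in SAMPLES.items():
        utf8, utf16 = encoded_sizes(text)
        print(f"{name}: utf-8 = {utf8} bytes, utf-16 = {utf16} bytes")
    # ascii: utf-8 = 10 bytes, utf-16 = 20 bytes
    # cjk:   utf-8 = 6 bytes,  utf-16 = 4 bytes
    # emoji: utf-8 = 4 bytes,  utf-16 = 4 bytes
```

For mostly ASCII markup and data, UTF-8 is about half the size; the balance only tips towards UTF-16 when most of the content is text like the second sample.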

Summary

To sum it up, if speed is imperative and you cannot resort to a binary format:

Prefer attributes over elements, but only if DOM use is anticipated and searching / keying is not too important;
Prefer UTF-8 over UTF-16 only when the files are very large and you use few (very) high characters; measure to find out;
Prefer streaming over DOM for all your uses (LINQ typically uses streaming);
Don't bother using small names unless your I/O is really a bottleneck and the ratio of name overhead to data is very large;
Define a few typical usage scenarios and measure.

PS: the above is what comes to mind when thinking about XML. There may, of course, be many other factors that improve or degrade performance, the largest perhaps being your own skill in writing the best procedures for your CRUD operations.

Abel 2010-08-22 16:44:31
