A long time ago Joel explained how various every-day coding things were slow, and how that in turn made XML slow as a data store: Back to Basics

Are those every-day coding things - strcat and malloc - still slow in a std::string and dlmalloc world? What else has changed in modern processors and mainstream frameworks?

And is XML still slow? You can't find an RDBMS that doesn't claim some kind of native XML support these days; haven't they got it faster - a single pass to index it for example - yet?


I think the main point of Joel's article was that there is a danger when programmers don't understand what is happening under the hood. Nobody really disputes that point; I'm mostly curious what people think of the examples he gave to support the argument, and what examples ought to be used in their place today.

+2  A: 

How slow is slow? XML per se is neither slow nor fast; it's the program processing it that is slow or fast. Of course you could index an XML file in a single pass, but then you could also read it and store it in a proper database along the way. Either way, you have to do a single-pass scan of the full file, which is more expensive than starting from an already indexed file.

ammoQ
+1  A: 

Well, the answer is, slow compared to what?

XML, whether natively supported or not, is text, and text processing will be slower than a custom binary format, because the strings have to be parsed and recognized. A value that fits in a single byte in binary may take a string, or several strings, in XML.
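A minimal sketch of that size and parsing difference, assuming a hypothetical `<level>` element that holds one small integer (the element name and the helper functions are made up for illustration, and the decoder assumes well-formed input):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Binary encoding: the value is the byte, nothing to parse.
std::vector<std::uint8_t> encode_binary(std::uint8_t level) {
    return {level};
}

// Text encoding: the same value wrapped in a hypothetical <level> element.
std::string encode_xml(std::uint8_t level) {
    return "<level>" + std::to_string(static_cast<int>(level)) + "</level>";
}

// Reading the text form back means finding the end of the opening tag
// and converting decimal digits to a number.
int decode_xml(const std::string& xml) {
    const std::size_t gt = xml.find('>');
    assert(gt != std::string::npos);        // sketch assumes well-formed input
    return std::stoi(xml.substr(gt + 1));   // stoi stops at the next '<'
}
```

For the value 200, the binary form is one byte while the text form is 18 characters, and decoding the text form needs a search plus a digit conversion instead of a single byte read.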

XML is convenient, human readable, portable, which are strong advantages. And for many cases, XML will be the right answer.

But for many other cases, you may want something like protocol buffers, to store things in a binary format, where speed is of the essence.

Rob Lachlan
+1 Text will always be slower than binary.
KMan
+1  A: 

For some benchmark numbers (remember, though, that benchmarks often don't mirror real-life performance) see http://stackoverflow.com/questions/296650/performance-comparison-of-thrift-protocol-buffers-json-ejb-other. "Java" is the native built-in Java XML serialization.

The answer will depend on what you're using the XML for. For example, I'm currently working on an application where the plaintext messages are around 5k in size, but after expanding to XML they wind up around 70k. The bloat in size is more of a problem for me than serialization or deserialization speed.

Jimmy
+1  A: 

Yes, std::string is faster than C strings with strcat and strlen, and an RDBMS is faster than a giant XML file. But that wasn't the point of Joel's article. Personally, I don't really know what the point of that article is (something about how beginner programmers, who have only ever used high-level languages, don't understand how CPUs work "under the hood"), but it's really not about how "XML is slow" or whatever...

Dean Harding
The part about XML as a data store being slow is at the end of the article, and even if it wasn't the main point of the article, it was the thing I wanted to pick over.
Will
Isn't the reason std::string is faster simply because it stores the length in addition to the characters?
phkahler
@phkahler: that's correct, yes.
Dean Harding
how does that make it faster?
Cheeso
@Cheeso: did you read the article? It specifically talks about doing repeated `strcat`s. With the length stored as well, there is no rescan to find the end of the string, so repeated appends are O(n) overall rather than O(n²).
Dean Harding
Oh, ok, I understand now. (I hadn't read it.)
Cheeso
+2  A: 

I don't know how slow XML is, but it sure is very big. Anything that deals with a large text file takes a lot of time to parse (it could be faster with fixed-width columns).

As a point of reference for how big XML is, see Hanselman's reaction to an XML-to-database conversion:

If you care, that 3 gigs becomes a 250 meg SQL Server Database. Darn you XML! ;)

I just don't know why he removed that phrase later on :-)

Here's the cache:

http://webcache.googleusercontent.com/search?q=cache:rrp2JBVq-bkJ:www.hanselman.com/blog/CommentView.aspx%3Fguid%3Dd7e873d4-8e68-4d8d-883c-093ed2acc791+%22darn+you+xml%22+hanselman&cd=1&hl=tl&ct=clnk&gl=ph

Michael Buen
Hey, but XML shrinks dramatically with compression. Just add more complexity and slowness and you get that space back!
phkahler
+1  A: 

Comparing XML to an RDBMS is, in my opinion, like comparing a filing cabinet with a railroad network.

XML is great in that it is human readable and easily extensible.

It works at a small scale of complexity, but no matter how hard I try, I really fail to see any benefits of XML for structures that model several hundred entities (which some people would call only a mid-sized data store).

Still, the most painful of all XML traits, IMO, is that it brings us back to deciding how to physically store data and how to retrieve it, erasing the difference between logical and physical design. We have to deal with indexing, locking, integrity, and parsing. And the choices you make there affect your applications: either any optimization or change requires a lot of rewriting and data reshaping, OR performance is killed with a vengeance.

On the other hand, the SQL standard, even though it is great, has stalled for decades, and I am only sorry to see that the big players don't recognize that it is not the relational model that has stopped progressing, but SQL as its implementation.

Unreason
Someone once suggested to me (in 1995) that every programming language should have RDBMS functions built-in. Unfortunately it's still an external package and requires SQL. I always used the visual query builder in MS Access to get my SQL for VBA apps using Access databases. Anyway, I agree this stuff needs to be much more commonplace.
phkahler
+2  A: 

"Are those every-day coding things - strcat and malloc - still slow in a std::string and dlmalloc world?"

You might want to re-read Joel's articles, in particular the one about leaky abstractions. Concatenating a large number of strings without thinking about the actual algorithm will always be slow. Fundamentally, the question behind the scenes is: how often is each character copied? It doesn't matter whether the abstraction is called std::string or strcat; if the answer is not 1, your algorithm is needlessly slow.
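A rough sketch of that "how often is each character touched?" question, under simplifying assumptions: `strcat_style_cost` is a hypothetical helper that only counts the work a naive `strcat` loop would do (it does not call `strcat`), and `join_once` is one possible way to get each character copied once.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Counts character visits for a naive strcat loop: every call first walks
// the whole destination to find its terminating '\0', then copies the piece.
std::size_t strcat_style_cost(const std::vector<std::string>& pieces) {
    std::size_t cost = 0;
    std::size_t dest_len = 0;
    for (const std::string& p : pieces) {
        cost += dest_len;      // rescan of everything appended so far
        cost += p.size();      // copy of the new piece
        dest_len += p.size();
    }
    return cost;
}

// Joins the pieces so that each character is copied exactly once:
// the total size is computed first and the buffer is reserved up front.
std::string join_once(const std::vector<std::string>& pieces) {
    std::size_t total = 0;
    for (const std::string& p : pieces) total += p.size();

    std::string out;
    out.reserve(total);                            // no reallocation while appending
    for (const std::string& p : pieces) out += p;  // append at the known end
    assert(out.size() == total);
    return out;
}
```

With four two-character pieces, the strcat-style count is 20 character visits for an 8-character result, and the gap grows quadratically with the number of pieces. Note that appending to a `std::string` without the `reserve` call can still copy characters more than once when the buffer reallocates, which is the point about the abstraction's name not mattering.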

As for malloc, Joel is just assuming an old UNIX malloc implementation there. A look at Microsoft's sbheap.c will show those assumptions to be outdated. On many Linux implementations, malloc is dlmalloc.

And for the same fundamental reasons, XML is still slow. In an XML document, you can't predict byte offsets, and you have to use a slow algorithm to locate data.
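To illustrate the byte-offset point, here is a sketch contrasting a fixed-width layout with a deliberately naive XML-like scan. Both helper functions are hypothetical, and the scanner ignores attributes, nesting, comments and CDATA; a real XML parser does considerably more work.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Fixed-width layout: record i starts at byte i * width, so locating it
// is a multiplication, independent of how many records precede it.
std::string fixed_width_record(const std::string& data, std::size_t width, std::size_t i) {
    assert((i + 1) * width <= data.size());
    return data.substr(i * width, width);
}

// XML-like layout: element lengths vary, so the i-th <tag> element can only
// be found by scanning from the start and counting the elements passed.
// Returns an empty string if there is no such element.
std::string nth_element_text(const std::string& xml, const std::string& tag, std::size_t i) {
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";
    std::size_t pos = 0;
    for (std::size_t seen = 0;; ++seen) {
        std::size_t start = xml.find(open, pos);
        if (start == std::string::npos) return "";
        start += open.size();
        const std::size_t end = xml.find(close, start);
        if (end == std::string::npos) return "";
        if (seen == i) return xml.substr(start, end - start);
        pos = end + close.size();   // continue after this element
    }
}
```

The fixed-width lookup costs the same for any index, while the scan reads every byte before the element it wants. An index built in one pass can remove the repeated scanning, but that pass still has to read the whole document once.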

MSalters