I need recommendations on what to use in Delphi (I use Delphi 2009) to handle very large XML files (e.g. 100 MB) as fast as possible.

I need to input the XML, access and update the data in it from my program, and then export the modified XML again.

Hopefully the input and output could be done within a few seconds on a fast Windows machine.


Clarification. I expect I will need to use DOM, because access to the data structure for developing reports and making updates to the data is important, and I need this functionality to be very fast.

The input is only done once for File Loading and the output done only for File saving, usually just once upon exit. These should be quick as well, but are not as important as the in-memory data access and update.

My understanding is that 3rd party parsers only help with input and output, but not on using and modifying the data once loaded into memory. Or am I mistaken on this?

+2  A: 

I'm not a specialist, but I believe the consensus is that a SAX parser will be far more efficient than DOM...

François
+3  A: 

You might want to have a look at the DIHtmlParser component from The Delphi Inspiration. It's supposed to be "extremely fast, especially when parsing huge files", and "on modern machines the score goes up to more than 15 MB of HTML data per second". I've had some pretty good experiences with it, although I've never tried it with huge files.

onnodb
I have used this on extremely large (> 100 MB) XHTML log files without any problems.
skamradt
+5  A: 

SAX is worth considering instead of a DOM parser.

With DOM you pay the overhead of loading the whole document, but once it is loaded, data can be accessed and updated quickly.

With SAX you have to write handlers for begin-element, end-element, etc, but you have much more flexibility in what you do as you go along.

Although it probably doesn't help your situation, SAX is very useful where you are searching because you can halt the parsing at any point, so once you have found what you wanted you can stop.
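The early-halt idea can be sketched as follows. This is a minimal illustration in Python's standard `xml.sax` module rather than Delphi or MSXML, whose interfaces differ; the names `StopParsing`, `FindFirst` and `find_first` are made up for the example. The handler raises an exception as soon as the first matching element closes, and the rest of the document is never parsed.

```python
import io
import xml.sax


class StopParsing(Exception):
    """Hypothetical marker exception: raised to abandon parsing early."""


class FindFirst(xml.sax.ContentHandler):
    """Collects the text of the first `target` element, then stops."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.inside = False
        self.text = []
        self.seen = 0  # start-element events received before stopping

    def startElement(self, name, attrs):
        self.seen += 1
        if name == self.target:
            self.inside = True

    def characters(self, content):
        if self.inside:
            self.text.append(content)

    def endElement(self, name):
        if name == self.target:
            raise StopParsing  # found it: skip the rest of the document


def find_first(xml_bytes, target):
    handler = FindFirst(target)
    try:
        xml.sax.parse(io.BytesIO(xml_bytes), handler)
    except StopParsing:
        pass  # expected: the handler asked to stop
    return "".join(handler.text), handler.seen


doc = b"<log><entry>first</entry><entry>second</entry><tail/></log>"
value, events_seen = find_first(doc, "entry")
print(value, events_seen)
```

Only the `log` element and the first `entry` element are reported to the handler; the second `entry` and `tail` are never visited.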

If your program does not need to have parsed all the data before it knows what changes to make, you could write SAX handlers that just update the data as it is read and otherwise pass it through, so it would stream the data rather than load it all into any sort of memory structure. This would make the solution very scalable, as you won't hit memory constraints with very large files.

For what it's worth, I tend to use the MSXML DOM and SAX parsers. It can be argued that they are not the best performing, but I would argue that there are probably more people working on improving them, so they will get better and better.
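A pass-through filter of this kind might look like the sketch below. It uses Python's standard `xml.sax.saxutils` classes for illustration only; a Delphi version on top of MSXML SAX would use different interfaces. `UpdateElementText` and `rewrite` are hypothetical names, and the sketch assumes the target elements contain only text (no child elements).

```python
import io
from xml.sax import make_parser
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class UpdateElementText(XMLFilterBase):
    """Forwards every event unchanged, except the text of `target` elements."""

    def __init__(self, parent, target, transform):
        super().__init__(parent)
        self.target = target
        self.transform = transform
        self.inside = False
        self.buffer = []

    def startElement(self, name, attrs):
        if name == self.target:
            self.inside = True
        super().startElement(name, attrs)

    def characters(self, content):
        if self.inside:
            self.buffer.append(content)  # hold text until the element closes
        else:
            super().characters(content)

    def endElement(self, name):
        if name == self.target:
            # Emit the transformed text just before the closing tag.
            super().characters(self.transform("".join(self.buffer)))
            self.buffer.clear()
            self.inside = False
        super().endElement(name)


def rewrite(src, dst, target, transform):
    """Stream XML from src to dst, applying transform to target text."""
    flt = UpdateElementText(make_parser(), target, transform)
    flt.setContentHandler(XMLGenerator(dst, encoding="utf-8"))
    flt.parse(src)


src = io.BytesIO(b"<items><price>10</price><name>x</name><price>20</price></items>")
dst = io.StringIO()
rewrite(src, dst, "price", lambda s: str(int(s) * 2))
print(dst.getvalue())
```

Nothing is kept in memory beyond the text of the element currently being rewritten, which is what makes the approach scale to very large files.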

Richard A
+2  A: 

I'm very satisfied with NativeXML from SimDesign. It also includes a special version called FastXML, which I haven't tested yet, but it is said to be, well, fast.

Uwe Raabe
+6  A: 

If I understood your question correctly, you have a known data structure and you are modifying the data, not the XML structure of the file.

Under these conditions, and if performance is crucial, you could try direct text manipulation and skip XML parsing altogether.

Read from a stream, use a fast text search algorithm (e.g. Boyer-Moore) to find the places where you need to modify data, make your modifications, and write the result to another stream.

This would be one-pass, no XML parsing, no in-memory XML tree building.
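As a rough sketch of that idea, the Python below uses the Horspool simplification of Boyer-Moore to locate tags in the raw bytes and rewrites the text between them. It is an illustration, not the code that was eventually used. The function names are hypothetical, and it assumes the tag has no attributes, that the tag text never appears inside comments or CDATA, and that the input fits in memory (a truly streaming version would have to carry an overlap between chunks).

```python
import io


def horspool_find(haystack, needle, start=0):
    """Boyer-Moore-Horspool search; returns the index of needle or -1."""
    m, n = len(needle), len(haystack)
    if m == 0:
        raise ValueError("needle must not be empty")
    # For each byte except the last, the distance from its last
    # occurrence to the end of the needle.
    shift = {b: m - 1 - i for i, b in enumerate(needle[:-1])}
    pos = start
    while pos + m <= n:
        if haystack[pos:pos + m] == needle:
            return pos
        # Skip ahead based on the byte aligned with the needle's last position.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1


def replace_element_text(src, dst, tag, transform):
    """Copy src to dst, rewriting the text of every <tag>...</tag> pair."""
    data = src.read()
    open_tag = b"<" + tag + b">"
    close_tag = b"</" + tag + b">"
    pos = 0
    while True:
        start = horspool_find(data, open_tag, pos)
        if start < 0:
            break
        body = start + len(open_tag)
        end = horspool_find(data, close_tag, body)
        if end < 0:
            break
        dst.write(data[pos:body])             # everything up to the value
        dst.write(transform(data[body:end]))  # the modified value
        pos = end
    dst.write(data[pos:])                     # remainder, copied verbatim


src = io.BytesIO(b"<items><price>10</price><name>x</name><price>20</price></items>")
dst = io.BytesIO()
replace_element_text(src, dst, b"price", lambda b: str(int(b) * 2).encode())
print(dst.getvalue().decode())
```

Everything outside the matched values is copied byte for byte, so the output keeps the original formatting exactly.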

zendar
Actually, when the program starts I want to input the data into an in-memory data structure. Then while it is running I will be accessing that data many times over for various operations including allowing the user to update the data. On closing, the user will probably want to save his updates.
lkessler
... but I ended up using the direct text manipulation that you recommended, which definitely is as fast as you can get. So I'm giving you the accepted answer.
lkessler
+1  A: 

If you ever consider the event-driven SAX approach, the XML Parser library might come in quite handy.

utku_karatas
A: 

Another possibility I just discovered is the LMD ElPack package that I purchased. It includes an XML support library which they say "is extremely fast, fully unicode-enabled and adds only a small footprint to your Exe-files".

Looking at the source of their LMDXML.pas unit included in the LMD 7 package (for Delphi 2009), it says the code is based on SimpleXML Release 8.0 (July 2006) code by Michail Vlasov.

lkessler
A: 

Check out Fast Infoset, a standard for compressed XML messages. However, it looks like a Delphi version is not yet available:

http://stackoverflow.com/questions/834104/is-there-a-fast-infoset-library-for-delphi

mjustin
A: 

If you only need direct manipulation, I would agree with the answer by zendar.

As for a DOM or SAX implementation, I would recommend DIXml.

ErvinS