I need to write an application that fetches element name/value pairs (time-series data) from any XML source, be it a file, a web server, or any other server. The application would consume the XML and extract the values of interest. It has to be very fast (say 50,000 events/second or more). The XML documents would be huge, and they could arrive frequently as well (e.g., 2500 files/min, with more than 500 MB of XML data per file).

I just want to see how you experienced people think I should approach this. I am a novice who has just got started, although I can implement any solution you suggest, no matter how hard or easy.

Thank you very much.

+4  A: 

If you use SAX parsing, your bottleneck is the I/O involved, not the XML string processing. And given your 500 MB figure, I'd say you have to use SAX parsing rather than DOM parsing, since DOM builds the whole document in memory. So anything with a SAX-type interface should be just fine.
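
To make the "SAX-type interface" idea concrete, here is a minimal sketch in plain standard C++. It is a toy, not a real XML parser: it only understands `<name>`, `</name>` and plain text (no attributes, comments, CDATA, entities or error checking), and the names `Handlers`, `scan` and `extract_values` are made up for illustration. What it does show is the shape of the approach: input is read in fixed-size chunks, callbacks fire as elements go by, and only the values of interest are kept, so memory use does not grow with document size. In a real project one of the parsers named in this thread would replace `scan`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical callback set, modeled on the general shape of SAX handlers.
struct Handlers {
    std::function<void(const std::string&)> start_element;
    std::function<void(const std::string&)> characters;
    std::function<void(const std::string&)> end_element;
};

// Toy event scanner: reads the stream chunk by chunk and fires callbacks.
// Parser state lives outside the chunk loop, so a tag or a text run may
// straddle a chunk boundary without being lost.
void scan(std::istream& in, const Handlers& h, std::size_t chunk_size = 4096) {
    assert(chunk_size > 0);
    assert(h.start_element && h.characters && h.end_element);
    std::string tag, text;
    bool in_tag = false;
    std::vector<char> buf(chunk_size);
    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::size_t n = static_cast<std::size_t>(in.gcount());
        for (std::size_t i = 0; i < n; ++i) {
            const char c = buf[i];
            if (in_tag) {
                if (c == '>') {
                    if (!tag.empty() && tag[0] == '/') h.end_element(tag.substr(1));
                    else h.start_element(tag);
                    tag.clear();
                    in_tag = false;
                } else {
                    tag += c;
                }
            } else if (c == '<') {
                if (!text.empty()) { h.characters(text); text.clear(); }
                in_tag = true;
            } else {
                text += c;
            }
        }
    }
}

// Collects the text of every element called `wanted`; everything else is
// dropped as soon as it has been seen.
std::vector<std::string> extract_values(std::istream& in, const std::string& wanted,
                                        std::size_t chunk_size = 4096) {
    std::vector<std::string> out;
    bool capture = false;
    Handlers h;
    h.start_element = [&](const std::string& name) {
        capture = (name == wanted);
        if (capture) out.emplace_back();
    };
    h.characters = [&](const std::string& t) { if (capture) out.back() += t; };
    h.end_element = [&](const std::string&) { capture = false; };
    scan(in, h, chunk_size);
    return out;
}

// Convenience wrapper for trying the sketch on an in-memory string.
std::vector<std::string> extract_values_from_string(const std::string& xml,
                                                    const std::string& wanted,
                                                    std::size_t chunk_size = 4096) {
    std::istringstream in(xml);
    return extract_values(in, wanted, chunk_size);
}
```

Calling `extract_values_from_string("<data><t>1</t><v>3.5</v></data>", "v")` yields a vector holding the single string `3.5`. A DOM parser would instead have to hold the whole 500 MB tree in memory before you could ask it anything.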

Warren Young
Poco's XML library has a nice SAX parser.
StackedCrooked
+2  A: 

I'm a fan of Xerces, but I think you are going to have to try the candidates out to see which has the best performance for your application. Like Warren said, you will want to use SAX processing. Realistically, if you truly need the performance, you should use a specialized XML appliance to do the processing.
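
To put the numbers from the question in perspective, here is a back-of-the-envelope calculation. It assumes the worst case is sustained (2500 files every minute, each a full 500 MB) and that 1 MB means 10^6 bytes; the function name is made up for illustration.

```cpp
// Sustained input rate implied by a file rate and a file size.
constexpr double required_bytes_per_second(double files_per_minute,
                                           double bytes_per_file) {
    return files_per_minute * bytes_per_file / 60.0;
}

// 2500 files/min * 500e6 bytes/file = 1.25e12 bytes/min,
// which is about 2.08e10 bytes/s, i.e. roughly 20.8 GB/s.
constexpr double worst_case = required_bytes_per_second(2500.0, 500e6);
```

That is far more than a single 10 Gbit/s network link (about 1.25 GB/s) can deliver, so under these assumptions the load would have to be spread over several machines or links whichever parser is chosen.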

ewrankin
A: 

I use libxml2 in our projects. It supports both SAX and DOM. As Warren Young said, you should use SAX. You could give Expat a try.

ZHENG Zhong
+1  A: 

Expat is one of the fastest non-validating XML parsers. If you need validation (e.g., XML Schema), then your only choice for a general-purpose C++ XML parser is Xerces-C++, though it is not particularly fast.

Another alternative would be to use an XML data binding tool that supports in-generated-code validation. In other words, the tool will generate validation code specifically for your schema. This method is generally quite a lot faster. Another benefit is that data conversion (e.g., from string to int) is performed automatically and only once. One such tool is XSD/e (full disclosure: I work on this project). In particular, the C++/Hybrid mapping can be useful in your case since it allows partially event-driven, partially in-memory processing, filtering, etc.
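
To illustrate the "validate and convert once" point, here is a hand-written stand-in for the kind of schema-specific code a binding tool produces. It is not XSD/e output and does not use its API. The schema is hypothetical (a `sample` with an integer `time` and a floating-point `value`), and `Sample`, `to_int`, `to_double` and `bind_field` are names invented for this sketch.

```cpp
#include <cstddef>
#include <exception>
#include <optional>
#include <string>

// Hypothetical typed record for one time-series sample.
struct Sample {
    long long time = 0;
    double value = 0.0;
};

// Convert the whole string to an integer; empty optional on bad input.
std::optional<long long> to_int(const std::string& s) {
    try {
        std::size_t pos = 0;
        const long long v = std::stoll(s, &pos);
        if (pos != s.size()) return std::nullopt;  // trailing junk
        return v;
    } catch (const std::exception&) {
        return std::nullopt;  // not a number, or out of range
    }
}

// Convert the whole string to a double; empty optional on bad input.
std::optional<double> to_double(const std::string& s) {
    try {
        std::size_t pos = 0;
        const double v = std::stod(s, &pos);
        if (pos != s.size()) return std::nullopt;
        return v;
    } catch (const std::exception&) {
        return std::nullopt;
    }
}

// Called once per element as it is parsed. Returns false when the element
// is not part of the (hypothetical) schema or its text does not convert,
// which is where schema-specific validation happens.
bool bind_field(Sample& s, const std::string& element, const std::string& text) {
    if (element == "time") {
        const auto v = to_int(text);
        if (!v) return false;
        s.time = *v;
        return true;
    }
    if (element == "value") {
        const auto v = to_double(text);
        if (!v) return false;
        s.value = *v;
        return true;
    }
    return false;
}
```

The rest of the application then works with `Sample` objects and never touches strings again, which is where the saving over "parse now, convert later" comes from.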

HTH, Boris

Boris Kolpackov
I did data binding and generated code to read my schema specifically. Let's see how it works, thanks!
Gollum