ansaurus

Question

Efficient way of extracting specific numerical attributes from XML

Answer 1

A:

I'd guess a regex of some sort is going to be your best option for efficiency. It's going to be faster than parsing the XML into any sort of structural construct, and as long as you can extract all the information you will need in one pass, it's likely the most efficient method.

Nick 2009-02-16 04:33:21

Answer 2

+4 A:

You are looking for a stream-oriented XML parser that reads each node in your XML one at a time rather than loading the whole thing into memory.

One of the best known is SAX, the Simple API for XML.

Here's a good article describing why to use SAX, and also the specifics of using SAX in C++.

You can think of SAX as an XML parser that loads only the bare minimum into memory, and so works well on very large XML documents. Compare this with the regex or DOM approach, which requires you to load the entire document into memory.

Ash 2009-02-16 04:34:10

+1 for this method also being future proof. Someone will eventually add a handle attribute that you don't want and if you are already using SAX to parse the file, it will be 100x easier to work around the handle attribute you don't want.

jmucchiello 2009-02-16 05:08:07

Thanks for the suggestion. Each XML document is reasonably small in size, so I'm not really concerned about memory usage. Basically, I would be able to read a string into memory, process it, and then free it. I'm more concerned with the NUMBER of documents I need to process, rather than...

LeopardSkinPillBoxHat 2009-02-16 05:26:00

...the SIZE of an individual document. For this reason, maybe a simple regex in the code would be more efficient than adding new dependencies on external APIs into my project?

LeopardSkinPillBoxHat 2009-02-16 05:26:37

FYI, you can get/use a regex that will not load the entire file, but work on the FILE* instead (using sequential or random access, depending on the iterator type). I don't recall if this was native in boost, but I know it's possible due to the generic API.

Nick 2009-02-16 06:58:55

@LeopardSkinPillBoxHat, it depends on whether you have control over the input XML structure or whether it is ever likely to change. If yes to either, then SAX is much more future-proof. Maybe you can't, but why use XML here at all? It seems like a bit of overkill for your situation. What about simple flat files?

Ash 2009-02-16 07:33:30

Using regexes or string functions is generally the wrong answer for XML processing, because of all the XML issues that you don't think of when writing them (whitespace handling, character escaping, namespaces, etc. etc.). SAX is the way to go. Not just future-proof but present-proof.

Robert Rossney 2009-02-16 20:27:05

@Ash - I agree, XML may be overkill for the implementation. Unfortunately I am stuck with it for backwards compatibility reasons.

LeopardSkinPillBoxHat 2009-02-16 22:23:13

Answer 3

A:

It would be hard to beat something like:

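/* untested code */
using std::string;
using std::vector;
const string key = "handle=\"";
size_t pos = 0;
vector<int> handles;
while ((pos = xmlstr.find(key, pos)) != string::npos) {
  pos += key.size();  // step past handle=" so the next find() makes progress
  handles.push_back(atoi(xmlstr.data() + pos));  // parse the digits after the quote
}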

It would be more efficient if handles.reserve() were called with the proper size, or perhaps if handles were a deque or list, depending on how it needs to be used later. This is unsafe code if the XML string might be malformed (before C++11, xmlstr.data() is not guaranteed to be null-terminated, so atoi might run off the end of the array; xmlstr.c_str() avoids that). It also doesn't check that handle isn't the end of a longer attribute name, or indeed whether it is actually an attribute.
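As a rough illustration of how those caveats could be addressed, here is a minimal sketch of a bounds-checked variant. The function name extract_handles is hypothetical, and the rules it applies (the match must be preceded by whitespace, and the digits must be followed by a closing quote) are assumptions about what counts as a valid handle attribute. It does not guard against integer overflow, and it still does not prove that the match sits inside a tag.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): collect the integer
// value of every handle="123" occurrence without reading past the end of xml.
std::vector<int> extract_handles(const std::string& xml) {
    static const std::string key = "handle=\"";
    std::vector<int> handles;
    std::size_t pos = 0;
    while ((pos = xml.find(key, pos)) != std::string::npos) {
        const std::size_t start = pos;
        pos += key.size();  // always advance, so the loop terminates

        // Assumption: a real attribute name is preceded by whitespace.
        // This rejects the tail of a longer name such as xhandle="1".
        if (start == 0 || !std::isspace(static_cast<unsigned char>(xml[start - 1])))
            continue;

        // Accumulate digits with an explicit bounds check instead of atoi.
        int value = 0;
        std::size_t i = pos;
        while (i < xml.size() && std::isdigit(static_cast<unsigned char>(xml[i]))) {
            value = value * 10 + (xml[i] - '0');
            ++i;
        }

        // Accept only a non-empty digit run followed by the closing quote.
        if (i > pos && i < xml.size() && xml[i] == '"')
            handles.push_back(value);
        pos = i;
    }
    return handles;
}
```

Truncated input such as handle="12 with no closing quote is skipped rather than parsed, which is the case where the atoi version could misbehave.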

Using a regex library for a regular expression like "\\bhandle=\"\\d+\"" is likely to get you results nearly as fast with less likelihood of error. It still doesn't confirm that handle is an attribute; you have to judge if that's likely to be a problem.
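A minimal sketch of that regex approach, assuming a C++11 compiler and std::regex rather than whichever regex library the project already uses. The function name extract_handles_regex is hypothetical, and a capture group is added around \\d+ so the digits can be read directly. Since the question is about processing many small documents, the pattern is compiled once and reused, on the assumption that constructing the regex is comparatively expensive.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the original answer): same job as the
// hand-written find() loop, using std::regex with a capture group.
std::vector<int> extract_handles_regex(const std::string& xml) {
    // Compiled on first call and reused for every later document.
    static const std::regex handle_re("\\bhandle=\"(\\d+)\"");
    std::vector<int> handles;
    for (std::sregex_iterator it(xml.begin(), xml.end(), handle_re), end;
         it != end; ++it) {
        // Group 1 holds only the digits; std::stoi throws std::out_of_range
        // if the number does not fit in an int.
        handles.push_back(std::stoi((*it)[1].str()));
    }
    return handles;
}
```

Note that \\b only requires a word boundary, so an attribute named my-handle would also match; whether that matters depends on the documents being processed.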

ruds 2009-02-16 05:02:47
