The application I work on uses XML for save/restore purposes. Here's an example snippet:

<?xml version="1.0" standalone="yes"?>
<itemSet>
<item handle="2" attribute1="30" attribute2="blah"></item>
<item handle="5" attribute1="27" attribute2="blahblah"></item>
</itemSet>

I want to be able to efficiently pre-process the XML which I read in from the configuration file. In particular, I want to extract the handle values from the example configuration above.

Ideally, I need a function/method that takes an opaque XML string and returns all of the handle values in a list. For the above example, a list containing 2 and 5 would be returned.

I know there's a regular expression out there that will help, but is it the most efficient way of doing this? String manipulation can be costly, and there may be potentially 1000s of XML strings I would need to process in a configuration file.

A: 

I'd guess a regex of some sort is going to be your best option for efficiency. It's going to be faster than parsing the XML into any sort of structural construct, and as long as you can extract all the information you will need in one pass, it's likely the most efficient method.

Nick
+4  A: 

You are looking for a stream-oriented XML parser that reads the nodes in your XML one at a time rather than loading the whole thing into memory.

One of the best known is SAX, the Simple API for XML.

Here's a good article describing why to use SAX and also the specifics of using SAX in C++.

You can think of SAX as an XML parser that loads only the bare minimum into memory, so it works well on very large XML documents. Compare that with the regex or DOM approach, which requires you to load the entire document into memory.
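To give a feel for the callback style, here is a rough sketch. It is not a real SAX parser: scan_start_elements is a hypothetical, hand-rolled stand-in that only reports start tags and their attributes, assumes well-formed input with no whitespace around '=', and does not cope with comments, CDATA, entities or namespaces. A real SAX library such as Expat or libxml2 would replace it in practice; the names below are made up for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <exception>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Attributes = std::map<std::string, std::string>;
using StartElementHandler =
    std::function<void(const std::string& name, const Attributes& attrs)>;

// Toy stand-in for a SAX parser (an assumption, not a library API):
// walks the string once and calls on_start for every start tag it sees.
void scan_start_elements(const std::string& xml, const StartElementHandler& on_start) {
    const auto is_space = [](char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    };
    std::size_t i = 0;
    while ((i = xml.find('<', i)) != std::string::npos) {
        ++i;
        // Skip end tags, the XML declaration and <!...> constructs.
        if (i >= xml.size() || xml[i] == '/' || xml[i] == '?' || xml[i] == '!') continue;
        std::size_t j = i;
        while (j < xml.size() && !is_space(xml[j]) && xml[j] != '>' && xml[j] != '/') ++j;
        const std::string name = xml.substr(i, j - i);
        Attributes attrs;
        i = j;
        while (i < xml.size() && xml[i] != '>') {
            if (is_space(xml[i]) || xml[i] == '/') { ++i; continue; }
            const std::size_t eq = xml.find('=', i);
            if (eq == std::string::npos || eq + 1 >= xml.size()) return;  // malformed: give up
            const char quote = xml[eq + 1];
            if (quote != '"' && quote != '\'') return;                    // malformed: give up
            const std::size_t close = xml.find(quote, eq + 2);
            if (close == std::string::npos) return;                       // malformed: give up
            attrs[xml.substr(i, eq - i)] = xml.substr(eq + 2, close - eq - 2);
            i = close + 1;
        }
        on_start(name, attrs);
    }
}

// Hypothetical helper: collects handle values from <item> elements only,
// so a handle attribute on some other element is ignored.
std::vector<int> extract_item_handles(const std::string& xml) {
    std::vector<int> handles;
    scan_start_elements(xml, [&handles](const std::string& name, const Attributes& attrs) {
        if (name != "item") return;
        const auto it = attrs.find("handle");
        if (it == attrs.end()) return;
        try {
            handles.push_back(std::stoi(it->second));
        } catch (const std::exception&) {
            // Value does not start with a number, or is out of range: skip it.
        }
    });
    return handles;
}
```

The point is the shape of the code rather than the scanner itself: the handler sees the element name and its attributes together, so filtering on "handle belongs to an item element" is one extra comparison.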

Ash
+1 for this method also being future proof. Someone will eventually add a handle attribute that you don't want and if you are already using SAX to parse the file, it will be 100x easier to work around the handle attribute you don't want.
jmucchiello
Thanks for the suggestion. Each XML document is reasonably small in size, so I'm not really concerned about memory usage. Basically, I would be able to read a string into memory, process it, and then free it. I'm more concerned with the NUMBER of documents I need to process, rather than...
LeopardSkinPillBoxHat
...the SIZE of an individual document. For this reason, maybe a simple regex in the code would be more efficient than adding new dependencies on external APIs into my project?
LeopardSkinPillBoxHat
FYI, you can get/use a regex that will not load the entire file, but work on the FILE* instead (using sequential or random access, depending on the iterator type). I don't recall if this was native in boost, but I know it's possible due to the generic API.
Nick
@LeopardSkinPillBoxHat, it depends on whether you have control over the input XML structure and whether it is ever likely to change. If the answer to either is yes, then SAX is much more future-proof. Maybe you have no choice, but why use XML here? It seems like overkill for your situation. What about simple flat files?
Ash
Using regexes or string functions is generally the wrong answer for XML processing, because of all the XML issues that you don't think of when writing them (whitespace handling, character escaping, namespaces, etc. etc.). SAX is the way to go. Not just future-proof but present-proof.
Robert Rossney
@Ash - I agree, XML may be overkill for the implementation. Unfortunately I am stuck with it for backwards compatibility reasons.
LeopardSkinPillBoxHat
A: 

It would be hard to beat something like:

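/* untested code */
#include <cstdlib>  // atoi
#include <string>
#include <vector>
using std::string;
const string key = "handle=\"";
size_t pos = 0;
std::vector<int> handles;
while ((pos = xmlstr.find(key, pos)) != string::npos) {
  pos += key.size();  // step past the opening quote; this also makes the next find() advance
  handles.push_back(atoi(xmlstr.data() + pos));
}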

It would be more efficient if handles.reserve() were called with the proper size, or perhaps if handles were a deque or list, depending on how it needs to be used later. This is unsafe code if the xml string might be malformed (before C++11, xmlstr.data() is not guaranteed to be null-terminated, so atoi might go off the end of the array). It also doesn't check that handle isn't the end of a longer attribute name, or indeed whether it is actually an attribute.
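For what it's worth, a bounds-checked variant of the loop above might look something like the sketch below. It assumes C++17 (std::string_view and std::from_chars), and extract_handles_scan is just a name made up for this example. It requires whitespace before handle and a closing quote after the digits, but it still cannot tell an attribute from text that merely looks like one.

```cpp
#include <cctype>
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical helper (C++17): scans for handle="<integer>" without relying
// on the buffer being null-terminated.
std::vector<int> extract_handles_scan(std::string_view xml) {
    constexpr std::string_view key = "handle=\"";
    std::vector<int> handles;
    std::size_t pos = 0;
    while ((pos = xml.find(key, pos)) != std::string_view::npos) {
        const std::size_t start = pos;
        pos += key.size();  // always advance, so the loop terminates
        // Reject longer names such as "myhandle" by requiring whitespace before the key.
        if (start == 0 || !std::isspace(static_cast<unsigned char>(xml[start - 1]))) continue;
        int value = 0;
        const char* first = xml.data() + pos;
        const char* last = xml.data() + xml.size();
        const auto [ptr, ec] = std::from_chars(first, last, value);
        // Keep the value only if an integer was parsed and a closing quote follows it.
        if (ec == std::errc{} && ptr != last && *ptr == '"') handles.push_back(value);
    }
    return handles;
}
```

Because from_chars is given an explicit end pointer, a truncated string simply produces no match instead of reading past the buffer.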

Using a regex library for a regular expression like "\\bhandle=\"\\d+\"" is likely to get you results nearly as fast with less likelihood of error. It still doesn't confirm that handle is an attribute; you have to judge if that's likely to be a problem.
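As a rough illustration of the regex route, here is a sketch using std::regex from C++11 (Boost.Regex would look much the same). The function name extract_handles_regex is made up for this example, and a capture group has been added around the digits so the value can be pulled out directly. I have not measured its speed against the find() loop.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the numeric value of every handle="N" match.
std::vector<int> extract_handles_regex(const std::string& xml) {
    // \b stops "myhandle=" from matching; the capture group holds the digits.
    // Compiled once and reused, since constructing a std::regex is relatively costly.
    static const std::regex pattern(R"rx(\bhandle="(\d+)")rx");
    std::vector<int> handles;
    for (std::sregex_iterator it(xml.begin(), xml.end(), pattern), end; it != end; ++it) {
        // std::stoi throws std::out_of_range if the digits do not fit in an int.
        handles.push_back(std::stoi((*it)[1].str()));
    }
    return handles;
}
```

Note that \b treats '-' as a boundary, so a name like my-handle would still match; whether that matters depends on the attribute names your files can contain.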

ruds