So I'm trying to parse some XML, the creation of which is not under my control. The trouble is, they've somehow got nodes that look like this:

<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(MORNINGSTAR) />
<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(QUARTERSTAFF) />
<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(SCYTHE) />
<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(TRATNYR) />
<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(TRIPLE-HEADED_FLAIL) />
<ID_INTERNAL_FEAT_FOCUSED_EXPERTISE_(WARAXE) />

Visual Studio and .NET both feel that the '(' and ')' characters, as used above, are totally invalid. Unfortunately, I need to process these files! Is there any way to get the Xml Reader classes to not freak out at seeing these characters, or dynamically escape them or something? I could do some sort of pre-processing on the whole file, but I DO want the '(' and ')' characters if they appear inside the node in some valid way, so I don't want to just remove them all...

+7  A: 

That simply isn't valid XML. Pre-processing is your best bet, perhaps with a regex, something like:

string output = Regex.Replace(input, @"(<\w+)\((\w+)\)([ >/])", "$1$2$3");

Edit: a bit more complex to replace the "-" inside the brackets:

string output = Regex.Replace(input, @"(<\w+)\(([-\w]+)\)([ >/])",
    delegate(Match match) {
        return match.Groups[1].Value + match.Groups[2].Value.Replace('-', '_')
             + match.Groups[3].Value;
    });
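
To show how the cleaned-up string might then be fed to a parser, here is a minimal sketch built on the second pattern above. The class name XmlCleaner and the method name StripParensFromNames are made up for illustration, and the use of XDocument is an assumption; any of the .NET XML readers should accept the cleaned string.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper (not part of .NET): wraps the regex from the answer above.
public static class XmlCleaner
{
    // Matches "<NAME(" + inner + ")" followed by a space, '>' or '/'.
    private static readonly Regex BadName =
        new Regex(@"(<\w+)\(([-\w]+)\)([ >/])");

    public static string StripParensFromNames(string input)
    {
        // Drop the parentheses and turn '-' inside them into '_'.
        return BadName.Replace(input, m =>
            m.Groups[1].Value
            + m.Groups[2].Value.Replace('-', '_')
            + m.Groups[3].Value);
    }

    // Illustrative usage: clean first, then parse as usual.
    public static XDocument Load(string rawXml)
    {
        return XDocument.Parse(StripParensFromNames(rawXml));
    }
}
```

Parentheses in text content are left alone, because the pattern only matches directly after "<" plus a run of word characters. Note that it does not match closing tags such as "</NAME_(X)>"; the sample in the question only has self-closing elements, so that case would need an extra pattern.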
Marc Gravell
I would try for the most restrictive regex possible
Dolphin
@Dolphin - care to provide a concrete suggestion?
Marc Gravell
The regex mostly works, but somehow the second-to-last node in the above example survives with '(' and ')' intact :\
GWLlosa
Add a - to the character class, then (will update the answer)
Marc Gravell
Thanks. I'm the third-party to this XML document, so I'm not really in a position to demand fixes to it. But with the regex, now it works.
GWLlosa
+2  A: 

If it isn't syntactically valid, it's not XML.

XML is very strict about this.

If you can't get the sending application to send correct XML, then just let them know that whatever downstream process sees this will fail, whether it's yours or some other app in the future.

If preprocessing isn't an option, another mechanism is to wrap the Stream object that is passed to the parser in a custom stream. That stream could look for < characters and set a flag when it sees one. Until a > character is seen, it could eat any ( or ) characters. We've used something like this to get rid of NUL and ^Z characters added to an XML file by a legacy transport mechanism. (The only gotcha might be > characters inside an attribute value, since they don't have to be escaped there; only < characters do. Parentheses inside attribute values would be eaten as well.)
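
A rough sketch of such a wrapper, in case it helps. The class name ParenStrippingStream is invented for this example, and the code assumes an ASCII-compatible encoding such as UTF-8, where <, >, ( and ) are each a single byte; it would not work on UTF-16 input.

```csharp
using System;
using System.IO;

// Hypothetical wrapper (not part of .NET): drops '(' and ')' bytes that
// occur between '<' and '>'. Assumes UTF-8 or another ASCII-compatible encoding.
public sealed class ParenStrippingStream : Stream
{
    private readonly Stream _inner;
    private bool _insideTag; // true after '<' until the next '>'

    public ParenStrippingStream(Stream inner) { _inner = inner; }

    public override int Read(byte[] buffer, int offset, int count)
    {
        while (true)
        {
            int read = _inner.Read(buffer, offset, count);
            if (read <= 0) return 0; // end of the underlying stream

            int kept = 0;
            for (int i = 0; i < read; i++)
            {
                byte b = buffer[offset + i];
                if (b == (byte)'<') _insideTag = true;
                else if (b == (byte)'>') _insideTag = false;
                else if (_insideTag && (b == (byte)'(' || b == (byte)')'))
                    continue; // eat the parenthesis
                buffer[offset + kept++] = b; // compact kept bytes in place
            }
            // If every byte was dropped, read again so that 0 still
            // means "end of stream" to the caller.
            if (kept > 0) return kept;
        }
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }

    // Read-only, forward-only stream: everything else is unsupported.
    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}
```

It could then be handed to the parser with something like XmlReader.Create(new ParenStrippingStream(File.OpenRead(path))), where path is whatever file you are loading. Because the flag only tracks < and >, parentheses inside comments, CDATA sections and attribute values are stripped too.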

lavinio