ansaurus

Question

How To Parse XML With Invalid Characters in Node Name?

Answer 1

+7 A:

That simply isn't valid XML. Pre-processing is your best bet, perhaps with a regex, something like:

string output = Regex.Replace(input, @"(<\w+)\((\w+)\)([ >/])", "$1$2$3");

Edit: a slightly more complex version that also replaces any "-" inside the parentheses with "_":

string output = Regex.Replace(input, @"(<\w+)\(([-\w]+)\)([ >/])",
    delegate(Match match) {
        return match.Groups[1].Value + match.Groups[2].Value.Replace('-', '_')
             + match.Groups[3].Value;
    });
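
For illustration, here is a minimal sketch of how that replacement could be wrapped up and exercised. The class name TagNameCleaner and the sample input are hypothetical (the original document isn't reproduced here), and the optional "/" in the pattern is my own assumption: as written above, "<\w+" does not match a closing tag such as "</item(a-b)>", so this variant allows for one.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: strips "(...)" from element names before parsing.
public static class TagNameCleaner
{
    // Same pattern as above, plus an optional "/" so that closing tags
    // are rewritten too (an assumption about what the input needs).
    private static readonly Regex BadTag =
        new Regex(@"(</?\w+)\(([-\w]+)\)([ >/])");

    public static string Clean(string input)
    {
        // Keep the tag start and the trailing delimiter; keep the text
        // inside the parentheses with '-' replaced by '_'.
        return BadTag.Replace(input, m =>
            m.Groups[1].Value
            + m.Groups[2].Value.Replace('-', '_')
            + m.Groups[3].Value);
    }
}
```

With this sketch, TagNameCleaner.Clean("<item(a-b) id=\"1\">x</item(a-b)>") should yield <itema_b id="1">x</itema_b>, which a normal XML parser will accept.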

Marc Gravell 2009-07-01 13:25:19

I would try for the most restrictive regex possible.

Dolphin 2009-07-01 13:46:15

@Dolphin - care to provide a concrete suggestion?

Marc Gravell 2009-07-01 14:16:01

The regex mostly works, but somehow the second-to-last node in the above example survives with '(' and ')' intact :\

GWLlosa 2009-07-01 14:48:44

Add a - to the character class, then (will update)

Marc Gravell 2009-07-01 14:57:53

Thanks. I'm a third party to this XML document, so I'm not really in a position to demand fixes to it. But with the regex, it now works.

GWLlosa 2009-07-02 11:38:51

Answer 2

+2 A:

If it isn't syntactically valid, it's not XML.

XML is very strict about this.

If you can't get the sending application to send correct XML, then just let them know that whatever downstream process sees this will fail, whether it's yours or some other app in the future.

If preprocessing isn't an option, another clever mechanism is to wrap the Stream object that is passed to the parser with a custom stream. That stream could look for < characters, and when it sees one, set a flag. Until a > character is seen, it could eat any ( or ) characters. We've used something like this to get rid of NUL and ^Z characters added to an XML file by a legacy transport mechanism. (The only gotcha there might be > characters inside of an attribute value, since they don't have to be escaped there; only < characters do.)
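
A rough sketch of such a wrapper, to make the idea concrete. The class name ParenStrippingStream is made up for illustration, and the code assumes an ASCII-compatible encoding such as UTF-8, where <, >, ( and ) are each a single byte. It is deliberately naive: markup characters inside attribute values, comments, or CDATA sections can confuse the flag, and parentheses inside attribute values are stripped as well.

```csharp
using System;
using System.IO;

// Hypothetical read-only wrapper: drops '(' and ')' bytes that occur
// between a '<' and the next '>'. Assumes a UTF-8/ASCII-compatible source.
public sealed class ParenStrippingStream : Stream
{
    private readonly Stream inner;
    private bool insideTag; // true after '<' until the next '>'

    public ParenStrippingStream(Stream inner) { this.inner = inner; }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (count == 0) return 0;
        while (true)
        {
            int read = inner.Read(buffer, offset, count);
            if (read <= 0) return 0; // end of the underlying stream

            int kept = 0;
            for (int i = 0; i < read; i++)
            {
                byte b = buffer[offset + i];
                if (b == (byte)'<') insideTag = true;
                else if (b == (byte)'>') insideTag = false;
                else if (insideTag && (b == (byte)'(' || b == (byte)')'))
                    continue; // eat the parenthesis
                buffer[offset + kept++] = b; // compact in place
            }
            // If every byte was dropped, read again rather than return 0,
            // because 0 would signal end of stream to the caller.
            if (kept > 0) return kept;
        }
    }

    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}
```

Usage would be along the lines of new XmlDocument().Load(new ParenStrippingStream(File.OpenRead(path))), where path is whatever file you are reading. Note that this only removes the parentheses; it does not touch "-", which is legal inside an XML name anyway.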

lavinio 2009-07-01 13:28:21
