ansaurus

Question

What regex could I use to extract a body of XML text from a body of unformatted text?

Answer 1

+2 A:

If you know that the root element will always be <RootElement ...> and that there will never be a nested <RootElement> tag, you can do it like this:

\<\?xml .+?\</RootElement\>

This regex will lazily match all text between <?xml and </RootElement>.
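
One thing to watch: in .NET, . does not match a newline unless RegexOptions.Singleline is set, so XML that spans several lines needs that option. Below is a minimal sketch of the idea, assuming C# and a root element literally named RootElement. The class name FixedRootExtractor and the sample input are made up for illustration. The angle brackets are left unescaped because they are not special characters in a .NET pattern.

```csharp
using System;
using System.Text.RegularExpressions;

static class FixedRootExtractor
{
    // Returns the first "<?xml ... </RootElement>" span, or null if none is found.
    public static string Extract(string text)
    {
        // Singleline lets '.' match newlines, so the lazy match can cross lines.
        Match m = Regex.Match(text, @"<\?xml .+?</RootElement>", RegexOptions.Singleline);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        // Hypothetical input: an XML document buried in unformatted text.
        string text = "log line\n<?xml version=\"1.0\"?>\n<RootElement>\n  <Item>a</Item>\n</RootElement>\nmore text";
        Console.WriteLine(Extract(text) ?? "no XML found");
    }
}
```

Because the match is lazy, it stops at the first </RootElement>, which is why a nested <RootElement> would cut the result short.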

SLaks 2010-09-16 17:48:37

\<\?xml[^>]*\?>\s*<RootElement[\s>].+?\</RootElement\> seems safer, just in case there is another \<\?xml in there (the [\s>] also allows a root tag without attributes), but generally XML and regexes don't mix too well.

Radomir Dopieralski 2010-09-16 17:53:12

@Radomir I don't intend to *parse* the xml with regex. I just want to extract the XML out so that I can feed it into an XML parser.

Ben McCormack 2010-09-16 17:56:40

Yes, that's why I deleted my initial answer :)

Radomir Dopieralski 2010-09-16 18:04:01

Answer 2

+1 A:

I understand that the root element will not always be called RootElement, so you can use

<\?xml[^>]+>\s*<\s*(\w+).*?<\s*/\s*\1>

with RegexOptions.Singleline, so that . also matches line breaks. This takes the first tag name after the opening <?xml ...?> declaration and captures everything up to the matching closing tag.

In C#:

resultString = Regex.Match(subjectString, @"<\?xml[^>]+>\s*<\s*(\w+).*?<\s*/\s*\1>", RegexOptions.Singleline).Value;
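
Since the goal is to hand the extracted text to an XML parser, here is a rough sketch of the full round trip. It assumes XDocument from System.Xml.Linq as the parser. The class name XmlExtractor and the sample input are hypothetical. The parser does the real validation, so a bad extraction surfaces as an XmlException rather than as silently wrong data.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

static class XmlExtractor
{
    // Returns the first "<?xml ...?> <Root> ... </Root>" span, or null if none is found.
    public static string Extract(string subjectString)
    {
        Match m = Regex.Match(
            subjectString,
            @"<\?xml[^>]+>\s*<\s*(\w+).*?<\s*/\s*\1>",  // \1 = root tag name captured by (\w+)
            RegexOptions.Singleline);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        // Hypothetical input: the root element name is not known in advance.
        string subjectString = "noise before\n<?xml version=\"1.0\"?>\n<Order id=\"7\">\n  <Line>widget</Line>\n</Order>\nnoise after";

        string xml = Extract(subjectString);
        if (xml == null)
        {
            Console.WriteLine("no XML found");
            return;
        }

        try
        {
            // Let the parser decide whether the extracted text is well-formed.
            XDocument doc = XDocument.Parse(xml);
            Console.WriteLine(doc.Root.Name.LocalName);
        }
        catch (XmlException ex)
        {
            Console.WriteLine("extracted text is not well-formed XML: " + ex.Message);
        }
    }
}
```

As with the fixed-name version, the lazy match ends at the first closing tag with the root's name, so a document that nests an element of the same name inside the root will be truncated and should fail to parse.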

Tim Pietzcker 2010-09-16 17:55:41
