Let's say I have the following body of text:

Call me Ishmael. Some years ago- never mind how long precisely- having little 
or no money in my purse, and nothing particular to interest me on shore, I 
thought I would sail about a little and see the watery part of the world. It is  
<?xml version="1.0" encoding="utf-8"?>
<RootElement xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
     xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <ChildElement />
   <ChildElement />
</RootElement>
a way I have of driving off the spleen and regulating the circulation. Whenever  
I find myself growing grim about the mouth; whenever it is a damp, drizzly 
November in my soul; 

What regex could I use that would return the XML embedded in the string?

NOTE: I can assume that <RootElement> and </RootElement> will always have the same name.

+2  A: 

If you know that the root element will always be <RootElement ...> and that there will never be a nested <RootElement> tag, you can do it like this:

\<\?xml .+?\</RootElement\>

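This regex will lazily match all text between <?xml and the first </RootElement>. Because the XML spans several lines, the dot has to be allowed to match newlines (in .NET, RegexOptions.Singleline).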
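
As a rough sketch of how that pattern behaves, here is a Python version. Python's re module is used only for illustration, and the shortened sample text and the names SAMPLE, PATTERN and extract_xml are assumptions, not part of the question. re.DOTALL is Python's counterpart to RegexOptions.Singleline.

```python
import re

# Hypothetical sample: prose with an XML document embedded in the middle.
SAMPLE = (
    "thought I would sail about a little. It is\n"
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<RootElement>\n"
    "   <ChildElement />\n"
    "   <ChildElement />\n"
    "</RootElement>\n"
    "a way I have of driving off the spleen.\n"
)

# Lazy match from the XML declaration to the first closing root tag.
# re.DOTALL lets "." cross line breaks.
PATTERN = re.compile(r"<\?xml .+?</RootElement>", re.DOTALL)


def extract_xml(text):
    """Return the embedded XML document, or None if nothing matches."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


xml_text = extract_xml(SAMPLE)
```

The lazy .+? stops at the first </RootElement>, which is why this only works when no nested <RootElement> appears inside the document.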

SLaks
\<\?xml[^>]*\?>\s*<RootElement\s+.+?\</RootElement\> seems safer, just in case there is another \<\?xml in there, but generally xml and regexps don't mix too well.
Radomir Dopieralski
@Radomir I don't intend to *parse* the xml with regex. I just want to extract the XML out so that I can feed it into an XML parser.
Ben McCormack
Yes, that's why I deleted my initial answer :)
Radomir Dopieralski
+1  A: 

I understand that the root element will not always be called RootElement, so you can use

<\?xml[^>]+>\s*<\s*(\w+).*?<\s*/\s*\1>

using RegexOptions.Singleline. This will take the first tag name after the opening <?xml ...?> declaration and capture everything up to the matching closing tag.

In C#:

resultString = Regex.Match(subjectString, @"<\?xml[^>]+>\s*<\s*(\w+).*?<\s*/\s*\1>", RegexOptions.Singleline).Value;
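
For comparison, a minimal Python sketch of the same idea, followed by handing the extracted text to an XML parser. The sample string, the root name Catalog, and the helper extract_any_root are assumptions made up for illustration; re.DOTALL plays the role of RegexOptions.Singleline.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample with a root element that is not called RootElement.
SAMPLE = (
    "Call me Ishmael. Some years ago\n"
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<Catalog xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">\n'
    "   <Item />\n"
    "   <Item />\n"
    "</Catalog>\n"
    "never mind how long precisely.\n"
)

# Group 1 captures the root tag name; \1 requires the same name in the
# closing tag, so the root element can be called anything matching \w+.
PATTERN = re.compile(r"<\?xml[^>]+>\s*<\s*(\w+).*?<\s*/\s*\1>", re.DOTALL)


def extract_any_root(text):
    """Return (xml_text, root_name), or None if no XML document is found."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return match.group(0), match.group(1)


xml_text, root_name = extract_any_root(SAMPLE)

# Feed the extracted text to a real parser; bytes are passed so the
# encoding declaration in the XML header is honoured.
root = ET.fromstring(xml_text.encode("utf-8"))
```

Note that \w+ does not match a colon or a hyphen, so a root name such as ns:Root would need a wider character class.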
Tim Pietzcker