Hi,

I need to remove the tag "image" with regex.

I'm working with C# and .NET.

For example, <rrr><image from="91524" to="92505" /></rrr> should become:

<rrr></rrr>

Anyone???

A: 

You shouldn't really be using regex for this task, especially when .NET provides such powerful tools to handle XML:

XElement xml = XElement.Parse("<rrr><image from=\"91524\" to=\"92505\" /></rrr>");
xml.Descendants("image").Remove();

However, if you insist on doing this with regex, let's see what happens:

string xml = "<rrr><image from=\"91524\" to=\"92505\" /></rrr>";
string output = Regex.Replace(xml, "<image.*?>", "");

This method has some problems, though, that the first method solves for you. For example:

  • Doesn't handle case sensitivity.
  • > characters in attribute values end the match too early.
  • Newlines inside the tag won't be matched, because . does not match a newline by default.
  • Incorrectly matches other tags that start with image, like <image2 />.
  • XML comments can cause problems.
  • Handles <image /> but not <image></image>: the closing tag is left behind.
  • etc...
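
A few of the problems above can be reproduced with a small console program. This is only a sketch: the class name NaiveRegexDemo, the helper StripImage and the test inputs are made up for illustration; the pattern itself is the one from the snippet above.

```csharp
using System;
using System.Text.RegularExpressions;

class NaiveRegexDemo
{
    // Hypothetical helper wrapping the pattern from the answer above.
    public static string StripImage(string xml)
    {
        return Regex.Replace(xml, "<image.*?>", "");
    }

    static void Main()
    {
        // A '>' inside an attribute value ends the lazy match too early.
        Console.WriteLine(StripImage("<rrr><image alt=\"a>b\" /></rrr>"));
        // prints: <rrr>b" /></rrr>

        // '.' does not match '\n' by default, so this tag is not removed.
        Console.WriteLine(StripImage("<rrr><image\n from=\"1\" /></rrr>"));

        // An unrelated element whose name starts with "image" is removed too.
        Console.WriteLine(StripImage("<rrr><image2 /></rrr>"));
        // prints: <rrr></rrr>

        // Only the opening tag matches; the closing tag is left behind.
        Console.WriteLine(StripImage("<rrr><image></image></rrr>"));
        // prints: <rrr></image></rrr>
    }
}
```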

Some of these are easy to fix; others are trickier. In the end, it's not worth spending time improving the regular expression solution to handle all the special cases when the LINQ to XML solution is so simple and does all this for you.
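
For comparison, here is a rough sketch of the LINQ to XML approach run against inputs that trip up the regex. The class name LinqToXmlDemo, the helper StripImage and the inputs are assumptions for illustration; XElement.Parse, Descendants and Remove are the same calls used in the first snippet.

```csharp
using System;
using System.Xml.Linq;

class LinqToXmlDemo
{
    // Hypothetical helper: parse, drop every <image> element, serialize again.
    public static string StripImage(string xml)
    {
        XElement root = XElement.Parse(xml);
        root.Descendants("image").Remove();
        // DisableFormatting keeps the output on one line without indentation.
        return root.ToString(SaveOptions.DisableFormatting);
    }

    static void Main()
    {
        // '>' inside an attribute value is handled by the parser.
        Console.WriteLine(StripImage("<rrr><image alt=\"a>b\" /></rrr>"));

        // A tag split across lines is still a single element.
        Console.WriteLine(StripImage("<rrr><image\n from=\"1\" /></rrr>"));

        // <image2 /> is a different element name and is kept;
        // <image></image> is removed as a whole.
        Console.WriteLine(StripImage("<rrr><image2 /><image></image></rrr>"));
    }
}
```

Note that once its last child is removed, the empty element may be written in the self-closing form <rrr /> rather than <rrr></rrr>; the two are equivalent XML.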

Mark Byers
A: 

Try this:

<image[^>]*>
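
A minimal sketch of how this pattern might be plugged into Regex.Replace, assuming the sample input from the question (the class name NegatedClassDemo and the helper StripImage are made up for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

class NegatedClassDemo
{
    // Hypothetical helper: remove anything that looks like an <image ...> tag.
    public static string StripImage(string xml)
    {
        return Regex.Replace(xml, "<image[^>]*>", "");
    }

    static void Main()
    {
        string xml = "<rrr><image from=\"91524\" to=\"92505\" /></rrr>";
        Console.WriteLine(StripImage(xml));
        // prints: <rrr></rrr>

        // [^>] also matches newlines, so a tag split across lines is removed.
        Console.WriteLine(StripImage("<rrr><image\n from=\"1\" /></rrr>"));
        // prints: <rrr></rrr>
    }
}
```

The pattern still matches tags such as <image2 /> and stops at a > inside an attribute value, so it only suits input whose shape is known in advance.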

Skilldrick
A: 

Even though XML is very regular and suffers from a draconian "validate or die" policy, this Stack Overflow question will prove very enlightening.

Regular expressions are powerful, but the XML tools in .NET are better for this task, because they are designed to handle this sort of thing. You can manipulate the XML based upon its structure, something regexes can't do because they see your XML as text.

XML is text, but it's text with a particular structure. Take advantage of that known quality.
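
As one illustration of working with the structure, the sketch below removes only those image elements whose from attribute is at or above a threshold. The threshold rule, the class name StructureDemo, the helper RemoveImagesFrom and the extra <note /> element are all hypothetical; they are not part of the original question.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class StructureDemo
{
    // Hypothetical helper: drop <image> elements whose "from" attribute
    // is present and >= minFrom; everything else is left alone.
    public static XElement RemoveImagesFrom(string xml, int minFrom)
    {
        XElement root = XElement.Parse(xml);
        root.Descendants("image")
            // The (int?) cast yields null when the attribute is missing,
            // and a null comparison is false, so such elements are kept.
            .Where(e => (int?)e.Attribute("from") >= minFrom)
            .Remove();
        return root;
    }

    static void Main()
    {
        string xml = "<rrr><image from=\"100\" />"
                   + "<image from=\"91524\" to=\"92505\" /><note /></rrr>";
        XElement result = RemoveImagesFrom(xml, 1000);
        Console.WriteLine(result.ToString(SaveOptions.DisableFormatting));
        // prints: <rrr><image from="100" /><note /></rrr>
    }
}
```

A condition like this is awkward to express in a pattern over raw text, but it is a one-line filter once the document has been parsed.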

Broam