I'd like to strip out occurrences of a specific tag, leaving the inner XML intact. I'd like to do this with one pass (rather than searching, replacing, and starting from scratch again). For instance, from the source:

<element>
    <RemovalTarget Attribute="Something">
      Content Here
    </RemovalTarget>
</element>
<element>
  More Here
</element>

I'd like the result to be:

<element>
  Content Here
</element>
<element>
  More Here
</element>

I've tried something like this (forgive me, I'm new to Linq):

var elements = from element in doc.Descendants()
         where element.Name.LocalName == "RemovalTarget"
         select element;

foreach (var element in elements) {
    element.AddAfterSelf(element.Value);
    element.Remove();
}

but on the second pass through the loop I get a null reference, presumably because modifying the document invalidates the collection I'm iterating. What is an efficient way to remove these tags from a potentially large document?

+3  A: 

Have you considered using XSLT? It seems like the perfect solution, since you are doing exactly what XSLT is meant for: transforming one XML document into another. The templating system will handle arbitrarily nested elements for you without problems.

Here is a basic example
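
A minimal sketch of what such a transform might look like, run from C# with XslCompiledTransform. It pairs the standard identity template with one extra template that emits only the children of RemovalTarget. The class name XsltUnwrap and the method name Unwrap are made up for illustration, and the sketch assumes the input has a single root element and that RemovalTarget is not in a namespace.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper; the names are illustrative only.
public static class XsltUnwrap
{
    // Identity transform plus one override: for RemovalTarget,
    // output its child nodes but not the element itself.
    private const string Stylesheet = @"<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:template match=""@*|node()"">
    <xsl:copy>
      <xsl:apply-templates select=""@*|node()""/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match=""RemovalTarget"">
    <xsl:apply-templates select=""node()""/>
  </xsl:template>
</xsl:stylesheet>";

    public static string Unwrap(string xml)
    {
        var transform = new XslCompiledTransform();
        using (var sheet = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(sheet);
        }

        var output = new StringWriter();
        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var writer = XmlWriter.Create(output, transform.OutputSettings))
        {
            transform.Transform(input, writer);
        }
        return output.ToString();
    }
}
```

Because the identity template recurses through apply-templates, nested RemovalTarget elements should be unwrapped in the same pass.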

Andrew Bullock
A: 

I would recommend XSLT, as Trull suggested, as the best solution.

Alternatively, you might look at using a string builder and regex matching to remove the items.

You could also walk through the document, working with each node and its parent to move the content from inside the node up to the parent, but that would be tedious and quite unnecessary given the other solutions available.

Mitchel Sellers
A: 

A lightweight solution would be to use XmlReader to go through the input document and XmlWriter to write the output.

Note: the XmlReader and XmlWriter classes are abstract; use the derived classes appropriate for your situation.
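
A rough sketch of that streaming approach, assuming instances obtained from XmlReader.Create and XmlWriter.Create are acceptable. The class StreamingUnwrap and its methods are hypothetical names. The loop copies every node it understands and simply does not write the start and end tags of the target element. The XML declaration and any DOCTYPE are not copied in this sketch.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper; the names are illustrative only.
public static class StreamingUnwrap
{
    public static void Unwrap(XmlReader reader, XmlWriter writer, string tagName)
    {
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    // Skip the target's start tag (or an empty <tag/>); its children follow as usual.
                    if (reader.LocalName == tagName) break;
                    bool isEmpty = reader.IsEmptyElement;
                    writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                    writer.WriteAttributes(reader, false);
                    if (isEmpty) writer.WriteEndElement();
                    break;
                case XmlNodeType.EndElement:
                    // Skip the target's end tag; close every other element.
                    if (reader.LocalName == tagName) break;
                    writer.WriteFullEndElement();
                    break;
                case XmlNodeType.Text:
                    writer.WriteString(reader.Value);
                    break;
                case XmlNodeType.CDATA:
                    writer.WriteCData(reader.Value);
                    break;
                case XmlNodeType.Whitespace:
                case XmlNodeType.SignificantWhitespace:
                    writer.WriteWhitespace(reader.Value);
                    break;
                case XmlNodeType.Comment:
                    writer.WriteComment(reader.Value);
                    break;
                case XmlNodeType.ProcessingInstruction:
                    writer.WriteProcessingInstruction(reader.Name, reader.Value);
                    break;
                // Other node types (XML declaration, DOCTYPE) are dropped in this sketch.
            }
        }
    }

    // Convenience wrapper for small inputs and tests.
    public static string UnwrapString(string xml, string tagName)
    {
        var output = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (var reader = XmlReader.Create(new StringReader(xml)))
        using (var writer = XmlWriter.Create(output, settings))
        {
            Unwrap(reader, writer, tagName);
        }
        return output.ToString();
    }
}
```

For a large document you would pass file-based readers and writers, so that the whole tree never has to be held in memory.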

Sunny
+2  A: 

You'll have to skip the deferred execution with a call to ToList, which probably won't hurt your performance in large documents as you're just going to be iterating and replacing at a much lower big-O than the original search. As @jacob_c pointed out, I should be using element.Nodes() to replace it properly, and as @Panos pointed out, I should reverse the list in order to handle nested replacements accurately.

Also, use XElement.ReplaceWith, which is much faster than your current approach on large documents:

// Materialize the query before modifying the tree, and reverse it so the innermost targets are handled first.
// Reverse() must come before ToList(): List<T>.Reverse() reverses in place and returns void.
var elements = doc.Descendants("RemovalTarget").Reverse().ToList();

foreach (var element in elements) {
    element.ReplaceWith(element.Nodes());
}
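
For anyone who wants to try this on a nested case, here is a small self-contained sketch of the same technique. The class LinqUnwrap and the method Unwrap are made-up names, and the sample document in the usage note is an assumption for illustration, not the asker's data.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; the names are illustrative only.
public static class LinqUnwrap
{
    public static void Unwrap(XContainer root, XName tagName)
    {
        // Snapshot the matches in reverse document order, so a nested target
        // is unwrapped before the target that contains it.
        var targets = root.Descendants(tagName).Reverse().ToList();

        foreach (var target in targets)
        {
            // Replace the element with its own child nodes (text, elements, comments).
            target.ReplaceWith(target.Nodes());
        }
    }
}
```

Usage might look like this: parse a document such as `<root><element><RemovalTarget>Outer <RemovalTarget>Inner</RemovalTarget></RemovalTarget></element></root>` with XDocument.Parse, call `LinqUnwrap.Unwrap(doc, "RemovalTarget")`, and both wrappers are gone while their content stays in place.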

One last point: considering what this may be used for, I tend to agree with @Trull that XSLT may be what you're actually looking for, for example if you're removing all <b> tags from a document. Otherwise, enjoy this fairly decent and fairly well-performing LINQ to XML implementation.

sixlettervariables
.Value won't work if the RemovalTarget element contains child elements
Jacob Carpenter
A: 

Depending on how you manage your XML, you could use a regular expression to remove the tags.

Here's a simple console application that demonstrates the use of a regex:

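    using System.IO;
    using System.Text.RegularExpressions;

    static void Main(string[] args)
    {
        // args[0] is the input file, args[1] the output file.
        string content = File.ReadAllText(args[0]);

        // Matches both the opening tag (with any attributes) and the closing tag.
        // \b keeps names that merely start with "RemovalTarget" from matching.
        Regex removalTag = new Regex(@"</?RemovalTarget\b[^>]*>");

        string cleanContent = removalTag.Replace(content, string.Empty);
        File.WriteAllText(args[1], cleanContent);
    }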

This leaves newline characters in the file, but it shouldn't be too difficult to augment the regular expression.

Philipp Schmid
Processing XML as string data is very simple if you have control over your source XML and fraught with innumerable complexities if you don't. XML in the wild contains CDATA and comments, and those introduce so many special cases that it's usually best to stick with DOM-based approaches.
Robert Rossney