I have an XHTML string in which I want to replace the content of certain tags, for example:

<span tag="x">FOO</span> 
<span tag="y"> <b>bar</b> some random text <span>another span</span> </span>

I want to be able to find tag="x" and replace FOO with my own content, and find tag="y" and replace all of its inner content with my own content.

What is the best way to do this? I am thinking regex is definitely out of the question. Can XPath do this, or is it only for searching and not for manipulation?

+3  A: 

If you're sure the content is XHTML (i.e. well-formed XML) then XPath can certainly do it.

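using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<span tag=...");

// SelectNodes returns a non-generic XmlNodeList, so the loop variable is typed explicitly.
// In XPath, attributes are addressed with @ and string literals must be quoted.
foreach (XmlNode node in doc.SelectNodes("//span[@tag='x']"))
{
    node.InnerXml = "New Content";
}
foreach (XmlNode node in doc.SelectNodes("//span[@tag='y']"))
{
    node.InnerXml = "Different Content";
}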
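
One detail the snippet leaves open: the sample in the question has two top-level spans, and an XML document may have only one root element, so the string would need a wrapper before LoadXml accepts it. A minimal sketch of that, assuming a synthetic `<root>` wrapper and a hypothetical helper named `SpanReplacer.ReplaceTagged` (neither is from the original answer):

```csharp
using System.Linq;
using System.Xml;

static class SpanReplacer
{
    // Hypothetical helper: replaces the inner XML of every <span> whose
    // "tag" attribute equals tagValue and returns the rewritten fragment.
    public static string ReplaceTagged(string fragment, string tagValue, string newInnerXml)
    {
        var doc = new XmlDocument();

        // Wrap the fragment in a synthetic root so that several
        // top-level elements still form one well-formed document.
        doc.LoadXml("<root>" + fragment + "</root>");

        // Assumption: tagValue contains no quote characters, because it is
        // concatenated straight into the XPath expression.
        string xpath = "//span[@tag='" + tagValue + "']";

        // Copy the matches into a list before modifying the document.
        var matches = doc.SelectNodes(xpath).Cast<XmlNode>().ToList();
        foreach (XmlNode node in matches)
        {
            node.InnerXml = newInnerXml;
        }

        // Return the content without the synthetic wrapper.
        return doc.DocumentElement.InnerXml;
    }
}
```

Calling `SpanReplacer.ReplaceTagged(xhtml, "x", "New Content")` and then feeding the result through a second call with `"y"` would cover both replacements. Note that XmlDocument drops whitespace-only text nodes unless PreserveWhitespace is set to true before loading.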
Dean Harding
Thank you for the answer plus the sample code. Much appreciated
Daveo
A: 

You can surely do this using regular expressions (it is string manipulation, after all), but that may get a bit nasty, because HTML can be quite complicated. However, it is certainly a possible approach.

An alternative would be to parse the XHTML page into a structured hierarchy and then do the processing. The question is whether the pages are really valid XML. The XHTML specification requires that, but if you pick a random page from the internet that claims to be XHTML, you may run into trouble.

  • If no, then you need to parse them as HTML, which can be done using Html Agility Pack.
  • If yes, then you can treat it as XML and use standard .NET classes to parse it.

The second case could be done using LINQ to XML like this:

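// doc is an XDocument (or XElement) loaded from the XHTML
var xs = from span in doc.Descendants("span")
         let tag = span.Attribute("tag")
         where tag != null && tag.Value == "x"
         select span;
// ToList() takes a snapshot so the tree is not modified while it is being enumerated
foreach (var x in xs.ToList()) x.Value = "BAR!";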

The obvious benefit is that this is much more readable and maintainable than a solution that uses regular expressions. Html Agility Pack provides a similar API (although I'm not familiar enough with it to write a sample).
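
To decide between the two cases at run time, one option is simply to attempt the XML parse and treat a failure as the signal to fall back to an HTML parser. The sketch below does that with LINQ to XML. The helper name `XhtmlFragment.TryReplace` and the synthetic `<root>` wrapper are assumptions made for illustration, and it also assumes the markup carries no XHTML namespace declaration (with one, `Descendants("span")` would need a namespace-qualified name).

```csharp
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class XhtmlFragment
{
    // Hypothetical helper: returns false when the input is not well-formed
    // XML, so the caller can fall back to an HTML parser instead.
    public static bool TryReplace(string fragment, string tagValue,
                                  string newText, out string result)
    {
        result = string.Empty;
        XElement root;
        try
        {
            // Synthetic wrapper so several top-level elements are allowed.
            root = XElement.Parse("<root>" + fragment + "</root>",
                                  LoadOptions.PreserveWhitespace);
        }
        catch (XmlException)
        {
            return false; // not well-formed XML
        }

        // The cast yields null when the attribute is missing,
        // so no separate null check is needed.
        var spans = root.Descendants("span")
                        .Where(s => (string)s.Attribute("tag") == tagValue)
                        .ToList(); // snapshot before modifying the tree

        foreach (var span in spans)
        {
            // Setting Value replaces all child nodes with plain text.
            span.Value = newText;
        }

        // Serialize the children of the wrapper, not the wrapper itself.
        result = string.Concat(
            root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
        return true;
    }
}
```

Because `Value` inserts plain text, any markup in `newText` is escaped on output; to insert elements, something like `span.ReplaceNodes(...)` with parsed content would be needed instead.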

Tomas Petricek
[No](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). **You CAN'T do it with regular expressions**.
SLaks
This has to be linked when HTML and RegEx are mentioned in the same answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
Nick Craver
Hehe, great reference :-), but there _are_ cases where I would use regular expressions (if it wasn't _really_ XML and I needed a quick hack rather than a solid solution). The title should really be **You'll burn in hell if you do it using regular expressions**. To me, "can't" and "regular expressions" in one sentence suggests that there should be a proof ;-)
Tomas Petricek
-1: Tomas, I thought you'd know better.
John Saunders
Tomas: the OP stated XHTML.
John Saunders
@John Saunders: I see that he means "XHTML", but this is the world of so-called "web standards".
Tomas Petricek
@Tomas: I think there's a fair chance that something calling itself XHTML will at some point be consumed by an XML parser, which, if it's not valid XML, will tell you. I see no reason to confuse readers by suggesting there are valid times to use regular expressions when parsing XHTML.
John Saunders
Yes, I give one vote to Tomas, as it is a valid point that the file may not be valid XML (I will have to double-check this, as it is user-provided content from ckEditor). Thanks for providing the LINQ code sample and for telling me about Html Agility Pack. Thank you.
Daveo