ansaurus

Question

Answer 1

+3 A:

If you're sure the content is XHTML (i.e. well-formed XML) then XPath can certainly do it.

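using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<span tag=...");

// XmlNodeList is non-generic, so the loop variable must be declared as XmlNode.
// In XPath, attributes are addressed with @ and string values must be quoted.
foreach (XmlNode node in doc.SelectNodes("//span[@tag='x']"))
{
    node.InnerXml = "New Content";
}
foreach (XmlNode node in doc.SelectNodes("//span[@tag='y']"))
{
    node.InnerXml = "Different Content";
}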
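
One caveat with the XPath approach: a complete XHTML document normally declares the namespace http://www.w3.org/1999/xhtml on its root element, and an unprefixed `//span` then matches nothing. The following is a minimal sketch of handling that with `XmlNamespaceManager`. The class name `XhtmlXPathDemo`, the method `ReplaceSpans` and the prefix `h` are made up for illustration, and the sketch assumes the input has no DOCTYPE and that `tagValue` contains no single quote.

```csharp
using System;
using System.Xml;

static class XhtmlXPathDemo
{
    // Replaces the content of every <span> whose "tag" attribute equals tagValue.
    // Hypothetical helper: assumes tagValue contains no single quote.
    public static string ReplaceSpans(string xhtml, string tagValue, string newContent)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the input is not well-formed

        // Bind a prefix of our own choosing to the XHTML namespace for use in XPath.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("h", "http://www.w3.org/1999/xhtml");

        foreach (XmlNode node in doc.SelectNodes("//h:span[@tag='" + tagValue + "']", ns))
        {
            node.InnerXml = newContent;
        }
        return doc.OuterXml;
    }
}
```

Calling `XhtmlXPathDemo.ReplaceSpans(page, "x", "New Content")` would rewrite only the spans with `tag="x"`. For a fragment without a namespace declaration, the unprefixed query from the answer is the one to use.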

Dean Harding 2010-05-24 01:32:56

Thank you for the answer plus the sample code. Much appreciated

Daveo 2010-05-24 03:36:57

Answer 2

A:

You can certainly do this using regular expressions (it is string manipulation, after all), but that may get a bit nasty, because HTML can be quite complicated. Still, it is a possible approach.

An alternative would be to parse the XHTML page into a structured hierarchy and then do the processing. The question is whether the pages are really valid XML. The XHTML specification requires that, but if you pick a random page from the internet that claims to be XHTML, you may run into trouble.

If not, then you need to parse them as HTML, which can be done using Html Agility Pack.
If so, then you can treat them as XML and use the standard .NET classes to parse them.

The second case could be done using LINQ to XML like this:

// doc is an XDocument (System.Xml.Linq); Descendants is the plural method name.
var xs = from span in doc.Descendants("span")
         let tag = span.Attribute("tag")
         where tag != null && tag.Value == "x"
         select span;
foreach (var x in xs) x.Value = "BAR!";

The obvious benefit is that this is much more readable and maintainable than a solution based on regular expressions. Html Agility Pack provides a similar API (although I'm not familiar enough with it to write a sample).
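
To make the "is it really valid XML?" decision above concrete, here is a rough sketch that tries to parse the markup with LINQ to XML and reports failure instead of throwing, so the caller can fall back to an HTML parser. The names `SpanReplacer` and `TryReplace` are hypothetical, and matching on the local name is an assumption made so that spans are found with or without the XHTML namespace.

```csharp
using System;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

static class SpanReplacer
{
    // Returns true and the rewritten markup when the input parses as XML.
    // Returns false when it does not, so the caller can try an HTML parser instead.
    public static bool TryReplace(string markup, string tagValue, string newText, out string result)
    {
        XDocument doc;
        try
        {
            doc = XDocument.Parse(markup);
        }
        catch (XmlException)
        {
            result = null;
            return false;
        }

        // Compare local names so the query works with or without a default namespace.
        var spans = from span in doc.Descendants()
                    where span.Name.LocalName == "span"
                    let tag = span.Attribute("tag")
                    where tag != null && tag.Value == tagValue
                    select span;

        // Materialize the matches first so the tree is not modified while it is being walked.
        foreach (var span in spans.ToList())
        {
            span.Value = newText;
        }

        result = doc.ToString(SaveOptions.DisableFormatting);
        return true;
    }
}
```

A caller would check the return value: on `true` use `result`, on `false` hand the original string to something like Html Agility Pack.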

Tomas Petricek 2010-05-24 01:33:39

[No](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). **You CAN'T do it with regular expressions**.

SLaks 2010-05-24 01:35:10

This has to be linked when HTML and RegEx are mentioned in the same answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Nick Craver 2010-05-24 01:37:15

Hehe, great reference :-), but there _are_ cases where I would use regular expressions (if it wasn't _really_ XML and I needed a quick hack rather than solid solution). The title should really be **You'll burn in hell if you do it using regular expressions**. To me, "can't" and "regular expressions" in one sentence suggests that there should be a proof ;-)

Tomas Petricek 2010-05-24 01:39:42

-1: Tomas, I thought you'd know better.

John Saunders 2010-05-24 01:39:54

Tomas: the OP stated XHTML.

John Saunders 2010-05-24 01:40:11

@John Saunders: I see that he means "XHTML", but this is the world of so-called "web standards".

Tomas Petricek 2010-05-24 01:42:16

@Tomas: I think there's a fair chance that something calling itself XHTML will at some point be consumed by an XML parser, which, if it's not valid XML, will tell you. I see no reason to confuse readers by suggesting there are valid times to use regular expressions when parsing XHTML.

John Saunders 2010-05-24 01:45:35

Yes, I give one vote for Tomas, as it is a valid point that the file may not be valid XML (I will have to double-check this, as it is user-provided content from ckEditor). Thanks for providing the LINQ code sample and showing me Html Agility Pack. Thank you.

Daveo 2010-05-24 03:44:42

XML/XHTML replace content?

related questions