http://stackoverflow.com/questions/2282614/regular-expression-to-match-chars-that-appear-inside-xml-nodes

I have an almost identical problem to the question linked above; however, I am using C#.

I'm not here to argue the validity of the XML.

What gets sent in is out of my control.

Input XML:

<PNODE> 
  <CNODE>This string contains > and < and & chars.</CNODE> 
</PNODE> 

I need it to look like this:

<PNODE> 
  <CNODE>This string contains &gt; and &lt; and &amp; chars.</CNODE> 
</PNODE> 

It looks like the asker there found a solution for PHP, which doesn't help me.

I need to find a way to escape the &, > and < characters inside the node, but leave the tag declarations alone.

A: 

There are a couple of .NET wrappers around the Tidy library.

http://users.rcn.com/creitzel/tidy.html#dotnet

http://www.codeproject.com/KB/mcpp/eftidynet.aspx

And there is a .NET port of Tidy.

klabranche
While I'm sure I could find a .Net Tidy solution, which would hopefully work - is there another way to do this that doesn't require the Tidy Library or any other third party additions?
As mentioned by @md5sum, although it is possible to do this on your own, why reinvent the wheel? Do you have an underlying requirement that rules out a 3rd party library or solution, especially one like Tidy .NET, which is open source?
klabranche
A: 

Use HttpUtility:

HttpUtility.HtmlEncode("<text to Encode>");
mledbetter
This will destroy `<foo><bar>uh oh bad > <`. He needs to fix the bad characters without damaging `<foo>` and `<bar>`.
Anthony Pegram
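To make the objection above concrete, here is a minimal sketch of what encoding the whole document does. It assumes `System.Net.WebUtility.HtmlEncode` as a stand-in for `HttpUtility.HtmlEncode` (it escapes <, > and & the same way and needs no System.Web reference); the class and method names are made up for illustration.

```csharp
using System;
using System.Net;

public static class HtmlEncodeDemo
{
    // Hypothetical helper: encodes the entire document in one call.
    public static string EncodeWholeDocument(string xml)
    {
        // HtmlEncode escapes every <, > and & it sees,
        // including the ones that delimit the tags themselves.
        return WebUtility.HtmlEncode(xml);
    }
}
```

Calling `HtmlEncodeDemo.EncodeWholeDocument("<foo>a > b</foo>")` returns `&lt;foo&gt;a &gt; b&lt;/foo&gt;`, so the tags are gone along with the stray characters.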
+1  A: 

Check out Tidy.Net. It's a .Net implementation of Tidy.

md5sum
While I'm sure I could find a .Net Tidy solution, which would hopefully work - is there another way to do this that doesn't require the Tidy Library or any other third party additions?
md5sum
Simply put, I wouldn't try to reinvent the wheel.
md5sum
A: 

You should have a look at SgmlReader:

http://developer.mindtouch.com/SgmlReader

It will give you exactly what you want :) I use it here: http://www.xmltools.dk/HtmlToXml Try it :) (You can disable the html tag and the uppercase-tags->lowercase-tags conversion.)

lasseespeholt
A: 

I've always just used replace for XML (saves me having to bring in HTTP libraries):

string output = inputXml.Replace("&", "&amp;")     // must come first
                        .Replace("<", "&lt;")
                        .Replace(">", "&gt;")
                        .Replace("'", "&apos;")     // optional
                        .Replace("\"", "&quot;");   // optional
Zippit
If he were going to do this he should probably just use `HttpUtility.HtmlEncode`, but it's already been established that this won't work...
Abe Miessler
But this will affect your nodes as well. I missed that part of your question.
Zippit
inputXml is the *entire* xml. This will replace perfectly good XML along with the unwanted characters in the content.
Anthony Pegram
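For anyone who still wants to avoid a third-party library, the following is a rough sketch of a Replace-style approach that escapes only the text between tags. It is a heuristic, not a parser: it assumes anything shaped like `<name ...>` or `</name>` is a real tag, so text such as `a <b and c> d` would be mistaken for markup. The class name, method names, and both patterns are illustrative assumptions, not something posted in this thread.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class LenientXmlEscaper
{
    // Assumption: a tag is '<', optional '/', a name, optional attributes, optional '/', '>'.
    private static readonly Regex TagPattern = new Regex(
        @"</?[A-Za-z_][\w:.\-]*(?:\s[^<>]*)?/?>", RegexOptions.Compiled);

    // An '&' that does not already start an entity such as &amp; or &#38;.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)", RegexOptions.Compiled);

    public static string EscapeTextContent(string input)
    {
        var output = new StringBuilder(input.Length + 16);
        int position = 0;

        foreach (Match tag in TagPattern.Matches(input))
        {
            // Escape the text that precedes this tag, then copy the tag unchanged.
            output.Append(EscapeText(input.Substring(position, tag.Index - position)));
            output.Append(tag.Value);
            position = tag.Index + tag.Length;
        }

        // Escape whatever follows the last tag.
        output.Append(EscapeText(input.Substring(position)));
        return output.ToString();
    }

    private static string EscapeText(string text)
    {
        // Ampersands first, so the entities added below are not escaped again.
        text = BareAmpersand.Replace(text, "&amp;");
        return text.Replace("<", "&lt;").Replace(">", "&gt;");
    }
}
```

With the input from the question, `LenientXmlEscaper.EscapeTextContent(inputXml)` leaves the PNODE and CNODE tags alone and turns the text into `This string contains &gt; and &lt; and &amp; chars.` Anything more ambiguous than that is where Tidy or SgmlReader earn their keep.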
A: 

"I'm not here to argue the validity of the XML."

As with that other question, the right answer is that what you got sent is not XML. It's a question of well-formedness, not a question of validity in the XML sense.
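The well-formedness point is easy to check in C#. A minimal sketch, assuming `XDocument.Parse` from System.Xml.Linq as the parser (the helper name `IsWellFormed` is made up for illustration): any conforming XML parser has to reject the input from the question, long before validity against a schema comes into it.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class WellFormednessCheck
{
    // Hypothetical helper: returns true if the parser accepts the text as XML.
    public static bool IsWellFormed(string xml, out string error)
    {
        try
        {
            XDocument.Parse(xml);
            error = string.Empty;
            return true;
        }
        catch (XmlException ex)
        {
            // Thrown for unescaped '<' or '&' in text, mismatched tags, and so on.
            error = ex.Message;
            return false;
        }
    }
}
```

For the question's input, `IsWellFormed` returns false, with the parser's complaint in `error`; for the desired output it returns true.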

"What gets sent in is out of my control."

That may be true, but if someone sent you a quart of used motor oil and asked you to transform it into HTML, would you still accept it? Usually data interchange is done based on a contract (formal or informal), that the interchanged data will adhere to certain criteria. If it doesn't live up to the agreed-upon criteria, the data can be sent back, rejected.

If you're not requiring XML as input, this question is not about "<, & chars that appear inside XML nodes". Rather, it's about parsing SGML that looks a lot like XML, but which has < and & chars that appear in text content.

And to do that, .NET Tidy and SGMLReader are good solutions, as others have said.

LarsH