I've got to get a quick and dirty configuration editor up and running. The flow goes something like this:

1. Configuration objects (POCOs on the server) are serialized to XML.
2. The XML is well formed at this point. The configuration is sent to the web server as XElements.
3. On the web server, the XML (Yes, ALL OF IT) is dumped into a textarea for editing.
4. The user edits the XML directly in the webpage and clicks Submit.
5. In the response, I retrieve the altered text of the XML configuration. At this point, ALL escapes have been reverted by the process of displaying them in a webpage.
6. I attempt to load the string into an XML object (XmlElement, XElement, whatever). KABOOM.

The problem is that serialization escapes attribute strings, but this is lost in translation along the way.

For example, let's say I have an object that has a regex. Here's the configuration as it comes to the web server:

<Configuration>
  <Validator Expression="[^&lt;]" />
</Configuration>

So, I put this into a textarea, where it looks like this to the user:

<Configuration>
  <Validator Expression="[^<]" />
</Configuration>

So the user makes a slight modification and submits the changes back. On the web server, the response string looks like:

<Configuration>
  <Validator Expression="[^<]" />
  <Validator Expression="[^&]" />
</Configuration>

So, the user added another validator thingie, and now BOTH have attributes with illegal characters. If I try to load this into any XML object, it throws an exception because < and & are not valid within a text string. I CANNOT CANNOT CANNOT CANNOT use any kind of encoding function, as it encodes the entire bloody thing:

var result = Server.HtmlEncode(editedConfig);

results in

&lt;Configuration&gt;
  &lt;Validator Expression="[^&lt;]" /&gt;
  &lt;Validator Expression="[^&amp;]" /&gt;
&lt;/Configuration&gt;

This is NOT valid XML. If I try to load this into an XML element of any kind I will be hit by a falling anvil. I don't like falling anvils.

SO, the question remains... Is using regex replaces the ONLY way I can get this string ready for parsing into an XML object? Is there any way to "turn off constraints" when I load? How do you get around this???


One last response and then wiki-izing this, as I don't think there is a valid answer.

The XML I place in the textarea IS valid, escaped XML. The process of 1) putting it in the text area 2) sending it to the client 3) displaying it to the client 4) submitting the form it's in 5) sending it back to the server and 6) retrieving the value from the form REMOVES ANY AND ALL ESCAPES.

Let me say this again: I'M not un-escaping ANYTHING. Just displaying it in the browser does this!

Things to mull over: Is there a way to prevent this un-escaping from happening in the first place? Is there a way to take almost-valid XML and "clean" it in a safe manner?


This question now has a bounty on it. To collect the bounty, you demonstrate how to edit VALID XML in a browser window WITHOUT a 3rd party/open source tool that doesn't require me to use regex to escape attribute values manually, that doesn't require users to escape their attributes, and that doesn't fail when roundtripping (&amp;amp;amp;amp;etc;)

+7  A: 

Erm … How do you serialize? Usually, the XML serializer should never produce invalid XML.

/EDIT in response to your update: Do not display invalid XML to your user to edit! Instead, display the properly escaped XML in the TextBox. Repairing broken XML isn't fun and I actually see no reason not to display/edit the XML in a valid, escaped form.

Again I could ask: how do you display the XML in the TextBox? You seem to intentionally unescape the XML at some point.

/EDIT in response to your latest comment: Well yes, obviously, since it can contain HTML. You need to escape your XML properly before writing it out into an HTML page. By that, I mean the whole XML. So this:

<foo mean-attribute="&lt;">

becomes this:

&lt;foo mean-attribute="&amp;lt;"&gt;
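
To make that concrete, here is a minimal sketch of escaping the whole document before it is written between the textarea tags. It is JavaScript purely for illustration, `escapeHtml` is a hypothetical helper name, and on the server the same job would fall to whatever HTML-encoding routine your framework provides.

```javascript
// Escape a complete XML document so it can be embedded as textarea content.
// The ampersand must be replaced first; otherwise the "&" introduced by
// "&lt;" and "&gt;" would be escaped a second time.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

const xml = '<Validator Expression="[^&lt;]" />';

// This string is what goes into the page source, inside <textarea>...</textarea>.
const forPageSource = escapeHtml(xml);
console.log(forPageSource);
```

The browser decodes the entities once while parsing the page, so the user sees the original escaped XML in the box, and that is also what the form posts back.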
Konrad Rudolph
Correct. Error in the question. Fix'd
Will
Believe me, when you take escaped xml and drop it in a TEXT AREA for display on a webpage, it renders the escapes as their unescaped counterparts. I'm not doing this on purpose.
Will
Sorry, had some confusion in there about textBLOCKS and textAREAS
Will
A: 

This special character - "<" - should have been replaced with other characters so that your XML will be valid. Check this link for XML special characters:

http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

Try also to encode your TextBlock content before sending it to the deserializer:

// HtmlEncode is a static method on HttpUtility (System.Web)
string encodedText = HttpUtility.HtmlEncode(text);
mnour
Why, yes, that's right. The question, however, is HOW to do this. Regex replace? Or is there a safer, more reliable way to do it?
Will
I have edited my answer and added sample code to encode the text before sending it to the serializer.
mnour
That escapes EVERYTHING, turning valid xml (with some bad attributes) into not-xml. This does not work.
Will
Wait: are you showing users the _Markup_ too? That invalidates most of my other response, but at least the regex in the question I linked to may still help you.
Joel Coehoorn
+1  A: 

As you say, the normal serializer should escape everything for you.

The problem, then, is the text block: you need to handle anything passed through the textblock yourself.

You might try HttpUtility.HtmlEncode(), but I think the simplest method is to just encase anything you pass through the text block in a CDATA section.

Normally of course I would want everything properly escaped rather than relying on the CDATA "crutch", but I would also want to use the built-in tools to do the escaping. For something that is edited in its "hibernated" state by a user, I think CDATA might be the way to go.

Also see this earlier question:
http://stackoverflow.com/questions/157646/best-way-to-encode-text-data-for-xml


Update
Based on a comment to another response, I've realized you're showing the users the markup, not just the contents. Xml parsers are, well, picky. I think the best thing you could do in this case is to check for well-formedness before accepting the edited xml.

Perhaps try to automatically correct certain kinds of errors (like bad ampersands from my linked question), but then get the line number and column number of the first validation error from the .Net xml parser and use that to show users where their mistake is until they give you something acceptable. Bonus points if you also validate against a schema.
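
As a rough illustration of the "show users where their mistake is" idea, the sketch below reports the line and column of every ampersand that does not start an entity or character reference. It is JavaScript rather than .NET, `findBareAmpersands` is a hypothetical name, and the pattern is my own assumption about what counts as a reference, not the expression from the linked question. A real parser's error position would be more reliable.

```javascript
// Report 1-based line/column positions of ampersands that are not the start
// of a named entity (&name;) or a numeric reference (&#123; or &#x1F;).
function findBareAmpersands(text) {
  const bare = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/g;
  const hits = [];
  for (const match of text.matchAll(bare)) {
    const before = text.slice(0, match.index);
    // Line number = count of newlines before the match, plus one.
    const line = before.split("\n").length;
    // Column is measured from the character after the last newline.
    const column = match.index - (before.lastIndexOf("\n") + 1) + 1;
    hits.push({ line, column });
  }
  return hits;
}

const edited = '<Configuration>\n  <Validator Expression="[^&]" />\n</Configuration>';
console.log(findBareAmpersands(edited));
```

Note that this only catches stray ampersands. A raw "<" inside an attribute value cannot be told apart from markup this way.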

Joel Coehoorn
Yes, I am showing raw, sexxay naked markup in a TextBlock. As I said, quick and dirty configuration editor. Configuration = xml, editor = TextBlock + xml
Will
I'd like to avoid forcing users to escape stuff themselves. It comes out unescaped, meaning they have to weed through HUNDREDS of lines of XML to fix stuff that's broken BEFORE they change the config. NIGHTMARE.
Will
The idea here is that you could correct certain types of common errors for them, and at least show them where the problem is for errors if you can't correct it.
Joel Coehoorn
The big thing is that no matter what you shouldn't accept user input that will break the object, and there's no way you'll be able to account for every possible mistake someone could make in a document. So you will need to implement some validation logic anyway.
Joel Coehoorn
Quick and dirty; if they break it, it's broken. I don't mind that. But just the ACT of displaying it in a webpage breaks it. I want to prevent that, at a minimum. Regex?
Will
This expression will find any ampersand that isn't part of an entity, which is against the rules in Xml: |#[0-9]{2,4};)
Joel Coehoorn
I would at least _warn_ the user that the submitted values are not valid: give them a choice to try to correct or submit anyway. But I see where you have a tougher problem: now there's no difference between a less-than symbol that's used for a tag and one that's used for content.
Joel Coehoorn
A: 

Is this really my only option? Isn't this a common enough problem that it has a solution somewhere in the framework?

private string EscapeAttributes(string configuration)
{
    var lt = @"(?<=\w+\s*=\s*""[^""]*)<(?=[^""]*"")";
    configuration = Regex.Replace(configuration, lt, "&lt;");

    return configuration;
}

(edit: deleted ampersand replacement as it causes problems roundtripping)

Will
Regex just isn't that good at matching xml/html.
Joel Coehoorn
I know. It's scary. That's why I'm amazed there isn't something that I can use alternatively.
Will
+5  A: 
bobince
Thanks, but not helpful.
Will
+1  A: 

You could take a look at something like TinyMCE, which allows you to edit html in a rich text box. If you can't configure it to do exactly what you want, you could use it as inspiration.

phsr
Considered, rejected. Also, "demonstrate how to edit VALID XML in a browser window WITHOUT a 3rd party/open source tool". thanks for the answer, anyhow.
Will
+1  A: 

Note: Firefox (in my test) does not unescape in text areas as you describe. Specifically, this code:

<textarea cols="80" rows="10" id="1"></textarea>

<script>
var elem = document.getElementById("1");

elem.value = '\
<Configuration>\n\
  <Validator Expression="[^&lt;]" />\n\
</Configuration>\
'
alert(elem.value);
</script>

Is alerted and displayed to the user unchanged, as:

<Configuration>
  <Validator Expression="[^&lt;]" />
</Configuration>

So maybe one (un-viable?) solution is for your users to use Firefox.


It seems two parts to your question have been revealed:

1 XML that you display is getting unescaped.

For example, "&lt;" is unescaped as "<". But since "<" is also unescaped as "<", information is lost and you can't get it back.

One solution is for you to escape all the "&" characters, so that "&lt;" becomes "&amp;lt;". This will then be unescaped by the textarea as "&lt;". When you read it back, it will be as it was in the first place. (I'm assuming that the textarea actually changes the string, but Firefox isn't behaving as you report, so I can't check this.)
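
A small sketch of that round trip, under the assumption that the page is decoded exactly once on the way to the user. `escapeAmpersands` and `simulateBrowserDecode` are hypothetical names, and the decoder is only a stand-in that handles a handful of entities, not a model of a real HTML parser.

```javascript
// Server side: escape every ampersand before writing the XML into the page.
function escapeAmpersands(xml) {
  return xml.replace(/&/g, "&amp;");
}

// Stand-in for the single decoding pass the page goes through.
// "&amp;" is handled last so that "&amp;lt;" becomes "&lt;" and not "<".
function simulateBrowserDecode(markup) {
  return markup
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, "&");
}

const original = '<Validator Expression="[^&lt;]" />';

// Repeat the display/submit cycle a few times; the text should never change.
let current = original;
for (let i = 0; i < 3; i++) {
  current = simulateBrowserDecode(escapeAmpersands(current));
}
console.log(current === original);

// Without the extra escaping step the entity is lost on the first pass.
console.log(simulateBrowserDecode(original));
```

Because the escaping happens on every output and the decoding undoes exactly one layer, repeated edits do not pile up into "&amp;amp;amp;".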

Another solution (mentioned already I think) is to build/buy/borrow a custom text area (not bad if simple, but there's all the editing keys, ctrl-C, ctrl-shift-left and so on).

2 You would like users to not have to bother escaping.

You're in escape-hell:

A regex replace will mostly work... but how can you reliably detect the end quote ("), when the user might (legitimately, within the terms you've given) enter:

<Configuration>
  <Validator Expression="[^"<]" />
</Configuration>

Looking at it from the point of view of the regex syntax, it also can't tell whether the final " is part of the regex, or the end of it. Regex syntax usually solves this problem with an explicit terminator eg:

/[^"<]/

If users used this syntax (with the terminator), and you wrote a parser for it, then you could determine when the regex has ended, and therefore that the next " character is not part of the regex, but part of the XML, and therefore which parts need to be escaped. I'm not saying you should do this! I'm saying it's theoretically possible. It's pretty far from quick and dirty.
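
To see the quote problem in action, here is a JavaScript port of the attribute pattern from the EscapeAttributes answer earlier in this thread. This is a sketch, assuming an engine with lookbehind support (current Node and Chromium have it); `escapeLtInAttributes` is a hypothetical name.

```javascript
// Escape "<" only when it sits between name=" and the next double quote.
function escapeLtInAttributes(xml) {
  const lt = /(?<=\w+\s*=\s*"[^"]*)<(?=[^"]*")/g;
  return xml.replace(lt, "&lt;");
}

// Simple case: the "<" inside the attribute value is found and escaped.
const simple = '<Validator Expression="[^<]" />';
console.log(escapeLtInAttributes(simple));

// With a quote inside the value, the pattern takes that quote as the end of
// the attribute, never matches the "<" after it, and returns the text as is.
const withQuote = '<Validator Expression="[^"<]" />';
console.log(escapeLtInAttributes(withQuote));
```

The second input comes back untouched and still fails to parse, which is the ambiguity described above: nothing in the text says which quote closes the attribute.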

BTW: The same problem arises for text within an element. The following is legitimate, within the terms you've given, but has the same parsing problems:

<Configuration>
  <Expression></Expression></Expression>
</Configuration>

The basic rule in a syntax that allows "any text" is that the delimiter must be escaped, (e.g. " or <), so that the end can be recognized. Most syntax also escapes a bunch of other stuff, for convenience/inconvenience. (EDIT it will need to have an escape for the escape character itself: for XML, it is "&", which when literal is escaped as "&amp;" For regex, it is the C/unix-style "\", which when literal is escaped as "\\").

Nest syntaxes, and you're in escape-hell.

One simple solution for you is to tell your users: this is a quick and dirty configuration editor, so you're not getting any fancy "no need to escape" namby-pamby:

  • List the characters and escapes next to the text area, eg: "<" as "&lt;".
  • For XML that won't validate, show them the list again.


Looking back, I see bobince gave the same basic answer before me.

13ren
Pretty much. I'm still left with escaping user input. The issue I'm having is that it's not like I can use proper XML parsers/objects and "help" them when they encounter invalid XML; it's an all-or-nothing proposition. I have to regex replace to get it to work; I want to know if there are other ways.
Will
You have to parse in some way. Regex replace is the easiest, but I hope I've shown that you need to define the text content in such a way that you can determine where the text ends (as you can no longer rely on the " and < of XML syntax to do this for you), and it's hard to get the regex right.
13ren
I'm already doing this. I'm still hoping there might be a better way...
Will
+1  A: 

Inserting CDATA around all text would give you another escape mechanism that would (1) save users from manually escaping, and (2) enable the text that was automatically unescaped by the textarea to be read back correctly.

 <Configuration>
   <Validator Expression="<![CDATA[  [^<]   ]]>" />
 </Configuration>

:-)

13ren