Imagine the following.

  1. HTML is parsed into a DOM tree.
  2. DOM nodes become available programmatically.
  3. DOM nodes may or may not be augmented programmatically.
  4. Augmented nodes are reserialised to HTML.

My question is primarily about how one would want the "script" tag to behave.

my $tree = someparser( $source );
# ...
print $somenode->text();
$somenode->text('arbitraryjavascript');
# ...
print $tree->serialize();

Or to that effect.

The problem is deciding how to treat the contents of this element, balancing ease of use against the portability and usability of the emitted HTML.

What I'm wanting to do myself is this:

 $somenode->text("verbatim");

-->

  <script>
  // <!-- <![CDATA[ 
  verbatim
  // ]]> -->
  </script>

So that what I produce is both somewhat safe and validation-friendly.
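As a minimal sketch of that output step: the helper name `wrap_script` is made up for illustration and is not part of any existing parser. It assumes the code is passed in without guards and only trims surrounding whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap raw JavaScript in the comment/CDATA guard
# and surround it with <script> tags.
sub wrap_script {
    my ($code) = @_;
    $code =~ s/\A\s+|\s+\z//g;    # trim surrounding whitespace only
    return join "\n",
        '<script>',
        '// <!-- <![CDATA[',
        $code,
        '// ]]> -->',
        '</script>';
}

print wrap_script('verbatim'), "\n";
```

Note that this does nothing clever with the code itself; it only adds the two guard lines around it.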

But I'm undecided whether doing this magically is a good idea, and whether I should have code that tries to detect existing 'safety blocks' and strip or replace them during the parse phase.

If I don't strip them from the input, I'm likely to double them up in the output phase, which is especially problematic if the output of this code is later re-parsed.

If I strip them from the input, there is the added benefit that code fetching the content of the script element programmatically won't see the safety blocks at either end.
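A rough sketch of what stripping on input could look like. `strip_guards` is a hypothetical name, and the patterns assume the guard is exactly the `//`, `<!--`, `<![CDATA[` combination from this question, with optional whitespace between the parts.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the opening and closing guard lines from the
# raw text of a <script> element, leaving only the code in between.
sub strip_guards {
    my ($text) = @_;
    # Opening guard: "//", "<!--", "<![CDATA[" with optional whitespace.
    $text =~ s{\A\s*//\s*<!--\s*<!\[CDATA\[[ \t]*\n?}{};
    # Closing guard: "//", "]]>", "-->" at the very end of the text.
    $text =~ s{\n?[ \t]*//\s*\]\]>\s*-->\s*\z}{};
    $text =~ s/\A\s+|\s+\z//g;    # trim what is left
    return $text;
}

my $raw = "\n  // <!-- <![CDATA[ \n  verbatim\n  // ]]> -->\n  ";
print strip_guards($raw), "\n";    # prints: verbatim
```

Text that carries no guard passes through unchanged apart from the trimming, so calling this on already-stripped content is harmless.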

Ultimately there will be a way of toggling some of this behaviour off, but the question is what the /default/ way of handling this should be, and why.

It's possible my entire reasoning is flawed here and the text contents should pass through completely unprocessed unless processing is explicitly requested.

What behaviour do you look for in such a tool? Please point out anything in my reasoning I may have overlooked.
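One way such a toggle might look, purely as a sketch: the function name `script_html` and the `guard` option are invented for this example, and "on by default" is just one of the two possible answers to the question.

```perl
use strict;
use warnings;

# Hypothetical serialiser with a 'guard' option that defaults to on.
sub script_html {
    my ($code, %opt) = @_;
    my $guard = exists $opt{guard} ? $opt{guard} : 1;
    my @lines = ('<script>');
    push @lines, '//<!--<![CDATA[' if $guard;
    push @lines, $code;
    push @lines, '//]]>-->' if $guard;
    push @lines, '</script>';
    return join "\n", @lines;
}

print script_html('foo();'), "\n\n";              # guarded (default)
print script_html('foo();', guard => 0), "\n";    # emitted verbatim
```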


TL;DR summary: How should I programmatically handle the escaping mechanism in these scripts, namely the '//<!--<![CDATA[' safety padding at either end, with respect to input/output?

A: 

The only thing similar I can think of is in ASP.NET's register script block functions. They all have an overload that takes a bool for whether script tags should be added or not.

Here's a link to the docs for one:

http://msdn.microsoft.com/en-us/library/bahh2fef.aspx

Lou Franco
+1  A: 

I'm adding my own answer here so it's more obvious what I'm trying to find out. The current idea I have settled on would perform as follows:

my $html = <<'EOF';
<script>
//<!--<![CDATA[
foo
//]]>-->
</script>
EOF
#/# this line is here for the syntax highlighter
my $obj = parse($html); 
print $obj->text(); 
# foo
$obj->text("bar");
print $obj->text(); 
# bar
print $obj->html(); 
# <script>
# //<!--<![CDATA[
# bar
# //]]>-->
# </script>

Important points being:

  1. The XML/HTML/legacy-browser/bot protection mechanisms are removed for the internal code view.
  2. Inline code can thus be manipulated as if they were not there.
  3. Re-exporting modified code puts the protection mechanisms back on.
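A minimal sketch of an object that behaves this way. The class name `ScriptNode` and the constructor `from_raw` are hypothetical, the node only models the text between the script tags, and only the exact guard form shown above is recognised.

```perl
use strict;
use warnings;

package ScriptNode;    # hypothetical class name, for illustration only

my $OPEN  = '//<!--<![CDATA[';
my $CLOSE = '//]]>-->';

# Build a node from the raw text found between <script> and </script>.
sub from_raw {
    my ($class, $raw) = @_;
    $raw =~ s{\A\s*\Q$OPEN\E[ \t]*\n?}{};      # drop the opening guard
    $raw =~ s{\n?[ \t]*\Q$CLOSE\E\s*\z}{};     # drop the closing guard
    $raw =~ s/\A\s+|\s+\z//g;                  # trim what is left
    return bless { text => $raw }, $class;
}

# Getter/setter for the unguarded code.
sub text {
    my $self = shift;
    $self->{text} = shift if @_;
    return $self->{text};
}

# Serialise with the guards put back.
sub html {
    my $self = shift;
    return join "\n", '<script>', $OPEN, $self->{text}, $CLOSE, '</script>';
}

package main;

my $node = ScriptNode->from_raw("\n//<!--<![CDATA[\nfoo\n//]]>-->\n");
print $node->text, "\n";    # prints: foo
$node->text('bar');
print $node->html, "\n";
```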

If the input had

  1. No protection mechanisms
  2. Different ( ie: no // , or no <!--, or no <![CDATA[ ) parts

the existing protections would be stripped and replaced with the bog-standard form specified above.
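To make the normalisation concrete, here is a sketch under the assumption that a guard is any leading line made of an optional `//` followed by `<!--` and/or `<![CDATA[`, with the mirror image at the end. The names `strip_guards` and `normalise` are made up for the example.

```perl
use strict;
use warnings;

my $OPEN  = '//<!--<![CDATA[';
my $CLOSE = '//]]>-->';

# Hypothetical helper: strip whichever guard variant is present, if any.
sub strip_guards {
    my ($text) = @_;
    # Opening: optional "//", then "<!--" and/or "<![CDATA[".
    $text =~ s{\A\s*(?://)?\s*(?:<!--\s*(?:<!\[CDATA\[)?|<!\[CDATA\[)[ \t]*\n?}{};
    # Closing: optional "//", then "]]>" and/or "-->", at the very end.
    $text =~ s{\n?[ \t]*(?://)?\s*(?:\]\]>\s*(?:-->)?|-->)\s*\z}{};
    $text =~ s/\A\s+|\s+\z//g;
    return $text;
}

# Hypothetical helper: replace any existing guard with the standard one.
sub normalise {
    my ($raw) = @_;
    return join "\n", $OPEN, strip_guards($raw), $CLOSE;
}

# Unguarded, comment-only, CDATA-only and standard input all come out
# in the same standard form.
for my $raw ("foo", "<!--\nfoo\n-->", "//<![CDATA[\nfoo\n//]]>",
             "//<!--<![CDATA[\nfoo\n//]]>-->") {
    print normalise($raw), "\n---\n";
}
```

Because the standard form is itself recognised as a guard, running `normalise` on its own output gives the same result, which avoids the doubling-up problem on re-parse.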

Kent Fredric