What regular expression can identify double quotes outside of HTML tags (which will already have been validated) so that they can be escaped to &quot;?

A: 

Would this work?

\"(?!\s*\w*>)
Most certainly not. Have you tried it on actual examples?
Tomalak
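
For illustration, here is a minimal check of the pattern proposed above. The sample string is a hypothetical input, not one from the thread. The negative lookahead only protects a quote that is directly followed by optional whitespace, word characters, and `>`, so most attribute quotes are escaped along with the text quotes.

```php
<?php
// The pattern from the answer above, wrapped in PHP regex delimiters.
$pattern = '/"(?!\s*\w*>)/';

// Hypothetical sample: two attributes plus quoted text content.
$sample = '<a href="x" class="y">say "hi"</a>';

$result = preg_replace($pattern, '&quot;', $sample);

// Only the quote directly before ">" survives; the other attribute
// quotes are (wrongly) escaped together with the text quotes.
echo $result, "\n";
// prints: <a href=&quot;x&quot; class=&quot;y">say &quot;hi&quot;</a>
```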
+5  A: 

There is no such regular expression.

<p>
  <!-- <a href=" --> is this outside " a tag <!-- "> foo </a> --> or not?
</p>

If you want to do this, you'll unfortunately have to use an HTML parser. Since you have already validated the HTML, you probably already have a parser to use.

derobert
Oh now now, I'm sure an expression for it exists. Whether you have to map-reduce it on a Beowulf cluster to actually perform the thing, before we enter the next ice age, is another question :P
brianreavis
@brianreavis: I realize you're jesting, but actually: http://en.wikipedia.org/wiki/Regular_language ... it's actually impossible, and that is mathematically provable.
derobert
+1  A: 

Don't use regex for this, use (or write) a parser.

The following code assumes that the input HTML string is well formed (as you stated). Be warned that the code will break if it encounters invalid input!

If you can't be sure of the well-formedness, you can give PHP Tidy a try.

<?php
$html = '<tag>text "text"<tag attr="value"><!-- "text" --> text</tag> "text".';
echo html_escape_quotes($html);

/* Parses input HTML and escapes any literal double quotes 
   in the text content with &quot;. Leaves comments alone.  */
function html_escape_quotes($html)
{
  $output = "";
  $length = strlen($html);
  $delim  = "<";
  $offset = 0;
  while ($offset < $length) {
    $tokpos = strpos($html, $delim, $offset);
    if ($tokpos === false) $tokpos = $length;

    $token  = substr($html, $offset, $tokpos - $offset);
    $offset = $tokpos;

    if ($delim == "<") {
      $token = str_replace('"', '&quot;', $token);
      $delim = substr($html, $offset, 4) == "<!--" ? "-->" : ">";
    } else {
      $delim = "<";
    }

    $output .= $token;
  }
  return $output;
}
?>
Tomalak
This doesn’t work if an attribute contains a `>`. This may not be common but it’s valid and therefore possible.
Gumbo
Hm... I would expect it to be escaped regardless. But you are right, in theory it is possible. +1 for the comment.
Tomalak
Having played around a bit with regex, I believe the following expression finds the valid end of the tag (always assuming valid HTML): /[^"<>]+((?:"[^"]*"|'[^']*')*[^">]*)*?>/ -- What do you think?
Tomalak
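
One possible way to address the `>`-inside-an-attribute case raised in the comments, without a more elaborate regex, is to track quote state while scanning. The following is only a sketch under the same assumption of well-formed input. The function name `escape_text_quotes` is hypothetical, and `<script>`, `<style>`, and CDATA sections are deliberately not handled.

```php
<?php
// Hypothetical helper (not from the thread): escapes double quotes in text
// content while tracking quoted attribute values, so that a ">" inside an
// attribute value does not end the tag early. Assumes well-formed input.
function escape_text_quotes(string $html): string
{
    $out   = '';
    $len   = strlen($html);
    $state = 'text'; // one of: text, tag, dq, sq, comment

    for ($i = 0; $i < $len; $i++) {
        $ch = $html[$i];
        switch ($state) {
            case 'text':
                if ($ch === '<') {
                    // Decide whether a comment or a regular tag starts here.
                    $state = substr($html, $i, 4) === '<!--' ? 'comment' : 'tag';
                    $out  .= $ch;
                } elseif ($ch === '"') {
                    $out .= '&quot;';
                } else {
                    $out .= $ch;
                }
                break;
            case 'tag':
                if ($ch === '"') {
                    $state = 'dq';
                } elseif ($ch === "'") {
                    $state = 'sq';
                } elseif ($ch === '>') {
                    $state = 'text';
                }
                $out .= $ch;
                break;
            case 'dq': // inside a double-quoted attribute value
                if ($ch === '"') {
                    $state = 'tag';
                }
                $out .= $ch;
                break;
            case 'sq': // inside a single-quoted attribute value
                if ($ch === "'") {
                    $state = 'tag';
                }
                $out .= $ch;
                break;
            case 'comment':
                $out .= $ch;
                // A comment ends at the first "-->".
                if ($ch === '>' && substr($html, $i - 2, 3) === '-->') {
                    $state = 'text';
                }
                break;
        }
    }
    return $out;
}

echo escape_text_quotes('<a title="a > b">say "hi"</a> <!-- "c" --> "d"'), "\n";
```

The attribute value and the comment are copied through unchanged; only the quotes around `hi` and `d` are escaped.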
A: 

It's possible.

You might be able to do it with a regex, with something similar to the one below. You will have to run it multiple times though, as each pass replaces only one " with ' between tags.

Search: (\<.+?\>.+?)(")(.+?\</.+?\>)
Replace: $1'$3

But the better approach would be to use a callback for the replacement. Just create a regex that passes the content between tags to a function, which can then simply replace " with whatever you want.

See more info here. Search for callback. As derobert noted, you might need to remove comments before that :)
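
A rough sketch of the callback idea in PHP, using `preg_replace_callback`. The function name `escape_quotes_with_callback` and the pattern are assumptions made for illustration, not something given in this answer. The pattern matches comments and tags naively, so a `>` inside an attribute value or the content of `<script>` and `<style>` blocks will still trip it up.

```php
<?php
// Hypothetical helper: tokenizes the input into comments, tags, and text
// runs, and escapes double quotes only in the text runs.
function escape_quotes_with_callback(string $html): string
{
    return preg_replace_callback(
        // comment | tag (naive: ends at the first ">") | text run
        '/<!--.*?-->|<[^>]*>|[^<]+/s',
        function (array $m): string {
            // Comments and tags start with "<" and are passed through.
            return $m[0][0] === '<'
                ? $m[0]
                : str_replace('"', '&quot;', $m[0]);
        },
        $html
    );
}

echo escape_quotes_with_callback('<p class="x">a "b" c</p><!-- "d" -->'), "\n";
// prints: <p class="x">a &quot;b&quot; c</p><!-- "d" -->
```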

majkinetor
Even if you remove comments (how? with a parser, I assume), you're left with all kinds of fun with e.g., quoted strings. < and > are valid inside quoted strings. And I didn't even mention <script>, <style>, and <plaintext> blocks.
derobert
Ye... you are right.
majkinetor
A: 

You could try to split the string and separate the tags from the text data with this expression:

<(?:\?[^?]+\?|/?[A-Za-z]+(?:[^">]+|"[^"]*")*|!(?:\[CDATA\[(?:[^\]]+|](?:[^\]]|][^>]))*]]|--(?:[^-]+|-(?!->))*--))>

This will (hopefully) match any XML PI, opening or closing element tag, CDATA and comment block.

So:

$parts = preg_split('/(<(?:\?[^?]+\?|\/?[A-Za-z]+(?:[^">]+|"[^"]*")*|!(?:\[CDATA\[(?:[^\]]+|](?:[^\]]|][^>]))*]]|--(?:[^-]+|-(?!->))*--))>)/', $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
$str = '';
foreach ($parts as $part) {
    if ($part[0] == '<') {
        $str .= $part;
    } else {
        $str .= str_replace('"', '&quot;', $part);
    }
}

But I doubt that this is very efficient. A real parser would be more efficient and correct.

Gumbo
A: 

Not the best (it does not work in all situations), but good enough for me:

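function quoting(&$data) {
 // Straight, curly (Windows-1252 bytes), and entity-encoded double quotes.
 $quot  = '(["\x93\x94\x84]|\&#8220;|\&#8222;|\&#8221;|\&ldquo;|\&bdquo;|\&rdquo;|\&quot;|\&#34;)';
 $parse = '<q>$2</q>';
 // Hide double-quoted attribute values behind a placeholder.
 $data  = preg_replace('/="([^"]*)"/', '*%Q:$1%*', $data);
 // Wrap each remaining quoted run in <q>...</q>.
 $data  = preg_replace("/$quot(.*?)$quot/", $parse, $data);
 // Restore the attribute values.
 $data  = preg_replace('/\*%Q:(.*?)%\*/', '="$1"', $data);
}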