What regular expression can identify double quotes outside of HTML tags (which will already have been validated) so that they can be escaped to &quot;?
There is no such regular expression.
<p>
<!-- <a href=" --> is this outside " a tag <!-- "> foo </a> --> or not?
</p>
If you want to do this, you'll unfortunately have to use an HTML parser. Since you have already validated the HTML, you probably already have a parser to use.
Don't use regex for this, use (or write) a parser.
The following code assumes that the input HTML string is well formed (as you stated). Be warned that the code will break if it encounters invalid input!
If you can't be sure of the well-formedness, you can give PHP Tidy a try.
<?php
$html = '<tag>text "text"<tag attr="value"><!-- "text" --> text</tag> "text".';
echo html_escape_quotes($html);

/* Parses input HTML and escapes any literal double quotes
   in the text content with &quot;. Leaves tags and comments alone. */
function html_escape_quotes($html)
{
    $output = "";
    $length = strlen($html);
    $delim  = "<";   // "<" means we are currently reading text content
    $offset = 0;
    while ($offset < $length) {
        // Read up to (not including) the next delimiter, or to the end of input
        $tokpos = strpos($html, $delim, $offset);
        if ($tokpos === false) $tokpos = $length;
        $token  = substr($html, $offset, $tokpos - $offset);
        $offset = $tokpos;
        if ($delim == "<") {
            // Text content: escape quotes, then pick the delimiter that ends the upcoming tag or comment
            $token = str_replace('"', '&quot;', $token);
            $delim = substr($html, $offset, 4) == "<!--" ? "-->" : ">";
        } else {
            // Tag or comment body: copy verbatim and go back to reading text
            $delim = "<";
        }
        $output .= $token;
    }
    return $output;
}
?>
It's possible.
You might be able to do it with a regexp, with something similar to the one below. You will have to run it multiple times, though, as this regex replaces only one " with ' between tags per pass.
Search: (\<.+?\>.+?)(")(.+?\</.+?\>)
Replace: $1'$3
But the better approach would be to use callbacks to do the replacement. Just create a regex that sends the content between tags to a function, which can then simply replace " with whatever you want.
See more info here. Search for callback. As derobert noted, you might need to remove comments before that :)
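A minimal sketch of the callback idea, using PHP's preg_replace_callback. The function name and the pattern are illustrative, not taken from the answer. It assumes well-formed input in which no attribute value contains a literal ">", and it handles comments itself by matching them before ordinary tags.

```php
<?php
// Hypothetical helper name; not part of the original answer.
function escape_quotes_via_callback(string $html): string
{
    // Order matters: comments first, then tags, then runs of text.
    $pattern = '/<!--.*?-->|<[^>]*>|[^<]+/s';

    return preg_replace_callback($pattern, function (array $m): string {
        // Comments and tags start with "<" and are returned unchanged.
        if ($m[0][0] === '<') {
            return $m[0];
        }
        // Everything else is text content: escape its double quotes.
        return str_replace('"', '&quot;', $m[0]);
    }, $html);
}

echo escape_quotes_via_callback('<a href="x">say "hi"</a><!-- "c" --> "end"'), "\n";
```

Because the whole input is consumed in a single pass, there is no need to rerun the replacement until nothing changes.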
You could try to split the string and separate the tags from the text data with this expression:
<(?:\?[^?]+\?|\/?[A-Za-z]+(?:[^">]+|"[^"]*")*|!(?:\[CDATA\[(?:[^\]]+|](?:[^\]]|][^>]))*]]|--(?:[^-]+|-(?!->))*--))>
This will (hopefully) match any XML PI, element tag (opening or closing), CDATA and comment block.
So:
$parts = preg_split('/(<(?:\?[^?]+\?|\/?[A-Za-z]+(?:[^">]+|"[^"]*")*|!(?:\[CDATA\[(?:[^\]]+|](?:[^\]]|][^>]))*]]|--(?:[^-]+|-(?!->))*--))>)/', $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
$str = '';
foreach ($parts as $part) {
    if ($part[0] == '<') {
        // Tag, PI, CDATA or comment: keep as-is
        $str .= $part;
    } else {
        // Text content: escape the double quotes
        $str .= str_replace('"', '&quot;', $part);
    }
}
But I doubt that this is very efficient. A real parser would be more efficient and correct.
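The loop above tells tags from text by looking at the first character of each part. As an alternative sketch: if PREG_SPLIT_NO_EMPTY is dropped and the pattern has exactly one capture group, preg_split returns strictly alternating text and delimiter parts, so the array index alone says which is which. The function name and the simplified pattern below are assumptions for illustration. The pattern does not cover PIs or CDATA, and it assumes no attribute value contains ">".

```php
<?php
// Hypothetical helper name and simplified pattern, for illustration only.
function escape_quotes_via_split(string $html): string
{
    // One capture group and no PREG_SPLIT_NO_EMPTY:
    // even indexes are text, odd indexes are the captured tags/comments.
    $parts = preg_split('/(<!--.*?-->|<[^>]*>)/s', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        $out .= ($i % 2 === 1) ? $part : str_replace('"', '&quot;', $part);
    }
    return $out;
}

echo escape_quotes_via_split('<b></b> "q" <i title="t">"x"</i>'), "\n";
```

Empty text parts between adjacent tags are kept on purpose, since they are what keeps the even/odd alternation intact.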
Not the best (it doesn't work in all situations), but good enough for me:
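function quoting(&$data) {
    // A quote in any of its spellings: raw ", Windows-1252 bytes, curly quotes, or HTML entities
    $quot = '(["\x93\x94\x84]|\“|\„|\”|\&ldquo;|\&bdquo;|\&rdquo;|\&quot;)';
    $parse = '<q>$2</q>';
    // Hide attribute values so that their quotes are not touched
    $data = preg_replace('/="([^"]*)"/', '*%Q:$1%*', $data);
    // Wrap each quoted run in <q>...</q>
    $data = preg_replace("/$quot(.*?)$quot/", $parse, $data);
    // Restore the attribute values
    $data = preg_replace('/\*%Q:(.*?)%\*/', '="$1"', $data);
}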
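The function above hides only double-quoted attribute values, so quotes inside comments or single-quoted attributes are still exposed to the second replacement. A variation on the same hide-and-restore trick, as a rough sketch, masks whole tags and comments instead. The helper name is hypothetical. It assumes the input contains no NUL bytes and that no attribute value contains ">", and it handles only plain " quotes to keep the example short.

```php
<?php
// Hypothetical helper; assumes the input contains no NUL bytes.
function wrap_quotes_outside_tags(string $html): string
{
    $stash = [];

    // 1. Swap every comment and tag for a numbered, NUL-delimited placeholder.
    $masked = preg_replace_callback('/<!--.*?-->|<[^>]*>/s', function (array $m) use (&$stash): string {
        $stash[] = $m[0];
        return "\0" . (count($stash) - 1) . "\0";
    }, $html);

    // 2. Only text is left, so quoted runs can be wrapped safely.
    $masked = preg_replace('/"(.*?)"/s', '<q>$1</q>', $masked);

    // 3. Put the original markup back in place of the placeholders.
    return preg_replace_callback('/\x00(\d+)\x00/', function (array $m) use ($stash): string {
        return $stash[(int) $m[1]];
    }, $masked);
}

echo wrap_quotes_outside_tags('<a href="x">He said "hi"</a>'), "\n";
```

An unmatched quote in the text is left as it is, because the wrapping step only fires on a complete pair.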