ansaurus

Question

PHP: Escape Quotes ONLY outside of HTML tags (Regex)

Answer 1

A:

Would this work?

\"(?!\s*\w*>)

2009-04-17 06:26:01

Most certainly not. Have you tried on actual examples?

Tomalak 2009-04-17 06:48:20
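
Trying the pattern on an actual example shows the problem. The following is only a quick check, not part of the original answer; the sample input is made up for illustration. The negative lookahead spares only a quote that is directly followed by optional whitespace, word characters and a `>`, so every other quote inside the tag is escaped too:

```php
<?php
// The pattern proposed in the answer, unchanged.
$pattern = '/\"(?!\s*\w*>)/';

// Hypothetical sample input: one tag with two attributes, plus quoted text.
$html = '<a href="x" title="y">say "hi"</a>';

$result = preg_replace($pattern, '&quot;', $html);
echo $result, "\n";
// Prints: <a href=&quot;x&quot; title=&quot;y">say &quot;hi&quot;</a>
```

Only the last attribute quote survives, so the attributes are broken.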

Answer 2

+5 A:

There is no such regular expression.

<p>
  <!-- <a href=" --> is this outside " a tag <!-- "> foo </a> --> or not?
</p>

If you want to do this, you'll unfortunately have to use an HTML parser. Since you have already validated the HTML, you probably already have a parser to use.

derobert 2009-04-17 06:33:28

Oh now now, I'm sure an expression for it exists. Whether you have to map-reduce it on a Beowulf cluster to actually perform the thing, before we enter the next ice age, is another question :P

brianreavis 2009-09-18 13:11:37

@brianreavis: I realize you're jesting, but actually: http://en.wikipedia.org/wiki/Regular_language ... it's actually impossible, and that is provable mathematically.

derobert 2009-09-19 17:03:56

Answer 3

+1 A:

Don't use regex for this, use (or write) a parser.

The following code assumes that the input HTML string is well formed (as you stated). Be warned that the code will break if it encounters invalid input!

If you can't be sure of the well-formedness, you can give PHP Tidy a try.

<?php
$html = '<tag>text "text"<tag attr="value"><!-- "text" --> text</tag> "text".';
echo html_escape_quotes($html);

/* Parses input HTML and escapes any literal double quotes 
   in the text content with &quot;. Leaves comments alone.  */
function html_escape_quotes($html)
{
  $output = "";
  $length = strlen($html);
  $delim  = "<";
  $offset = 0;
  while ($offset < $length) {
    $tokpos = strpos($html, $delim, $offset);
    if ($tokpos === false) $tokpos = $length;

    $token  = substr($html, $offset, $tokpos - $offset);
    $offset = $tokpos;

    if ($delim == "<") {
      $token = str_replace('"', '&quot;', $token);
      $delim = substr($html, $offset, 4) == "<!--" ? "-->" : ">";
    } else {
      $delim = "<";
    }

    $output .= $token;
  }
  return $output;
}
?>

Tomalak 2009-04-17 06:46:20

This doesn’t work if an attribute contains a `>`. This may not be common but it’s valid and therefore possible.

Gumbo 2009-04-17 11:26:25

Hm... I would expect it to be escaped regardless. But you are right, in theory it is possible. +1 for the comment.

Tomalak 2009-04-17 11:54:25

Having played around a bit with regex, I believe the following expression finds the valid end of the tag (always assuming valid HTML): /[^"<>]+((?:"[^"]*"|'[^']*')*[^">]*)*?>/ -- What do you think?

Tomalak 2009-04-17 11:56:00
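
The `>`-inside-an-attribute case raised in these comments can also be handled without a regex, by tracking whether the scanner is inside a quoted attribute value. The following is a minimal sketch of that idea, not the code from the answer: the function name `escape_quotes_outside_tags` is made up, well-formed input is assumed, and `<script>`, `<style>` and CDATA sections are not treated specially.

```php
<?php
// Hypothetical helper: escapes double quotes in text content only.
// Assumes well-formed HTML; <script>, <style> and CDATA are not special-cased.
function escape_quotes_outside_tags(string $html): string
{
    $output = '';
    $length = strlen($html);
    $i = 0;
    while ($i < $length) {
        $lt = strpos($html, '<', $i);
        if ($lt === false) {
            $lt = $length;
        }
        // Text content before the next tag: escape its double quotes.
        $output .= str_replace('"', '&quot;', substr($html, $i, $lt - $i));
        if ($lt === $length) {
            break;
        }
        if (substr($html, $lt, 4) === '<!--') {
            // Comment: copy verbatim up to and including "-->".
            $end = strpos($html, '-->', $lt + 4);
            $end = ($end === false) ? $length : $end + 3;
        } else {
            // Tag: walk to the closing ">", skipping quoted attribute values.
            $end = $lt + 1;
            $quote = '';
            while ($end < $length) {
                $ch = $html[$end++];
                if ($quote !== '') {
                    if ($ch === $quote) {
                        $quote = ''; // closing quote of the attribute value
                    }
                } elseif ($ch === '"' || $ch === "'") {
                    $quote = $ch;    // opening quote of an attribute value
                } elseif ($ch === '>') {
                    break;           // real end of the tag
                }
            }
        }
        // Copy the tag or comment unchanged.
        $output .= substr($html, $lt, $end - $lt);
        $i = $end;
    }
    return $output;
}

echo escape_quotes_outside_tags('<a title="x > y">say "hi"</a> "end"'), "\n";
// Prints: <a title="x > y">say &quot;hi&quot;</a> &quot;end&quot;
```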

Answer 4

A:

It's possible.

You might be able to do it with a regexp, with something similar to the one below. You will have to run it multiple times though, as this regex replaces only one " with ' between tags per pass.

Search: (\<.+?\>.+?)(")(.+?\</.+?\>)
Replace: $1'$3

But the better approach would be to use callbacks for the replacement. Just create a regex that passes the content between tags to a function, which can then simply replace " with whatever you want.

See more info here. Search for callback. As derobert noted, you might need to remove comments before that :)

majkinetor 2009-04-17 10:45:01

Even if you remove comments (how? with a parser, I assume), you're left with all kinds of fun with e.g., quoted strings. < and > are valid inside quoted strings. And I didn't even mention <script>, <style>, and <plaintext> blocks.

derobert 2009-04-17 19:50:41

Ye... you are right.

majkinetor 2009-04-17 22:58:27
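
For reference, the callback idea from this answer could look roughly like the sketch below, using preg_replace_callback. The function name and the pattern are assumptions, not taken from the answer. The pattern copies comments and tags through unchanged (allowing `<` and `>` inside quoted attribute values) and escapes quotes only in the text runs in between. As noted in the comments, it still knows nothing about `<script>`, `<style>` or `<plaintext>` blocks.

```php
<?php
// Hypothetical helper illustrating the callback approach.
function escape_quotes_with_callback(string $html): string
{
    // Group 1: a comment, or a tag whose quoted attribute values may contain ">".
    // Group 2: a run of text between tags.
    $pattern = '/(<!--.*?-->|<(?:[^>"\']|"[^"]*"|\'[^\']*\')*>)|([^<]+)/s';

    $result = preg_replace_callback($pattern, function (array $m): string {
        if ($m[1] !== '') {
            return $m[1]; // tag or comment: leave untouched
        }
        return str_replace('"', '&quot;', $m[2]); // text: escape quotes
    }, $html);

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

echo escape_quotes_with_callback('<p class="a>b">He said "hi"</p><!-- "c" -->'), "\n";
// Prints: <p class="a>b">He said &quot;hi&quot;</p><!-- "c" -->
```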

Answer 5

A:

You could try to split the string and separate the tags from the text data with this expression:

<(?:\?[^?]+\?>|[A-Za-z]+(?:[^">]+|"[^"]*")*|!(?:\[CDATA\[(?:[^\]]+|](?:[^\]]|][^>]))*]]|--(?:[^-]+|-(?!->))*--))>

This will (hopefully) match any XML PI, element tag, CDATA and comment block.

So:

$parts = preg_split('/(<(?:\?[^?]+\?>|[A-Za-z]+(?:[^">]+|"[^"]*")*|!(?:\[CDATA\[(?:[^\]]+|](?:[^\]]|][^>]))*]]|--(?:[^-]+|-(?!->))*--))>)/', $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
$str = '';
foreach ($parts as $part) {
    if ($part[0] == '<') {
        $str .= $part;
    } else {
        $str .= str_replace('"', '&quot;', $part);
    }
}

But I doubt that this is very efficient. A real parser would be more efficient and correct.

Gumbo 2009-04-17 11:17:05

Answer 6

A:

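Not the best (it doesn't work in all situations), but good enough for me:

function quoting(&$data) {
 // Any opening or closing double-quote character or entity.
 $quot  = '(["\x93\x94\x84]|\&#8220;|\&#8222;|\&#8221;|\&ldquo;|\&bdquo;|\&rdquo;|\&quot;|\&#34;)';
 $parse = '<q>$2</q>';
 // Hide double-quoted attribute values behind a placeholder.
 $data  = preg_replace('/="([^"]*)"/', '*%Q:$1%*', $data);
 // Wrap each remaining quoted run in <q>...</q>.
 $data  = preg_replace("/$quot(.*?)$quot/", $parse, $data);
 // Restore the attribute values.
 $data  = preg_replace('/\*%Q:(.*?)%\*/', '="$1"', $data);
}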

2009-09-18 13:02:21
