




I'm trying to use (.+?) to isolate the words "I. NEED. ISOLATION" in the source below:

    <td valign="top" width="82%"> <font face="Arial" size="2"> 
      I. NEED. ISOLATION  </font> </td>

using (.+?), I could do this:

$regex = '/stuff before(.+?)stuff after/';

and for this html, that would be:

$regex = '/<strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2"> 
      (.+?)  </font> </td>/';

but it's choking up on it because of incorrect escaping. I'm not great in PHP. Can someone please advise which characters I should also escape based on html that looks like this?

    <td valign="top" width="82%"> <font face="Arial" size="2"> 
      I. NEED. ISOLATION  </font> </td>

Note that I'm not trying to design a regex pattern. I already have the pattern nailed down with (.+?), just need to know how to correctly escape the html so that php doesn't choke up on it.

+3  A: 

See this previous StackOverflow question.

That said, the escaping issue is due to the / characters within the pattern, which confuse the regex parser because you're already using / to delimit the regex.
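To make the delimiter problem concrete, here is a minimal sketch. The `$html` string and the shortened patterns are assumptions made up for illustration, not the asker's full regex. With an unescaped `/`, PCRE treats the first inner slash as the end of the pattern and tries to read the rest as modifiers, so `preg_match` emits a warning and returns `false`.

```php
<?php
// Hypothetical one-line sample, cut down from the HTML in the question.
$html = '<font face="Arial" size="2"> I. NEED. ISOLATION  </font> </td>';

// Unescaped: the pattern ends at the "/" of "</font>", and "font> ..." is
// then parsed as modifiers. The @ only hides the resulting warning.
$broken = @preg_match('/> (.+?)  </font> </td>/', $html, $unused);

// Escaped: every "/" inside the pattern is written as "\/".
$fixed = preg_match('/> (.+?)  <\/font> <\/td>/', $html, $m);

var_dump($broken); // false, because the pattern failed to compile
var_dump($fixed);  // 1, because the pattern matched
var_dump($m[1]);   // the captured text
```

Nothing else in the HTML needs escaping here; the slashes are the only characters that collide with the delimiter.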

+1 namely, the world-famous accepted answer to it. :)
That's a lot of info, but I'm trying to use `(.+?)` to isolate the string I need, so I'm not really trying to match all those patterns. Just trying to figure out what should be escaped and how. Thanks.
Yes, we realize that (and I mentioned what you need to escape in my answer). However, we're also giving you a helpful suggestion that you may want to consider alternatives to regex - it can often avoid hassles in the future if you need to expand beyond a limited case.
+2  A: 

First of all, you should really not use regular expressions to try to "parse" HTML -- which is not quite regular.

Going with something like DOMDocument::loadHTML and some XPath query is generally a much better solution.
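As a rough sketch of that approach: the surrounding table markup, the `$html` variable, and the XPath query below are assumptions built around the snippet in the question, so adjust them to the real page.

```php
<?php
// Hypothetical sample input, modelled on the HTML in the question.
$html = <<<HTML
<table><tr>
    <td><font face="Arial" size="2"><strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2">
      I. NEED. ISOLATION  </font> </td>
</tr></table>
HTML;

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Find the cell that contains the label, then take the next <td> sibling.
$query = '//td[.//strong[normalize-space(.) = "Label:"]]/following-sibling::td[1]';
$nodes = $xpath->query($query);

// textContent still carries the surrounding whitespace, so trim it.
$value = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
var_dump($value);
```

Unlike the literal regex, this does not depend on the exact spaces and line breaks between the tags.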

But if you really want to go with a regex (and it seems you do, judging from your comments on other answers), you should not use / as the regex delimiter: there are too many slashes in HTML already -- it'll be escaping hell, as you already noticed.

For instance, you could use a # as regex delimiter:

$str = <<<STR
<strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2"> 
      I. NEED. ISOLATION  </font> </td>
STR;

$regex = '#<strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2"> 
      (.+?)  </font> </td>#';
if (preg_match($regex, $str, $m)) {
    // $m[1] holds the text captured by (.+?)
    var_dump($m[1]);
}

This will get you:

string 'I. NEED. ISOLATION' (length=18)

Note the only thing I changed compared to your proposed code is the regex delimiter ;-)

And, by using a character that's not present in the HTML string, I don't have anything to escape -- in particular, I don't have to escape all the /s -- which means the regex is far easier to write, read, and understand.


If you’re using PCRE regular expressions, you need to escape the delimiters inside the regular expression (in your case the /):

$regex = '/<strong>Label:<\/strong><\/font><\/td>
    <td valign="top" width="82%"> <font face="Arial" size="2"> 
      (.+?)  <\/font> <\/td>/';

But probably more important: Regular expressions are not suitable for parsing HTML. Better use a proper HTML parser like the one provided by DOMDocument and query it with DOMXPath.

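$str = <<<A
    <td valign="top" width="82%"> <font face="Arial" size="2">
      I. NEED. ISOLATION  </font> </td>
A;

$s = explode("</font>", $str);
foreach ($s as $k => $v) {
    // strpos() returns 0 for a match at the very start, so compare against false explicitly
    if (strpos($v, '<font face="Arial" size="2">') !== false) {
        $t = explode('<font face="Arial" size="2">', $v);
        print trim($t[1]) . "\n";
    }
}

$ php test.php
I. NEED. ISOLATION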

There is a function that does that for you. It's named preg_quote: http://pl2.php.net/preg_quote

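// The second argument tells preg_quote to also escape the / delimiter,
// which it does not escape by default.
$regex = '/'.preg_quote('<strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2"> 
      ', '/').'(.+?)'.preg_quote('  </font> </td>', '/').'/';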

You should also be careful with case sensitivity and line breaks. I often tend to add flags to my regexps to deal with it so they look like /(.+?)/is
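A small sketch of what those two flags change. The input string here is a made-up variant of the question's HTML with upper-case tags and a line break inside the text, and `~` is used as the delimiter only to avoid escaping the slashes.

```php
<?php
// Hypothetical input: upper-case tag names and a newline inside the cell text.
$str = "<FONT face=\"Arial\" size=\"2\">\nI. NEED.\nISOLATION </FONT>";

$pattern = '<font face="Arial" size="2">(.+?)</font>';

// No modifiers: "font" does not match "FONT", and . stops at newlines.
$plain = preg_match('~' . $pattern . '~', $str, $m1);

// i = case-insensitive matching, s = . also matches newline characters.
$flagged = preg_match('~' . $pattern . '~is', $str, $m2);

var_dump($plain);   // 0, no match
var_dump($flagged); // 1, match
var_dump(trim($m2[1] ?? ''));
```

Note that the `i` flag applies to the whole pattern, so attribute values such as `Arial` are matched case-insensitively as well.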

Kamil Szot

As a matter of fact, there's nothing in that string that has special meaning in a regex (except the (.+?), of course). The only reason the / is causing a problem is because you're using it as the regex delimiter. You just need to choose a different delimiter, like ~ for example:

$regex = '~<strong>Label:</strong></font></td>
    <td valign="top" width="82%"> <font face="Arial" size="2"> 
      (.+?)  </font> </td>~';
Alan Moore