I've stumbled across what looks like a bug in PHP. The regular expression shown below works fine in one script (Script A) but fails when it is moved into a class and called from another script (Script B).

I have tested this on PHP 5.3 and 5.2.

Script A:
http://iamdb.googlecode.com/svn/trunk/testing.php

Script B:
Class the regex is used in: http://iamdb.googlecode.com/svn/trunk/imdb/search/imdb_search_title.class.php
Script calling it: http://iamdb.googlecode.com/svn/trunk/examples/Search_Debug.php

Regular Expression:

"#<br> aka <em>\"([^\"]*)\"</em>(?: -?,? ([^ ]*) (?:<em>\(([^\)]*)\)</em>)*)*#i"

Thanks.

As requested, here is some example output from Script B...

Array
(
    [0] => Array
        (
        )

    [1] => Array
        (
        )

    [2] => Array
        (
        )

    [3] => Array
        (
        )

    [INPUT] => <small>(TV series)</small>    <br>aka <em>"Hammer Time"</em> - USA <em>(working title)</em>
)

The numbered keys are from the preg_match_all call and the INPUT key is added afterwards to show the input string.

+2  A: 

Looking at it in the debugger, the subject strings passed to preg_match_all are not the same in the class and in the testing.php case.

From the test case:

<small>(TV series)</small>    <br> aka <em>"Sledge Hammer: The Early Years"</em> - USA <em>(second season title)</em>

The actual subject when called from the class:

<small>(TV series)</small>    <br>aka <em>"Hammer Time"</em> - USA <em>(working title)</em>

There's no space between the <br> and the aka. Take that space out of the regex and it works.
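
A minimal sketch of that difference, using only the pattern and the two subject strings quoted in this thread. The variable names ($original, $noSpace, $fromTest, $fromClass, $results) are made up for the illustration and are not taken from either script.

```php
<?php
// Pattern from the question: a literal space sits between <br> and aka.
$original = "#<br> aka <em>\"([^\"]*)\"</em>(?: -?,? ([^ ]*) (?:<em>\(([^\)]*)\)</em>)*)*#i";
// The same pattern with that one space removed.
$noSpace = "#<br>aka <em>\"([^\"]*)\"</em>(?: -?,? ([^ ]*) (?:<em>\(([^\)]*)\)</em>)*)*#i";

// Subject as seen in the test script ("<br> aka", with a space).
$fromTest = '<small>(TV series)</small>    <br> aka <em>"Sledge Hammer: The Early Years"</em> - USA <em>(second season title)</em>';
// Subject as seen when called from the class ("<br>aka", no space).
$fromClass = '<small>(TV series)</small>    <br>aka <em>"Hammer Time"</em> - USA <em>(working title)</em>';

// preg_match returns 1 on a match and 0 when nothing matches.
$results = array(
    'original pattern, test subject'  => preg_match($original, $fromTest),
    'original pattern, class subject' => preg_match($original, $fromClass),
    'no-space pattern, class subject' => preg_match($noSpace, $fromClass),
    'no-space pattern, test subject'  => preg_match($noSpace, $fromTest),
);

foreach ($results as $label => $matched) {
    echo $label, ': ', $matched ? 'match' : 'no match', "\n";
}
```

Note that simply deleting the space makes the class input match but stops matching the string from the test script, so a pattern that tolerates either form is probably the safer choice.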

Otterfan
Many thanks!
Andrew
+1  A: 

There's nothing wrong with the regex or embedding it in a class. You're convincing yourself that your test situations are equivalent when they're not. In the immediate case, the string you're sending the class version,

<small>(TV series)</small>    <br>aka <em>"Hammer Time"</em> - USA <em>(working title)</em>

isn't matched by the regex because the regex requires exactly one space between the <br> and the aka. This revision of it works:

const REGEX_AKA = "#<br>\s*aka <em>\"([^\"]*)\"</em>(?: (?:-?)(?:,?) ([^ ]*) (?:<em>\(([^\)]*)\)</em>)*)*#i";
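
For illustration, here is a small sketch that runs this revised pattern over both subject strings quoted in this thread and collects the captured titles. The names $pattern, $subjects and $titles are placeholders chosen for the example; the real class keeps the pattern in a class constant instead.

```php
<?php
// Revised pattern: \s* accepts zero or more whitespace characters after <br>.
$pattern = "#<br>\s*aka <em>\"([^\"]*)\"</em>(?: (?:-?)(?:,?) ([^ ]*) (?:<em>\(([^\)]*)\)</em>)*)*#i";

$subjects = array(
    // With a space after <br>, as in the test script.
    'test script' => '<small>(TV series)</small>    <br> aka <em>"Sledge Hammer: The Early Years"</em> - USA <em>(second season title)</em>',
    // Without the space, as received by the class.
    'class' => '<small>(TV series)</small>    <br>aka <em>"Hammer Time"</em> - USA <em>(working title)</em>',
);

$titles = array();
foreach ($subjects as $label => $subject) {
    // Default flag is PREG_PATTERN_ORDER: $matches[1] holds every capture of group 1.
    $count = preg_match_all($pattern, $subject, $matches);
    $titles[$label] = $matches[1];
    echo $label, ': ', $count, " match(es)\n";
    print_r($matches);
}
```

With both inputs, group 1 should hold the title, group 2 the country and group 3 the parenthesised note.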
chaos
A: 

Are you trying to match against an actual search-result page on IMDB, like this one? On that page, the "<br>" and the "aka" are always separated by an entity reference for a non-breaking space:

<br>&#160;aka <em>

I don't know if it's always that way; you might want to allow for multiple kinds and representations of whitespace, like this:

<br>(?:&(?:#(?:160|xA0)|nbsp);|\xA0|\s)*+aka

i.e., zero or more of: an entity reference for an NBSP (decimal, hexadecimal or named); a real NBSP; or a standard whitespace character.
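
As a rough sketch, the fragment could be tried out in PHP like this. The helper name hasAkaAfterBreak is hypothetical, and the example assumes the input is valid UTF-8 so that the u modifier can be used; with it, \xA0 matches the NBSP code point rather than a single byte. The delimiter is ~ instead of # because the fragment itself contains a # character.

```php
<?php
// Hypothetical helper: true when "aka" follows <br>, separated only by
// whitespace, a real NBSP, or an NBSP entity reference (or by nothing at all).
function hasAkaAfterBreak($html)
{
    // Single quotes keep \xA0 and \s intact for PCRE to interpret.
    $pattern = '~<br>(?:&(?:#(?:160|xA0)|nbsp);|\xA0|\s)*+aka~iu';
    return preg_match($pattern, $html) === 1;
}

$samples = array(
    'decimal entity' => '<br>&#160;aka <em>',
    'hex entity'     => '<br>&#xA0;aka <em>',
    'named entity'   => '<br>&nbsp;aka <em>',
    'real NBSP'      => "<br>\xC2\xA0aka <em>", // UTF-8 encoding of U+00A0
    'plain space'    => '<br> aka <em>',
    'no separator'   => '<br>aka <em>',
    'other text'     => '<br>-aka <em>',
);

$found = array();
foreach ($samples as $label => $html) {
    $found[$label] = hasAkaAfterBreak($html);
    echo $label, ': ', $found[$label] ? 'match' : 'no match', "\n";
}
```

Only the last sample, where a hyphen sits between the tag and the word, is expected to fail.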

Alan Moore
Yeah, I decode the HTML entities before running through my regular expressions.
Andrew