tags:

views:

635

answers:

6

Suppose I have the following two strings containing regular expressions. How do I coalesce them? More specifically, I want to have the two expressions as alternatives.

$a = '# /[a-z] #i';
$b = '/ Moo /x';
$c = preg_magic_coalesce('|', $a, $b);
// Desired result should be equivalent to:
// '/ \/[a-zA-Z] |Moo/'

Of course, doing this as string operations isn't practical because it would involve parsing the expressions, constructing syntax trees, coalescing the trees and then outputting another regular expression equivalent to the tree. I'm completely happy without this last step. Unfortunately, PHP doesn't have a RegExp class (or does it?).

Is there any way to achieve this? Incidentally, does any other language offer a way? Isn't this a pretty normal scenario? Guess not. :-(

Alternatively, is there a way to check efficiently if either of the two expressions matches, and which one matches earlier (and if they match at the same position, which match is longer)? This is what I'm doing at the moment. Unfortunately, I do this on long strings, very often, for more than two patterns. The result is slow (and yes, this is definitely the bottleneck).

EDIT:

I should have been more specific – sorry. $a and $b are variables, their content is outside of my control! Otherwise, I would just coalesce them manually. Therefore, I can't make any assumptions about the delimiters or regex modifiers used. Notice, for example, that my first expression uses the i modifier (ignore casing) while the second uses x (extended syntax). Therefore, I can't just concatenate the two because the second expression does not ignore casing and the first doesn't use the extended syntax (and any whitespace therein is significant!

+1  A: 

I'm pretty sure it's not possible to just put regexps together like that in any language - they could have incompatible modifiers.

I'd probably just put them in an array and loop through them, or combine them by hand.

Edit: If you're doing them one at a time as described in your edit, you maybe be able to run the second one on a substring (from the start up to the earliest match). That might help things.

Greg
Unfortunately, running consecutive expressions only on the substring doesn't work because they might *start* before the other expression, but *finish* after it. If I just test on a prefix, this situation won't get caught.
Konrad Rudolph
A: 
function preg_magic_coalasce($split, $re1, $re2) {
  $re1 = rtrim($re1, "\/#is");
  $re2 = ltrim($re2, "\/#");
  return $re1.$split.$re2;
}
tj111
Unfortunately this won't work - by removing the is modifiers you've changed the meaning of $re1
Greg
A: 

You could do it the alternative way like this:

$a = '# /[a-z] #i';
$b = '/ Moo /x';

$a_matched = preg_match($a, $text, $a_matches);
$b_matched = preg_match($b, $text, $b_matches);

if ($a_matched && $b_matched) {
    $a_pos = strpos($text, $a_matches[1]);
    $b_pos = strpos($text, $b_matches[1]);

    if ($a_pos == $b_pos) {
        if (strlen($a_matches[1]) == strlen($b_matches[1])) {
            // $a and $b matched the exact same string
        } else if (strlen($a_matches[1]) > strlen($b_matches[1])) {
            // $a and $b started matching at the same spot but $a is longer
        } else {
            // $a and $b started matching at the same spot but $b is longer
        }
    } else if ($a_pos < $b_pos) {
        // $a matched first
    } else {
        // $b matched first
    }
} else if ($a_matched) {
    // $a matched, $b didn't
} else if ($b_matched) {
    // $b matched, $a didn't
} else {
    // neither one matched
}
yjerem
Jeremy, thanks for the suggestion but `preg_match` has a flag (PREG_OFFSET_CAPTURE) that results in the match position being recorded. So *this isn't a problem*. It's just darn inefficient (although still much more efficient than your code).
Konrad Rudolph
+3  A: 
  1. Strip delimiters and flags from each. This regex should do it:

    /^(.)(.*)\1([imsxeADSUXJu]*)$/
    
  2. Join expressions together. You'll need non-capturing parenthesis to inject flags:

    "(?$flags1:$regexp1)|(?$flags2:$regexp2)"
    
  3. If there are any back references, count capturing parenthesis and update back references accordingly (e.g. properly joined /(.)x\1/ and /(.)y\1/ is /(.)x\1|(.)y\2/ ).

porneL
+3  A: 
eyelidlessness
+3  A: 

EDIT

I’ve rewritten the code! It now contains the changes that are listed as follows. Additionally, I've done extensive tests (which I won’t post here because they’re too many) to look for errors. So far, I haven’t found any.

  • The function is now split into two parts: There’s a separate function preg_split which takes a regular expression and returns an array containing the bare expression (without delimiters) and an array of modifiers. This might come in handy (it already has, in fact; this is why I made this change).

  • The code now correctly handles back-references. This was necessary for my purpose after all. It wasn’t difficult to add, the regular expression used to capture the back-references just looks weird (and may actually be extremely inefficient, it looks NP-hard to me – but that’s only an intuition and only applies in weird edge cases). By the way, does anyone know a better way of checking for an uneven number of matches than my way? Negative lookbehinds won't work here because they only accept fixed-length strings instead of regular expressions. However, I need the regex here to test whether the preceeding backslash is actually escaped itself.

    Additionally, I don’t know how good PHP is at caching anonymous create_function use. Performance-wise, this might not be the best solution but it seems good enough.

  • I’ve fixed a bug in the sanity check.

  • I’ve removed the cancellation of obsolete modifiers since my tests show that it isn't necessary.

By the way, this code is one of the core components of a syntax highlighter for various languages that I’m working on in PHP since I’m not satisfied with the alternatives listed elsewhere.

Thanks!

porneL, eyelidlessness, amazing work! Many, many thanks. I had actually given up.

I've built upon your solution and I'd like to share it here. I didn't implement re-numbering back-references since this isn't relevant in my case (I think …). Perhaps this will become necessary later, though.

Some Questions …

One thing, @eyelidlessness: Why do you feel the necessity to cancel old modifiers? As far as I see it, this isn't necessary since the modifiers are only applied locally anyway. Ah yes, one other thing. Your escaping of the delimiter seems overly complicated. Care to explain why you think this is needed? I believe my version should work as well but I could be very wrong.

Also, I've changed the signature of your function to match my needs. I also thing that my version is more generally useful. Again, I might be wrong.

BTW, you should now realize the importance of real names on SO. ;-) I can't give you real credit in the code. :-/

The Code

Anyway, I'd like to share my result so far because I can't believe that nobody else ever needs something like that. The code seems to work very well. Extensive tests are yet to be done, though. Please comment!

And without further ado …

/**
 * Merges several regular expressions into one, using the indicated 'glue'.
 *
 * This function takes care of individual modifiers so it's safe to use
 * <em>different</em> modifiers on the individual expressions. The order of
 * sub-matches is preserved as well. Numbered back-references are adapted to
 * the new overall sub-match count. This means that it's safe to use numbered
 * back-refences in the individual expressions!
 * If {@link $names} is given, the individual expressions are captured in
 * named sub-matches using the contents of that array as names.
 * Matching pair-delimiters (e.g. <code>"{…}"</code>) are currently
 * <strong>not</strong> supported.
 *
 * The function assumes that all regular expressions are well-formed.
 * Behaviour is undefined if they aren't.
 *
 * This function was created after a {@link http://stackoverflow.com/questions/244959/
 * StackOverflow discussion}. Much of it was written or thought of by
 * “porneL” and “eyelidlessness”. Many thanks to both of them.
 *
 * @param string $glue  A string to insert between the individual expressions.
 *      This should usually be either the empty string, indicating
 *      concatenation, or the pipe (<code>|</code>), indicating alternation.
 *      Notice that this string might have to be escaped since it is treated
 *      like a normal character in a regular expression (i.e. <code>/</code>)
 *      will end the expression and result in an invalid output.
 * @param array $expressions    The expressions to merge. The expressions may
 *      have arbitrary different delimiters and modifiers.
 * @param array $names  Optional. This is either an empty array or an array of
 *      strings of the same length as {@link $expressions}. In that case,
 *      the strings of this array are used to create named sub-matches for the
 *      expressions.
 * @return string An string representing a regular expression equivalent to the
 *      merged expressions. Returns <code>FALSE</code> if an error occurred.
 */
function preg_merge($glue, array $expressions, array $names = array()) {
    // … then, a miracle occurs.

    // Sanity check …

    $use_names = ($names !== null and count($names) !== 0);

    if (
        $use_names and count($names) !== count($expressions) or
        !is_string($glue)
    )
        return false;

    $result = array();
    // For keeping track of the names for sub-matches.
    $names_count = 0;
    // For keeping track of *all* captures to re-adjust backreferences.
    $capture_count = 0;

    foreach ($expressions as $expression) {
        if ($use_names)
            $name = str_replace(' ', '_', $names[$names_count++]);

        // Get delimiters and modifiers:

        $stripped = preg_strip($expression);

        if ($stripped === false)
            return false;

        list($sub_expr, $modifiers) = $stripped;

        // Re-adjust backreferences:

        // We assume that the expression is correct and therefore don't check
        // for matching parentheses.

        $number_of_captures = preg_match_all('/\([^?]|\(\?[^:]/', $sub_expr, $_);

        if ($number_of_captures === false)
            return false;

        if ($number_of_captures > 0) {
            // NB: This looks NP-hard. Consider replacing.
            $backref_expr = '/
                (                # Only match when not escaped:
                    [^\\\\]      # guarantee an even number of backslashes
                    (\\\\*?)\\2  # (twice n, preceded by something else).
                )
                \\\\ (\d)        # Backslash followed by a digit.
            /x';
            $sub_expr = preg_replace_callback(
                $backref_expr,
                create_function(
                    '$m',
                    'return $m[1] . "\\\\" . ((int)$m[3] + ' . $capture_count . ');'
                ),
                $sub_expr
            );
            $capture_count += $number_of_captures;
        }

        // Last, construct the new sub-match:

        $modifiers = implode('', $modifiers);
        $sub_modifiers = "(?$modifiers)";
        if ($sub_modifiers === '(?)')
            $sub_modifiers = '';

        $sub_name = $use_names ? "?<$name>" : '?:';
        $new_expr = "($sub_name$sub_modifiers$sub_expr)";
        $result[] = $new_expr;
    }

    return '/' . implode($glue, $result) . '/';
}

/**
 * Strips a regular expression string off its delimiters and modifiers.
 * Additionally, normalize the delimiters (i.e. reformat the pattern so that
 * it could have used '/' as delimiter).
 *
 * @param string $expression The regular expression string to strip.
 * @return array An array whose first entry is the expression itself, the
 *      second an array of delimiters. If the argument is not a valid regular
 *      expression, returns <code>FALSE</code>.
 *
 */
function preg_strip($expression) {
    if (preg_match('/^(.)(.*)\\1([imsxeADSUXJu]*)$/s', $expression, $matches) !== 1)
        return false;

    $delim = $matches[1];
    $sub_expr = $matches[2];
    if ($delim !== '/') {
        // Replace occurrences by the escaped delimiter by its unescaped
        // version and escape new delimiter.
        $sub_expr = str_replace("\\$delim", $delim, $sub_expr);
        $sub_expr = str_replace('/', '\\/', $sub_expr);
    }
    $modifiers = $matches[3] === '' ? array() : str_split(trim($matches[3]));

    return array($sub_expr, $modifiers);
}

PS: I've made this posting community wiki editable. You know what this means …!

Konrad Rudolph
Hey, I'm famous! ;) See my edit, I think it addresses the modifier/cancel-modifier thing; and I'm only escaping the slash in a sub-expression if the original was not in fact a slash, which I see you expressed more clearly in your code.
eyelidlessness