I wrote a pretty simple preg_match_all script in PHP:

$fileName = 'A_DATED_FILE_091410.txt';
$matches = array();
preg_match_all('/[0-9][0-9]/',$fileName,$matches);
print_r($matches);

My Expected Output:

$matches = array(
    [0] => array(
        [0] => 09,
        [1] => 91,
        [2] => 14,
        [3] => 41,
        [4] => 10
    )
)

What I got instead:

$matches = array(
    [0] => array(
        [0] => 09,
        [1] => 14,
        [2] => 10
    )
)

Now, in this particular use case this was preferable, but I'm wondering why it didn't match the other substrings? Also, is a regex possible that would give me my expected output, and if so, what is it?

+2  A: 

The search for the next match starts at the first character after the previous match. So when 09 is matched in 091410, the search for the next match starts at 1410.

Gumbo
+7  A: 

With a global regex (which is what preg_match_all uses), once a match is made, the regex engine continues searching the string from the end of the previous match.

In your case, the regular expression engine starts at the beginning of the string and advances until the 0, since that is the first character that matches [0-9]. It then looks at the next character (9), and since that matches the second [0-9], it takes 09 as a match. Because the end of the string has not been reached, the engine resumes from the position after that match (the 1), and the same process repeats, yielding 14 and then 10.
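
One way to see this is to ask preg_match_all for the offsets of its matches. The following is a minimal sketch using the PREG_OFFSET_CAPTURE flag on the filename from the question; the variable names $text and $start are only illustrative.

```php
<?php
$fileName = 'A_DATED_FILE_091410.txt';

// PREG_OFFSET_CAPTURE turns each match into array(matchedText, byteOffset)
preg_match_all('/[0-9][0-9]/', $fileName, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as $m) {
    list($text, $start) = $m;
    // each match begins exactly where the previous one ended
    echo $text, ' starts at offset ', $start, "\n";
}
```

The offsets come out as 13, 15 and 17: every match starts two characters after the previous one, so the pairs that straddle two matches (91 and 41) are never tried.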

See also: First Look at How a Regex Engine Works Internally


If you must get every 2 digit sequence, you can use preg_match and use offsets to determine where to start capturing from:

$fileName = 'A_DATED_FILE_091410.txt';
$allSequences = array();
$matches = array();
$offset = 0;

while (preg_match('/[0-9][0-9]/', $fileName, $matches, PREG_OFFSET_CAPTURE, $offset))
{
  list($match, $offset) = $matches[0];
  $allSequences[] = $match;
  $offset++; // since the match is 2 digits, we'll start the next match after the first
}

Note that the offset returned with the PREG_OFFSET_CAPTURE flag is the start of the match.
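
For reference, this is roughly what a single preg_match call with PREG_OFFSET_CAPTURE and a starting offset gives back. It is a small sketch on the same filename; the starting offset of 14 is chosen purely for illustration.

```php
<?php
$fileName = 'A_DATED_FILE_091410.txt';

// start searching at byte 14, i.e. at the "9" of "091410"
preg_match('/[0-9][0-9]/', $fileName, $matches, PREG_OFFSET_CAPTURE, 14);

// $matches[0] is array(matchedText, offsetOfMatchStart);
// the offset is counted from the start of the subject string, not from 14
list($match, $start) = $matches[0];
echo $match, ' at ', $start, "\n"; // 91 at 14
```

Since the reported offset is the start of the match, adding 1 to it (as the loop above does) resumes the search at the match's second digit.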


I've got another solution that will get five matches without having to use offsets, but I'm adding it here just for curiosity; I probably wouldn't use it myself in production code (it's a somewhat complex regex too). You can use a regex with a lookbehind that checks for a digit before the current position and captures that digit inside the lookbehind (lookaround groups themselves are non-capturing, so the explicit capturing group is needed):

(?<=([0-9]))[0-9]

Let's walk through this regex:

(?<=       # open a positive lookbehind
  (        # open a capturing group
    [0-9]  # match 0-9
  )        # close the capturing group
)          # close the lookbehind
[0-9]      # match 0-9

Because lookarounds are zero-width and do not move the regex position, this regular expression will match 5 times. The engine advances until the 9, because that is the first position that satisfies the lookbehind assertion. Since 9 matches [0-9], the engine takes 9 as the match (and because we're capturing inside the lookaround, it also captures the 0 as group 1). The engine then moves to the 1. Again the lookbehind succeeds and captures, so 1 becomes the full match and 9 the group 1 capture, and so on until the engine reaches the end of the string.

When we give this pattern to preg_match_all, we'll end up with an array that looks like (using the PREG_SET_ORDER flag to group capturing groups along with the full match):

Array
(
    [0] => Array
        (
            [0] => 9
            [1] => 0
        )

    [1] => Array
        (
            [0] => 1
            [1] => 9
        )

    [2] => Array
        (
            [0] => 4
            [1] => 1
        )

    [3] => Array
        (
            [0] => 1
            [1] => 4
        )

    [4] => Array
        (
            [0] => 0
            [1] => 1
        )

)

Note that each "match" has its digits out of order! This is because the capture group in the lookbehind becomes backreference 1 while the whole match is backreference 0. We can put it back together in the correct order though:

preg_match_all('/(?<=([0-9]))[0-9]/', $fileName, $matches, PREG_SET_ORDER);
$allSequences = array();
foreach ($matches as $match)
{
  $allSequences[] = $match[1] . $match[0];
}
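
A related variation, offered as a sketch rather than as part of the approach above: put the capturing group inside a lookahead instead of a lookbehind. The lookahead consumes nothing, so every full match is an empty string, and both digits land in group 1 already in the right order. The pattern (?=([0-9]{2})) is an illustrative assumption, not something taken from the question.

```php
<?php
$fileName = 'A_DATED_FILE_091410.txt';

// the lookahead is zero-width, so the engine only moves one character
// between attempts and overlapping pairs are all captured in group 1
preg_match_all('/(?=([0-9]{2}))/', $fileName, $matches);

$allSequences = $matches[1];
print_r($allSequences); // 09, 91, 14, 41, 10
```

Here $matches[0] holds five empty strings, so only $matches[1] is of interest.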
Daniel Vandersluis
So, if that's how php works with regexes, I take it that means getting an array of every consecutive 2 number pair like my expected output isn't possible then?
GSto
@GSto I've updated my answer.
Daniel Vandersluis
+1  A: 

Also, is a regex possible that would give me my expected output, and if so, what is it?

A pattern that consumes both digits won't work, because the engine never matches the same section twice. But you could do something like this:

$i = 0;
while (preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE, $i))
{
  // $matches[0][1] is the offset at which the whole match starts
  $i = $matches[0][1] + 1; // step past the match start so the loop always advances
}

The above is a sketch, not a solution for the general case. Without the + 1 the search would restart at the same match and loop forever. Also, depending on the pattern, you may not want [0][1] but the offset of a capture group instead, such as [1][1].

For this particular case, I think it would be much simpler to do it yourself:

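$s = 'A_DATED_FILE_091410.txt';
$pairs = array();
$l = strlen($s);
$prev_digit = false;
for ($i = 0; $i < $l; ++$i)
{
  if ($s[$i] >= '0' && $s[$i] <= '9')
  {
    // two digits in a row: record the pair that ends at $i
    if ($prev_digit) { $pairs[] = $s[$i - 1] . $s[$i]; }
    $prev_digit = true;
  }
  else
  {
    $prev_digit = false;
  }
}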
konforce
+1  A: 

Just for fun, another way to do it :

 <?php
 $fileName = 'A_DATED_FILE_091410.txt';
 $matches = array();
 preg_match_all('/(?<=([0-9]))[0-9]/',$fileName,$matches);
 $result = array();
 foreach($matches[1] as $i => $behind)
 {
     $result[] = $behind . $matches[0][$i];
 }
 print_r($result);
 ?>
VirtualBlackFox
+1 since you beat me to my second solution :)
Daniel Vandersluis
As you said in your post, that's far from production ready and a pretty big misuse of the lookbehind feature, but it works :D Nice to see I'm not the only one who thought of this solution, but your explanation of it is far better than my raw code dump.
VirtualBlackFox