Hello all,

I am hoping the regular expression experts can tell me why this is going wrong:

This regex:

$pattern = '/(?<percent>[0-9]{1,3}\.[0-9]{1,2})% of (?<filesize>.+) at/';

Should match this sort of string:

[download] 87.1% of 4.40M at 107.90k/s ETA 00:05 
[download] 89.0% of 4.40M at 107.88k/s ETA 00:04 
[download] 91.4% of 4.40M at 106.09k/s ETA 00:03 
[download] 92.9% of 4.40M at 105.55k/s ETA 00:03

Correct? Is there anything that could go wrong with that regex that will not get it to match with the above input? Full usage here:

while(!feof($handle))
{
    $progress = fread($handle, 8192);
    $pattern = '/(?<percent>[0-9]{1,3}\.[0-9]{1,2})% of (?<filesize>.+) at/';
    if (preg_match_all($pattern, $progress, $matches)) {
        // matched
    }
}

Could the amount of data being read by fread be affecting whether the regex works correctly?

I really need confirmation as I am trying to identify why it isn't working on a new server. This question is related to "Change in Server Permits script not to work". Can this be due to php.ini being different?

Thanks all

Update 2

I have made a test script to test the regex but even on its own it doesn't work??

<?php 

error_reporting(E_ALL);

echo 'Start';

$progress = "[download]75.1% of 4.40M at 115.10k/s ETA 00:09 [download] 77.2% of 4.40M at 112.36k/s ETA 00:09 [download] 78.6% of 4.40M at 111.41k/s ETA 00:08 [download] 80.3% of 4.40M at 110.80k/s ETA 00:07 [download] 82.3% of 4.40M at 110.30k/s ETA 00:07 [download] 84.3% of 4.40M at 108.33k/s ETA 00:06 [download] 85.7% of 4.40M at 107.62k/s ETA 00:05 [download] 87.5% of 4.40M at 107.21k/s ETA 00:05 [download] 89.5% of 4.40M at 105.10k/s ETA 00:04 [download] 90.7% of 4.40M at 106.45k/s ETA 00:03 [download] 93.2% of 4.40M at 104.92k/s ETA 00:02 [download] 94.8% of 4.40M at 104.40k/s ETA 00:02 [download] 96.5% of 4.40M at 102.47k/s ETA 00:01 [download] 97.7% of 4.40M at 103.48k/s ETA 00:01 [download] 100.0% of 4.40M at 103.15k/s ETA 00:00 [download] 100.0% of 4.40M at 103.16k/s ETA 00:00
";

$pattern = '/(?<percent>\d{1,3}\.\d{1,2})%\s+of\s+(?<filesize>[\d.]+[kBM]) at/';

if(preg_match_all($pattern, $progress, $matches)){
    echo 'match';
}

echo '<br>Done<br>';    

?>
+1  A: 

The regex seems okay to me.

However, there are some things I would improve:

  • whitespace with "\s+", instead of " "
  • numbers with "\d", not with "[0-9]" (same thing, it's just shorter)
  • filesize not with ".+", but with something more specific

This would be my version:

(?<percent>\d{1,3}\.\d{1,2})%\s+of\s+(?<filesize>[\d.]+[kBM])

Depending on how much you expect to get wrong number formats (I would guess: not very likely), you can shorten it to:

(?<percent>[\d.]+)%\s+of\s+(?<filesize>[\d.]+[kBM])
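
To make the trade-off between the two versions concrete, here is a minimal sketch. The malformed sample line is made up for illustration, and the groups are written as (?P<name>) on the assumption that some PCRE builds reject the shorter (?<name>) spelling.

```php
<?php
// Strict and shortened patterns from above, written with (?P<name>) groups.
$strict = '/(?P<percent>\d{1,3}\.\d{1,2})%\s+of\s+(?P<filesize>[\d.]+[kBM])/';
$loose  = '/(?P<percent>[\d.]+)%\s+of\s+(?P<filesize>[\d.]+[kBM])/';

// One line copied from the question and one made-up malformed line.
$good = '[download] 87.1% of 4.40M at 107.90k/s ETA 00:05';
$bad  = '[download] 1.2.3% of 4.40M at 107.90k/s ETA 00:05';

preg_match($strict, $good, $strictGood);
preg_match($loose,  $good, $looseGood);
preg_match($strict, $bad,  $strictBad);
preg_match($loose,  $bad,  $looseBad);

echo $strictGood['percent'], ' / ', $looseGood['percent'], "\n"; // 87.1 / 87.1
echo $strictBad['percent'],  ' / ', $looseBad['percent'],  "\n"; // 2.3 / 1.2.3
```

Neither version really validates the number: on the malformed line the strict pattern latches onto the trailing "2.3", while the shortened one takes "1.2.3" whole.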
Tomalak
+1  A: 

If your stream delivers more than 8 KB of data in one read, fread() will probably cut off the last line of the chunk, which will prevent that line from being matched. Try reading the stream one line at a time using fgets() instead.
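
A minimal sketch of the line-by-line approach, under some assumptions: the php://memory stream only stands in for the real handle, the progress lines are assumed to end in "\n", and the (?P<name>) spelling and \S+ for the file size are my choices rather than anything from the question.

```php
<?php
// Stand-in for the real process handle: an in-memory stream with two lines.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "[download] 87.1% of 4.40M at 107.90k/s ETA 00:05\n");
fwrite($handle, "[download] 89.0% of 4.40M at 107.88k/s ETA 00:04\n");
rewind($handle);

$pattern = '/(?P<percent>\d{1,3}\.\d{1,2})%\s+of\s+(?P<filesize>\S+)\s+at/';
$seen = [];

// fgets() returns one line per call and false at the end of the stream,
// so a progress line is never split across two reads.
while (($line = fgets($handle)) !== false) {
    if (preg_match($pattern, $line, $m)) {
        $seen[] = $m['percent'];
    }
}
fclose($handle);

print_r($seen); // contains '87.1' and '89.0'
```

If the program writing the progress uses carriage returns instead of newlines, fgets() will not split where you expect, so it is worth checking the raw output first.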

Emil H
+1  A: 

I would use fgets() to read line by line, since I assume you want to match per line. If you match per line, you don't need preg_match_all; preg_match is enough.

You only seem to have 1 decimal place in your percentage, but your pattern allows {1,2} digits?

jishi
A: 

Is there anything that could go wrong with that regex that will not get it to match with the above input?

Not that I can see, but there's something that does go wrong to make it match far too much: if you really don't have newlines, then this:

(?P<filesize>.+) at

can match greedily from the start to the last “ at” in the input. So if I match against the whole example input you posted, I get a <percent> of:

75.1

(good) and a filesize of:

4.40M at 115.10k/s ETA 00:09 [download] 77.2% of 4.40M at 112.36k/s ETA 00:09 [download] 78.6% of 4.40M at 111.41k/s ETA 00:08 [download] 80.3% of 4.40M at 110.80k/s ETA 00:07 [download] 82.3% of 4.40M at 110.30k/s ETA 00:07 [download] 84.3% of 4.40M at 108.33k/s ETA 00:06 [download] 85.7% of 4.40M at 107.62k/s ETA 00:05 [download] 87.5% of 4.40M at 107.21k/s ETA 00:05 [download] 89.5% of 4.40M at 105.10k/s ETA 00:04 [download] 90.7% of 4.40M at 106.45k/s ETA 00:03 [download] 93.2% of 4.40M at 104.92k/s ETA 00:02 [download] 94.8% of 4.40M at 104.40k/s ETA 00:02 [download] 96.5% of 4.40M at 102.47k/s ETA 00:01 [download] 97.7% of 4.40M at 103.48k/s ETA 00:01 [download] 100.0% of 4.40M at 103.15k/s ETA 00:00 [download] 100.0% of 4.40M

(not quite so good). To avoid this use the non-greedy match “.+?”, or a more specific expression like “[^ ]+” or Tomalak's version.
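
A small sketch of the difference, using two of the progress entries from the question joined on one line. The patterns use the (?P<name>) spelling as a precaution; otherwise they are the original pattern and the same pattern with ".+?".

```php
<?php
// Two progress entries on a single line, with no newline between them.
$input = '[download] 75.1% of 4.40M at 115.10k/s ETA 00:09 '
       . '[download] 77.2% of 4.40M at 112.36k/s ETA 00:09';

$greedy = '/(?P<percent>[0-9]{1,3}\.[0-9]{1,2})% of (?P<filesize>.+) at/';
$lazy   = '/(?P<percent>[0-9]{1,3}\.[0-9]{1,2})% of (?P<filesize>.+?) at/';

preg_match_all($greedy, $input, $g);
preg_match_all($lazy, $input, $l);

// Greedy: one match whose filesize runs up to the last " at" in the input.
echo count($g['filesize']), ': ', $g['filesize'][0], "\n";

// Lazy: one match per entry, each with the expected filesize.
echo count($l['filesize']), ': ', implode(', ', $l['filesize']), "\n";
```

The greedy version still reports a match, which is why this problem is easy to miss if you only test whether preg_match_all returned something.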

Could the amount of data being read by fread be affecting whether the regex works correctly?

Yes. Reading in chunks is quite unreliable: if a ‘[download]’ line is split over a chunk boundary, it will not match and will be lost. You can either:

  • not care, or
  • read the whole input at once, or
  • use line-based reading if there really are newlines in the input (there usually are), or
  • manage the buffer manually by keeping everything after the end of the final match found (the still-unmatched tail of the input) and appending the new incoming input to it.
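
The last option could be sketched roughly like this. The two chunks are hypothetical, chosen so that a progress entry is split across the boundary, and the pattern (with (?P<name>) groups and \S+ for the file size) is an assumption rather than the asker's exact regex.

```php
<?php
$pattern = '/(?P<percent>\d{1,3}\.\d{1,2})%\s+of\s+(?P<filesize>\S+)\s+at/';

// Hypothetical chunks: the second entry is split in the middle of "89.0%".
$chunks = [
    '[download] 87.1% of 4.40M at 107.90k/s ETA 00:05 [download] 89.',
    '0% of 4.40M at 107.88k/s ETA 00:04 ',
];

$buffer = '';
$percents = [];

foreach ($chunks as $chunk) {
    // Append the new input to whatever was left over from the last round.
    $buffer .= $chunk;

    if (preg_match_all($pattern, $buffer, $m, PREG_OFFSET_CAPTURE)) {
        foreach ($m['percent'] as $hit) {
            $percents[] = $hit[0]; // $hit is [matched text, offset]
        }
        // Keep only what follows the end of the final full match.
        $last = end($m[0]);
        $buffer = substr($buffer, $last[1] + strlen($last[0]));
    }
}

print_r($percents); // contains '87.1' and '89.0'
```

In a real loop the chunks would come from fread(), and you would probably want to cap the buffer size in case no match ever arrives.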

As for server differences, the only thing I can think of is that if one of the servers is Windows and the other is a *nix system, they will have different ideas of what a newline is, which might cause the “are there newlines or not?” confusion.

bobince
+5  A: 

I am not that familiar with named capture, but I think in PHP it should be:

$pattern = '/(?P<percent>[0-9]{1,3}\.[0-9]{1,2})% of (?P<filesize>.+) at/';

Notice the P after the question mark.
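
A quick way to check this on a given server is a minimal sketch like the one below, which runs the pattern with the P against one of the sample lines from the question. As far as I know, the PHP manual's changelog says the shorter (?<name>) form is only accepted from PHP 5.2.2 onwards, which would explain a difference between two servers, but treat that as something to verify against your own versions.

```php
<?php
// One of the sample lines from the question.
$line = '[download] 92.9% of 4.40M at 105.55k/s ETA 00:03';

// Same pattern as above, using the (?P<name>) spelling.
$pattern = '/(?P<percent>[0-9]{1,3}\.[0-9]{1,2})% of (?P<filesize>.+) at/';

if (preg_match($pattern, $line, $matches)) {
    // Named groups show up under their names as well as by number.
    echo $matches['percent'], "\n";  // 92.9
    echo $matches['filesize'], "\n"; // 4.40M
} else {
    echo "no match\n";
}
```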

Source:

jeroen
+1: I've just tested it myself and it works with the Ps but not without them.
Pourquoi Litytestdata
Genius. I am not even going to ask why my regex didn't have a P and it worked on my other server. It works perfectly now on my current server.
Abs
Wow - learn something new every day, but the ?P is not mentioned in the preg_match_all docs - http://nl.php.net/manual/en/function.preg-match-all.php - though it is elsewhere. Can somebody clarify?
Diogenes
I just remembered reading about it in Mastering Regular Expressions and looked it up when I saw the question. Don't know about the PHP documentation...
jeroen
It's listed in the PHP docs for Regular Expression Details: http://nl.php.net/manual/en/regexp.reference.php
sirlancelot