I have a string that may look something like this:

$r = 'Filed under: <a>Group1</a>, <a>Group2</a>';

Here is the regular expression I am using so far:

preg_match_all("/Filed under: (?:<a.*?>([\w|\d|\s]+?)<\/a>)+?/", $r, $matches);

I want the expression inside the () to keep matching, as designated by the +? at the end, but it just won't do it. ::sigh::

Any ideas? I know there has to be a way to do this in one regular expression instead of breaking it up.

+1  A: 

I want the expression inside the () to keep matching, as designated by the +? at the end.

+? is a lazy quantifier - it will match as few times as possible. In other words, just once.

If you want to match several times, you want a greedy quantifier - +.

Also note that your regex doesn't quite work - the match fails as soon as it encounters the comma between the tags, because you haven't accounted for it. That likely needs correcting.
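A minimal sketch of the difference between the two quantifiers. The tag pattern is simplified, and the optional `(?:, )?` separator is an assumption added for illustration, not part of the original regex. It also shows a further limitation: a repeated capturing group keeps only its last repetition, so the greedy version still does not return every group name.

```php
<?php
// Simplified input; the optional ", " separator below is an assumption for this sketch.
$r = 'Filed under: <a>Group1</a>, <a>Group2</a>';

// Lazy +? : the group repeats as few times as possible, i.e. once.
preg_match('/Filed under: (?:<a>(\w+)<\/a>(?:, )?)+?/', $r, $lazy);

// Greedy + : the group repeats as many times as possible.
preg_match('/Filed under: (?:<a>(\w+)<\/a>(?:, )?)+/', $r, $greedy);

echo $lazy[1], "\n";   // Group1
echo $greedy[0], "\n"; // Filed under: <a>Group1</a>, <a>Group2</a>
echo $greedy[1], "\n"; // Group2 (only the last repetition is kept)
```

So the greedy quantifier makes the whole match span both tags, but collecting each name separately still takes more than one repeated group.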

Anon.
Right, I have tried with just the + quantifier. That fails also. I did also think about the comma, but I'm afraid I don't know how to handle it, since the second or third match may or may not have a comma. I did, however, try this as my attempt: [code]preg_match_all("/Filed under: (?:<a.*?>([\w|\d|\s]+?)<\/a>.*?)+/", $r, $matches);[/code]
Senica Gonzalez
Hmmm, comments don't look very pretty.
Senica Gonzalez
@Senica: you can use backticks to format code in comments just like you can in questions and answers, but if the code is long or complex, you should edit your question and put it there instead. The code you included above was a bit much for a comment.
Alan Moore
But @Anon. is right: a reluctant quantifier at the end of a regex almost never makes sense. If your regex had been correct otherwise, that final `?` would have broken it.
Alan Moore
+1  A: 
$r = 'Filed under: <a>Group1</a>, <a>Group2</a>';
$s = explode("</a>",$r);
foreach ($s as $k){
    if ($k){
        $k=explode("<a>",$k);
        print "$k[1]\n";
    }
}

output

$ php test.php
Group1
Group2
ghostdog74
Sometimes RegExes really are the best way to do something....
SoapBox
Best or not is up to the individual. If it can be done without a complicated regex, then to me it's best, both for myself and for whoever maintains it.
ghostdog74
As I explained in a comment above, I can't use explode. For one, there are instances where there is no comma and there is only one Group. Two, while my example was simple, this is a complicated file, and the <a> tag is not that simple either. Three, I need the Filed under: marker, because using explode without it would most certainly return unwanted values.
Senica Gonzalez
+1  A: 

Try:

<?php

$r = 'Filed under: <a>Group1</a>, <a>Group2</a>, <a>Group3</a>, <a>Group4</a>';

if(preg_match_all("/<a.*?>([^<]*?)<\/a>/", $r, $matches)) {
    var_dump($matches[1]); 
}

?>

output:

array(4) {
  [0]=>
  string(6) "Group1"
  [1]=>
  string(6) "Group2"
  [2]=>
  string(6) "Group3"
  [3]=>
  string(6) "Group4"
}

EDIT:

Since you want to include the string 'Filed under' in the search to uniquely identify the match, you can try this. I'm not sure whether it can be done using a single call to preg_match:

// Since you want to match everything after 'Filed under'
if(preg_match("/Filed under:(.*)$/", $r, $matches)) {
    if(preg_match_all("/<a.*?>([^<]*?)<\/a>/", $matches[1], $matches)) {
        var_dump($matches[1]); 
    }
}
codaddict
Thanks, but I really need to use the "Filed under: " flag. While my example text was rudimentary, the actual file that I am parsing is quite complicated, and Filed under: is really the only unique identifier that I have to work with. Fortunately, it is at the end of the file, so I can match all the way to the end.
Senica Gonzalez
Close enough. :) Thanks.
Senica Gonzalez
+1  A: 

Just for fun here's a regex that will work with a single preg_match_all:

'%(?:Filed under:\s*+|\G</a>)[^<>]*+<a[^<>]*+>\K[^<>]+%'

Or, in a more readable format:

'%(?:
      Filed[ ]under:  # your sentinel string (in x mode a literal space must be written as [ ])
    |
      \G             # NEXT MATCH POSITION
      </a>           # an end tag
  )
  [^<>]*+          # some non-tag stuff
  <a[^<>]*+>       # an opening tag
  \K               # RESET MATCH START
  [^<>]+           # the contents of the tag
%x'

\G matches the position where the next match attempt would start, which is usually the spot where the previous successful match ended (but if the previous match was zero-length, it bumps ahead one more). That means the regex won't match a substring starting with </a> until after it has matched one starting with Filed under: at least once.

After the sentinel string or an end tag has been matched, [^<>]*+<a[^<>]*+> consumes everything up to and including the next start tag. Then \K spoofs the start position so the match (if there is one) appears to start after the <a> tag (it's like a positive lookbehind, but more flexible). Finally, [^<>]+ matches the tag's contents and brings the match position up to the end tag so \G can match.
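As a rough sketch, the one-line form of the regex can be run like this. The sample string, including the unrelated link before the sentinel and the href attributes, is made up for illustration. Because of \K, the tag contents land in the overall match, $matches[0], rather than in a capture group.

```php
<?php
// Hypothetical input: one unrelated link before the sentinel, two groups after it.
$r = 'Intro <a href="/x">Other</a>. Filed under: <a href="/g1">Group1</a>, <a href="/g2">Group2</a>';

// Each match starts either at the sentinel or where the previous match ended (\G).
$pattern = '%(?:Filed under:\s*+|\G</a>)[^<>]*+<a[^<>]*+>\K[^<>]+%';

preg_match_all($pattern, $r, $matches);

echo implode(', ', $matches[0]), "\n"; // Group1, Group2
```

The link before Filed under: is skipped because neither alternative can start a match there.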

But, as I said, this is just for fun. If you don't have to do the job in one regex, you're better off with a multi-step approach like the one @codaddict used; it's more readable, more flexible, and more maintainable.

\K reference
\G reference

EDIT: Although the references I gave are for the Perl docs, these features are supported by PHP, too--or, more accurately, by the PCRE lib. I think the Perl docs are a little better, but you can also read about this stuff in the PCRE manual.

Alan Moore