I'm alright with basic regular expressions, but I get a bit lost around pos/neg look aheads/behinds.

I'm trying to pull the id # from this:

[keyword stuff=otherstuff id=123 morestuff=stuff]

There could be unlimited amounts of "stuff" before or after. I've been using The Regex Coach to help debug what I've tried, but I'm not moving forward anymore...

So far I have this:

\[keyword (?:id=([0-9]+))?[^\]]*\]

Which takes care of any extra attributes after the id, but I can't figure out how to ignore everything between keyword and id. I know I can't use [^id]*. I believe I need a negative lookahead like (?!id)*, but I guess since it's zero-width, it doesn't move forward from there. This doesn't work either:

\[keyword[A-z0-9 =]*(?!id)(?:id=([0-9]+))?[^\]]*\]

I've been looking all over for examples, but haven't found any. Or perhaps I have, but they went so far over my head I didn't even realize what they were.

Help! Thanks.

EDIT: It has to match [keyword stuff=otherstuff] as well, where id= doesn't exist at all, so I have to have a 1 or 0 on the id # group. There are also other [otherkeywords id=32] which I do not want to match. The pattern needs to match multiple [keyword id=3] tags throughout the document using preg_match_all.

+1  A: 

No lookahead/behind required:

/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/

Added the ending '[^]]*]' to check for a real tag end; it could be unnecessary.

Edit: added the \b before id, as otherwise it could match the id= inside guid= in [keyword you-dont-want-this-guid=123123-132123-123 id=123]

$ php -r 'preg_match_all("/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/","[keyword stuff=otherstuff morestuff=stuff]",$matches);var_dump($matches);'
array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(42) "[keyword stuff=otherstuff morestuff=stuff]"
  }
  [1]=>
  array(1) {
    [0]=>
    string(0) ""
  }
}
$ php -r 'var_dump(preg_match_all("/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/","[keyword stuff=otherstuff id=123 morestuff=stuff]",$matches),$matches);'
int(1)
array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(49) "[keyword stuff=otherstuff id=123 morestuff=stuff]"
  }
  [1]=>
  array(1) {
    [0]=>
    string(3) "123"
  }
}
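
To make the effect of the \b concrete, here is a small sketch that runs the pattern above with and without the word boundary on the guid tag from the edit note. The variable names are only for illustration.

```php
<?php
// Tag from the edit note: "guid=" contains the letters "id=".
$tag = '[keyword you-dont-want-this-guid=123123-132123-123 id=123]';

// Same pattern as above, once with and once without \b before "id".
$withBoundary    = '/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/';
$withoutBoundary = '/\[keyword(?:[^\]]*?id=([0-9]+))?[^\]]*?\]/';

preg_match($withBoundary, $tag, $m1);
preg_match($withoutBoundary, $tag, $m2);

echo $m1[1], "\n"; // 123    (the real id attribute)
echo $m2[1], "\n"; // 123123 (digits picked up from inside guid=)
```

Without the \b, the lazy [^\]]*? stops at the first place where "id=" fits, which is the tail of "guid=".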
Wrikken
I was thinking that was working, but after testing it, it seems id isn't optional and it needs to be.
phazei
Oh, I did not get that; will fix.
Wrikken
Fixed (in a non-capturing subpattern)
Wrikken
Tried it, but it doesn't get any matches on the id.
phazei
Seriously? (I did 2 edits in quick succession 9 mins ago, by the way; the first indeed did not work.) What string doesn't match? I entered 2 test strings which seem to work here.
Wrikken
Ah, sorry, I must have missed a \ or something. Just got home from work and tried it again, and it seems to hit right on :) Awesome, thanks! I'm not too sure I understand the first "[^]]*" and why it doesn't match until the last ]. I noticed that the ] in ^] can really be any character that's not used.
phazei
Be careful with that last remark: `[keyword ][keyword id=123]` will suddenly have only 1 match instead of 2 if you don't use [^\]]. It doesn't match until the last `]` because it's ungreedy (the `?`), so it stops matching as soon as what comes after matches the next part, which is also why you couldn't just put the whole \bid etc. in a non-required subpattern of its own.
Wrikken
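
The warning in the comment above can be checked with a short sketch. It compares the original character class with a variant that excludes some unused character instead of ]. The choice of # here is purely hypothetical.

```php
<?php
$text = '[keyword ][keyword id=123]';

// Original class: anything except "]".
$strict = '/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/';
// Hypothetical variant: "#" stands in for "any character that's not used".
$loose  = '/\[keyword(?:[^#]*?\bid=([0-9]+))?[^#]*?\]/';

$strictCount = preg_match_all($strict, $text, $strictMatches);
$looseCount  = preg_match_all($loose, $text, $looseMatches);

echo $strictCount, "\n";        // 2: "[keyword ]" and "[keyword id=123]"
echo $looseCount, "\n";         // 1: both tags are swallowed as one match
echo $looseMatches[0][0], "\n"; // [keyword ][keyword id=123]
```

With [^#], nothing stops the lazy quantifier from walking past the first ] while it hunts for id=, so the id of the second tag ends up attached to the first.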
+2  A: 

You do not need look ahead / behind.

Since the question is tagged PHP, use preg_match_all() and store the match in $matches.

Here's how:

<?php

  // Store the string. I single quote, in case there are backslashes I
  // didn't see.
$string = 'blah blah[keyword stuff=otherstuff id=123 morestuff=stuff]
           blah blah[otherkeyword stuff=otherstuff id=555 morestuff=stuff]
           blah blah[keyword stuff=otherstuff id=444 morestuff=stuff]';

  // The pattern is '[keyword', followed by any number of characters that
  // are not ']', then a space and 'id=' with its digits captured.
  // The space before id is important, so you don't catch 'guid', etc.
  // If '[keyword' is always at the beginning of a line, you can use
  // '^\[keyword'
$pattern = '/\[keyword[^\]]* id=([0-9]+)/';

  // Find every single $pattern in $string and store it in $matches
preg_match_all($pattern, $string, $matches);

  // The only tricky part is that each entire match is stored in
  // $matches[0][x], and the part of the match in the parentheses, which is
  // what you want, is stored in $matches[1][x]. The braces are optional,
  // since the loop body is a single statement.
foreach($matches[1] as $value)
{     
    echo $value . "<br/>";
}
?>

Output:

123
444   

(555 is skipped, as it should be.)

PS

You can also use \b instead of a literal space if there could be a tab instead. \b represents a word boundary... in this case the beginning of a word.

$pattern = '/\[keyword[^\]]*\bid=([0-9]+)/';
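
As a quick illustration of the tab case, the sketch below runs both patterns on a made-up tag in which a tab, not a space, precedes id. The sample tag and variable names are assumptions for the demo.

```php
<?php
// Double quotes so that \t becomes a real tab character before "id".
$tag = "[keyword stuff=otherstuff\tid=123]";

$spacePattern    = '/\[keyword[^\]]* id=([0-9]+)/';
$boundaryPattern = '/\[keyword[^\]]*\bid=([0-9]+)/';

$spaceHit    = preg_match($spacePattern, $tag, $spaceMatch);
$boundaryHit = preg_match($boundaryPattern, $tag, $boundaryMatch);

echo $spaceHit, "\n";         // 0: there is no literal space before id
echo $boundaryHit, "\n";      // 1
echo $boundaryMatch[1], "\n"; // 123
```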
Peter Ajtai
That won't work, because I'm using preg_match_all on a large document that could have [otherkeyword id=324] which I can't match. Also, I have to match [keyword stuff=otherstuff] where there is no id.
phazei
@phazei Edited my answer to show multiple matches and ignore otherkeyword.
Peter Ajtai
Cool. You skipped everything after the id, though I need to keep that since I'm using it to replace the entire [keyword x=x] section, but that's no problem for me to change. I see that you fixed the biggest issue I was having the same way Wrikken did with [^]]* right after the keyword. Why does that work and not cause it to skip everything till the last "]"?
phazei
I skipped everything after the ID, since you said, "I'm trying to pull the id #" and the stuff after the ID isn't the ID #. '[^\]]*\bid=' means any number of things that aren't a close square bracket followed by a whitespace and 'id='.... so it can't skip till the last ']' due to it having to look for '\bid='
Peter Ajtai
@Peter, `\b` doesn't match whitespace; you're thinking of `\s`. See here for what `\b` really does: http://www.regular-expressions.info/wordboundaries.html
Alan Moore
@Alan - Whoops, thanks for the correction.
Peter Ajtai
A: 

I think this is what you're getting at:

\[keyword(?:\s+(?!id\b)[A-Za-z]+=[^\]\s]+)*(?:\s+id=([0-9]+))?[^\]]*\]

(I'm assuming attribute names can only contain ASCII letters, while the values can contain any non-whitespace character except ].)

(?:\s+(?!id\b)[A-Za-z]+=[^\]\s]+)* matches any number of attribute=value pairs (and the whitespace preceding them), as long as the attribute name isn't id. The \b (word boundary) is there just in case there are attribute names that start with id, like idiocy. There's no need to put a \b in front of the attribute name this time, because you know any name it matches will be preceded by whitespace. But, as you've learned, the lookahead approach is overkill in this case.
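
Here is a minimal sketch of that pattern used with preg_match_all, assuming a made-up document that covers the cases from the question: an id in the middle, a different keyword, no id at all, and an attribute name that merely starts with id.

```php
<?php
$text = '[keyword stuff=otherstuff id=123 morestuff=stuff] '
      . '[otherkeyword id=32] '
      . '[keyword stuff=otherstuff] '
      . '[keyword idiocy=9 id=5]';

// The pattern from above, wrapped in / delimiters.
$pattern = '/\[keyword(?:\s+(?!id\b)[A-Za-z]+=[^\]\s]+)*(?:\s+id=([0-9]+))?[^\]]*\]/';

$count = preg_match_all($pattern, $text, $matches);

echo $count, "\n";     // 3: the [otherkeyword ...] tag is not matched
print_r($matches[0]);  // the three whole [keyword ...] tags
print_r($matches[1]);  // "123", "" (no id present), "5"
```

When the id is absent, the capture group comes back as an empty string, so the whole tag is still available in $matches[0] for a replacement.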

Now, about this:

[A-z0-9 =]

That A-z is either a typo or an error. If you're expecting it to match all uppercase and lowercase letters, well, it does. But it also matches

'[', ']', '^', '_', '`' and '\'

...because their code points lie between those of the uppercase letters and the lowercase letters. ASCII letters, that is.
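
A small sketch to confirm this: it tests each of the six in-between characters against [A-z] and against [A-Za-z]. The array and variable names are just for the demo.

```php
<?php
// The six ASCII characters that sit between 'Z' (90) and 'a' (97).
$between = ['[', '\\', ']', '^', '_', '`'];

$looseHits  = [];
$strictHits = [];
foreach ($between as $char) {
    $looseHits[]  = preg_match('/^[A-z]$/', $char);    // range Z..a included
    $strictHits[] = preg_match('/^[A-Za-z]$/', $char); // letters only
    printf("%s (code %d)\n", $char, ord($char));
}

echo array_sum($looseHits), "\n";  // 6: [A-z] accepts every one of them
echo array_sum($strictHits), "\n"; // 0: [A-Za-z] rejects them all
```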

Alan Moore