ansaurus

Question

using regex to skip ahead all characters until a specific sequence of letters is found using negative lookahead

Answer 1

+1 A:

No lookahead/behind required:

/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/

I added the trailing `[^\]]*?\]` to check for a real tag end; it may be unnecessary.

Edit: added the \b before id, as otherwise the `id=` inside `guid=` could match in something like [keyword you-dont-want-this-guid=123123-132123-123 id=123]

$ php -r 'preg_match_all("/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/","[keyword stuff=otherstuff morestuff=stuff]",$matches);var_dump($matches);'
array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(42) "[keyword stuff=otherstuff morestuff=stuff]"
  }
  [1]=>
  array(1) {
    [0]=>
    string(0) ""
  }
}
$ php -r 'var_dump(preg_match_all("/\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]/","[keyword stuff=otherstuff id=123 morestuff=stuff]",$matches),$matches);'
int(1)
array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(49) "[keyword stuff=otherstuff id=123 morestuff=stuff]"
  }
  [1]=>
  array(1) {
    [0]=>
    string(3) "123"
  }
}

Wrikken 2010-07-20 00:50:19
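As a rough illustration of the `\b` edit in the answer above, the sketch below runs the same pattern with and without the word boundary. It is written in Python rather than PHP purely for convenience; these constructs behave the same way in Python's `re` module. The names `WITH_B`, `WITHOUT_B` and `first_id`, and the shortened test tag, are made up for this example.

```python
import re

# The pattern from the answer, unchanged.
WITH_B = re.compile(r"\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]")
# Hypothetical variant without \b, for comparison only.
WITHOUT_B = re.compile(r"\[keyword(?:[^\]]*?id=([0-9]+))?[^\]]*?\]")


def first_id(pattern, text):
    """Return the captured id of the first tag, or None if no id was captured."""
    m = pattern.search(text)
    return m.group(1) if m else None


tag = "[keyword guid=777 id=123]"
no_id = "[keyword stuff=otherstuff morestuff=stuff]"

print(first_id(WITH_B, tag))      # the real id attribute
print(first_id(WITHOUT_B, tag))   # the digits after "guid=" instead
print(first_id(WITH_B, no_id))    # tag still matches, but no id is captured
```

Note that Python reports an unmatched optional group as `None`, whereas the `preg_match_all` output above shows an empty string.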

I was thinking that was working, but after testing it, it seems id isn't optional and it needs to be.

phazei 2010-07-20 01:04:18

Oh, I didn't get that; will fix.

Wrikken 2010-07-20 01:05:54

Fixed (in a non-capturing subpattern).

Wrikken 2010-07-20 01:12:53

Tried it, but it doesn't get any matches on the id.

phazei 2010-07-20 01:12:57

Seriously? (I did 2 edits in quick succession 9 minutes ago, by the way; the first indeed did not work.) What string doesn't match? I entered 2 test strings, which seem to work here.

Wrikken 2010-07-20 01:22:37

Ah, sorry, I must have missed a \ or something. Just got home from work and tried it again, and it seems to hit right on :) Awesome, thanks! I'm not too sure I understand the first "[^]]*" and why it doesn't match until the last ]. I noticed that the ] in [^]] can really be any character that's not used.

phazei 2010-07-20 02:02:18

Be careful with that last remark: `[keyword ][keyword id=123]` will suddenly yield only 1 match instead of 2 if you don't use `[^\]]`. It doesn't match until the last `]` because it's ungreedy (the `?`), so it stops matching as soon as what comes after matches the next part, which is also why you couldn't just put the whole `\bid` etc. in a non-required subpattern of its own.

Wrikken 2010-07-20 02:14:26
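The warning in the last comment can be checked with a small sketch. It is in Python rather than PHP for convenience; the pattern semantics are the same here. `BRACKET_SAFE` is the pattern from the answer, while `ANY_CHAR` is a hypothetical variant that swaps `[^\]]` for `.`, which is the kind of substitution being warned against.

```python
import re

text = "[keyword ][keyword id=123]"

# Pattern from the answer: the lazy runs can never cross a closing bracket.
BRACKET_SAFE = re.compile(r"\[keyword(?:[^\]]*?\bid=([0-9]+))?[^\]]*?\]")
# Hypothetical variant: "." is allowed to run past "]" while looking for "id=".
ANY_CHAR = re.compile(r"\[keyword(?:.*?\bid=([0-9]+))?.*?\]")

safe = [m.group(0) for m in BRACKET_SAFE.finditer(text)]
loose = [m.group(0) for m in ANY_CHAR.finditer(text)]

print(safe)   # two separate tags
print(loose)  # a single match that swallows both tags
```

Laziness alone does not stop the `.` version: the optional group is still tried first, so `.*?` keeps growing across the first `]` until it finds `id=` in the second tag.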

Answer 2

+2 A:

You do not need look ahead / behind.

Since the question is tagged PHP, use preg_match_all() and store the match in $matches.

Here's how:

<?php

// Store the string. It is single-quoted in case it contains backslashes
// I didn't see.
$string = 'blah blah[keyword stuff=otherstuff id=123 morestuff=stuff]
           blah blah[otherkeyword stuff=otherstuff id=555 morestuff=stuff]
           blah blah[keyword stuff=otherstuff id=444 morestuff=stuff]';

// The pattern is '[keyword', followed by any characters other than ']',
// then a space and 'id='.
// The space before id is important, so you don't catch 'guid', etc.
// If '[keyword' is always at the beginning of a line, you can use
// '^\[keyword'
$pattern = '/\[keyword[^\]]* id=([0-9]+)/';

  // Find every single $pattern in $string and store it in $matches
preg_match_all($pattern, $string, $matches);

// The only tricky part you have to know is that each entire match is stored in
// $matches[0][x], and the part of the match in the parentheses, which is what
// you want, is stored in $matches[1][x]. The braces around the loop body are
// optional, since the body is a single statement.
foreach ($matches[1] as $value)
{
    echo $value . "<br/>";
}
?>

Output:

123
444

(555 is skipped, as it should be.)

PS

You can also use \b instead of a literal space if a tab might precede id. \b matches a word boundary; in this case, the beginning of the word id.

$pattern = '/\[keyword[^\]]*\bid=([0-9]+)/';

Peter Ajtai 2010-07-20 00:51:01
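The difference between the two patterns in this answer can be seen in a short sketch. It is in Python rather than PHP for convenience; the patterns themselves are copied unchanged. The sample text, including the tab-separated tag and the `guid` tag, is invented for illustration.

```python
import re

# Literal space before "id", as in the main answer.
SPACE = re.compile(r"\[keyword[^\]]* id=([0-9]+)")
# Word boundary before "id", as in the PS.
BOUNDARY = re.compile(r"\[keyword[^\]]*\bid=([0-9]+)")

text = (
    "blah [keyword stuff=x id=123 more=y]\n"
    "blah [otherkeyword stuff=x id=555]\n"
    "blah [keyword stuff=x\tid=444 more=y]\n"  # tab before id
    "blah [keyword guid=999]"                  # only a guid, no id
)

print(SPACE.findall(text))     # misses the tab-separated id
print(BOUNDARY.findall(text))  # finds it, and still ignores guid
```

One caveat worth keeping in mind: `\b` also matches after a hyphen or other punctuation, so an attribute such as `x-id=5` would be picked up by the `\b` version.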

That won't work, because I'm using preg_match_all on a large document that could have [otherkeyword id=324] which I can't match. Also, I have to match [keyword stuff=otherstuff] where there is no id.

phazei 2010-07-20 01:05:39

@phazei Edited my answer to show multiple matches and ignore otherkeyword.

Peter Ajtai 2010-07-20 01:22:09

Cool. You skipped everything after the id, though I need to keep that since I'm using it to replace the entire [keyword x=x] section, but that's no problem for me to change. I see that you fixed the biggest issue I was having the same way Wrikken did with [^]]* right after the keyword. Why does that work and not cause it to skip everything till the last "]"?

phazei 2010-07-20 02:07:31

I skipped everything after the ID, since you said, "I'm trying to pull the id #" and the stuff after the ID isn't the ID #. '[^\]]*\bid=' means any number of things that aren't a close square bracket followed by a whitespace and 'id='.... so it can't skip till the last ']' due to it having to look for '\bid='

Peter Ajtai 2010-07-20 02:51:29

@Peter, `\b` doesn't match whitespace; you're thinking of `\s`. See here for what `\b` really does: http://www.regular-expressions.info/wordboundaries.html

Alan Moore 2010-07-20 03:16:17

@Alan - Whoops, thanks for the correction.

Peter Ajtai 2010-07-20 04:06:23

Answer 3

A:

I think this is what you're getting at:

\[keyword(?:\s+(?!id\b)[A-Za-z]+=[^\]\s]+)*(?:\s+id=([0-9]+))?[^\]]*\]

(I'm assuming attribute names can only contain ASCII letters, while the values can contain any non-whitespace character except ].)

(?:\s+(?!id\b)[A-Za-z]+=[^\]\s]+)* matches any number of attribute=value pairs (and the whitespace preceding them), as long as the attribute name isn't id. The \b (word boundary) is there just in case there are attribute names that start with id, like idiocy. There's no need to put a \b in front of the attribute name this time, because you know any name it matches will be preceded by whitespace. But, as you've learned, the lookahead approach is overkill in this case.
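A minimal sketch of how the lookahead version behaves, written in Python rather than PHP for convenience (negative lookahead and `\b` work the same way there). The pattern is copied from above; the helper `tag_id` and the test tags are hypothetical.

```python
import re

LOOKAHEAD = re.compile(
    r"\[keyword"
    r"(?:\s+(?!id\b)[A-Za-z]+=[^\]\s]+)*"  # attribute=value pairs whose name is not exactly "id"
    r"(?:\s+id=([0-9]+))?"                 # the optional id attribute
    r"[^\]]*\]"                            # anything left, up to the closing bracket
)


def tag_id(text):
    """Return (whole match, captured id) for the first tag, or None if nothing matches."""
    m = LOOKAHEAD.search(text)
    return (m.group(0), m.group(1)) if m else None


print(tag_id("[keyword stuff=otherstuff id=123 morestuff=stuff]"))
print(tag_id("[keyword idiocy=high id=7]"))   # "idiocy" passes the lookahead
print(tag_id("[keyword stuff=otherstuff]"))   # matches, no id captured
```

The `idiocy` case shows why the `\b` inside the lookahead matters: without it, `(?!id)` would reject any attribute name that merely starts with `id`.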

Now, about this:

[A-z0-9 =]

That A-z is either a typo or an error. If you're expecting it to match all uppercase and lowercase letters, well, it does. But it also matches

'[', ']', '^', '_', '`' and '\'

...because their code points lie between those of the uppercase letters and the lowercase letters. ASCII letters, that is.

Alan Moore 2010-07-20 04:37:42
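The point about `A-z` can be verified directly from the ASCII code points. The sketch below is in Python for convenience; the names `extras`, `wide` and `narrow` are made up for this example.

```python
import re

# Characters whose code points lie strictly between 'Z' (90) and 'a' (97).
extras = [chr(c) for c in range(ord("Z") + 1, ord("a"))]

wide = re.compile(r"[A-z]")       # the range from the question
narrow = re.compile(r"[A-Za-z]")  # the range that was probably intended

for ch in extras:
    # Each extra character is accepted by [A-z] but rejected by [A-Za-z].
    print(repr(ch), bool(wide.fullmatch(ch)), bool(narrow.fullmatch(ch)))
```

Writing the two ranges out separately, as `[A-Za-z]`, avoids accidentally letting `[`, `]` and friends through, which matters a great deal in a pattern that is trying to stop at `]`.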
