ansaurus

Question

Regex: How to match the first word after an expression

Answer 1

A:

ipsum\b(.*)\b

EDIT: depending on your regex implementation, this could be greedy and match everything after ipsum rather than just the next word.

ck 2009-02-13 14:53:15

That'll match the rest of the sentence.

cletus 2009-02-13 14:53:57

you have to make that ungreedy

tliff 2009-02-13 14:55:14

Actually it's not implementation dependent, or at least I've never come across a regex implementation that is non-greedy by default. Non-greedy is always a switch (at least in Perl, PHP, Java and .Net).

cletus 2009-02-13 14:56:17

@cletus: regex implementation can by definition include passing switches to the call to the regex function

ck 2009-02-13 15:05:52

Yes but they all default to greedy and you pass in switches to turn that off (although PHP has a switch to invert the behaviour of *? and +? to being greedy while * and + become non-greedy). Still, that's a switch from the default.

cletus 2009-02-13 15:11:25

indeed, it is a change from default :)

ck 2009-02-13 15:24:18

Even if you make it non-greedy, i.e. "ipsum\b(.*?)\b", it still won't work. The "(.*?)" will match nothing at all, because the second \b is already satisfied at the boundary right after 'ipsum'; the next word is never reached.

Alan Moore 2009-02-13 15:30:11
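
A quick way to check the greedy and lazy variants discussed above is to print what group 1 actually captures. This is only a sketch in Java; the class name, the helper firstCapture, and the sample sentence are assumptions made for illustration, not something from the original posts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answers.
class LazyGroupDemo {
    // Returns what group 1 captures on the first match, or null if nothing matches.
    static String firstCapture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "Lorem ipsum dolor sit amet"; // assumed sample text

        // Lazy: the group stays empty because the second \b already
        // succeeds at the boundary right after "ipsum".
        System.out.println("[" + firstCapture("ipsum\\b(.*?)\\b", s) + "]");

        // Greedy: the group runs on to the last word boundary in the input.
        System.out.println("[" + firstCapture("ipsum\\b(.*)\\b", s) + "]");
    }
}
```

Neither variant isolates the next word, which is why the later answers consume the separator explicitly.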

Answer 2

+1 A:

ipsum\b(\w*)

David Kemp 2009-02-13 14:54:19

That seems to only match ipsum.

Matthew Taylor 2009-02-13 14:56:51

I'd probably make that \b+(\w+) at least

cletus 2009-02-13 14:57:04

ipsum\b+(\w+) is not valid regex.

Matthew Taylor 2009-02-13 15:00:04

@Matthew Taylor: It depends on your platform. You didn't specify which platform/language you're using.

Ates Goral 2009-02-13 15:02:55

I see. I'm using Java regex on OS X.

Matthew Taylor 2009-02-13 15:18:11

\b+ matches one or more word boundaries, which makes no sense because a word boundary has zero length. Some flavors will ignore the + but others will reject it as an error. I think "ipsum\s+(\w+)" is what you're groping for.

Alan Moore 2009-02-13 15:22:45
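
For reference, here is roughly how the "ipsum\s+(\w+)" suggestion could look in Java. The class and method names are made up for this sketch, and Pattern.quote is used only so the keyword is treated as literal text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class WhitespaceThenWord {
    // Returns the first word that follows `keyword` and at least one
    // whitespace character, or null if there is no such word.
    static String wordAfter(String keyword, String input) {
        Pattern p = Pattern.compile(Pattern.quote(keyword) + "\\s+(\\w+)");
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(wordAfter("ipsum", "Lorem ipsum dolor sit amet"));
        // \s+ does not skip punctuation, so nothing is found here:
        System.out.println(wordAfter("ipsum", "Cras sed ipsum. Nunc a libero"));
    }
}
```

If punctuation between the two words should be skipped as well, \W+ in place of \s+ is the usual adjustment.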

Answer 3

+5 A:

This sounds like a job for lookbehinds, though you should be aware that not all regex flavors support them. In your example:

(?<=\bipsum\s)(\w+)

This will match any sequence of word characters (letters, digits and underscores) which follows "ipsum" as a whole word followed by a single whitespace character. Because it does not match "ipsum" itself, you don't need to worry about reinserting it in the case of, e.g., replacements.

As I said, though, some flavors (JavaScript before ES2018, for example) don't support lookbehind at all. Many others (most, in fact) only support "fixed width" lookbehinds, so you could use this example but not any of the variable-length repetition operators. (In other words, (?<=\b\w+\s+)(\w+) wouldn't work.)

Ben Blank 2009-02-13 15:01:49
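
To illustrate the replacement point, here is a small Java sketch. The class name, method name and sample text are assumptions for the example; Java accepts this lookbehind because its length is fixed.

```java
import java.util.regex.Matcher;

// Hypothetical demo class for the lookbehind approach.
class LookbehindReplace {
    // Replaces the word that follows "ipsum" plus one whitespace character.
    // The lookbehind is zero-width, so "ipsum" itself is left untouched.
    static String replaceWordAfterIpsum(String input, String replacement) {
        return input.replaceAll("(?<=\\bipsum\\s)\\w+",
                                Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(
            replaceWordAfterIpsum("Lorem ipsum dolor sit amet", "DOLOR"));
    }
}
```

With a capturing-group pattern instead, the replacement string would have to put "ipsum" and the separator back itself.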

beat me to it :)

annakata 2009-02-13 15:05:18

Lookbehinds tend to be pretty limited when it comes to using wildcards though.

cletus 2009-02-13 15:06:36

Lookbehinds might not even be necessary here. Depending on what 'I want to match' in the question refers to, see David Kemp's solution.

2009-02-13 15:38:57

zero-width tends to be what you want though, it's just that grouping is a trivial get out of jail card.

annakata 2009-02-13 20:57:14

Fixed width is a misleading term - it is more "max width", yes? In most cases it is possible to use a suitable limit, for example: (?<=\b\w{1,100}\s{1,100})

Peter Boughton 2009-02-13 20:57:45

@Peter — No, it really is *fixed* width. Try your regex there in Python; it throws an exception.

Ben Blank 2009-02-13 21:06:41

Answer 4

A:

Some of the other responders have suggested using a regex that doesn't depend on lookbehinds, but I think a complete, working example is needed to get the point across. The idea is that you match the whole sequence ("ipsum" plus the next word) in the normal way, then use a capturing group to isolate the part that interests you. For example:

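import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Class name is arbitrary; it only makes the snippet compile on its own.
public class FirstWordAfterIpsum {
  public static void main(String[] args) {
    String s = "Lorem ipsum dolor sit amet, consectetur " +
        "adipiscing elit. Nunc eu tellus vel nunc pretium " +
        "lacinia. Proin sed lorem. Cras sed ipsum. Nunc " +
        "a libero quis risus sollicitudin imperdiet.";

    // Match "ipsum", skip any non-word characters, then capture the next word.
    Pattern p = Pattern.compile("ipsum\\W+(\\w+)");
    Matcher m = p.matcher(s);
    while (m.find())
    {
      System.out.println(m.group(1)); // group 1 is the word after "ipsum"
    }
  }
}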

Note that this prints both "dolor" and "Nunc". To do that with the lookbehind version, you would have to do something hackish like:

Pattern p = Pattern.compile("(?<=ipsum\\W{1,2})(\\w+)");

That's in Java, which requires the lookbehind to have an obvious maximum length. Some flavors don't have even that much flexibility, and of course, some don't support lookbehinds at all.
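
As a rough side-by-side check, the capturing-group pattern and the bounded lookbehind can be run over the same text. The class name, the collect helper and the shortened sample sentence are assumptions for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison harness, not from the original answer.
class GroupVersusLookbehind {
    // Collects the given group from every match of `regex` in `input`.
    static List<String> collect(String regex, int group, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "Lorem ipsum dolor sit amet. Cras sed ipsum. Nunc a libero.";

        // Capturing group: the whole match includes "ipsum", group 1 is the word.
        System.out.println(collect("ipsum\\W+(\\w+)", 1, s));

        // Bounded lookbehind: the whole match (group 0) is the word itself.
        System.out.println(collect("(?<=ipsum\\W{1,2})\\w+", 0, s));
    }
}
```

Both calls should yield the same two words here; the difference is that the lookbehind version silently stops working once more than two non-word characters separate "ipsum" from the next word.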

However, the biggest problem people seem to be having in their examples is not with lookbehinds, but with word boundaries. Both David Kemp and ck seem to expect \b to match the space character following the 'm', but it doesn't; it matches the position (or boundary) between the 'm' and the space.

It's a common mistake, one I've even seen repeated in a few books and tutorials, but the word-boundary construct, \b, never matches any characters. It's a zero-width assertion, like lookarounds and anchors (^, $, \z, etc.), and what it matches is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.

Alan Moore 2009-02-13 20:49:29
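
The zero-width nature of \b is easy to see by listing where it matches. A minimal Java sketch follows, with a made-up class name and input; every match has the same start and end index, i.e. it consumes no characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for inspecting word-boundary positions.
class BoundaryPositions {
    // Returns the index of every position in `input` where \b matches.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            // m.start() == m.end() for each match: a position, not a character.
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Boundaries fall on either side of each word, never on the space itself.
        System.out.println(boundaries("ipsum dolor")); // [0, 5, 6, 11]
    }
}
```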
