I am trying to write a regular expression for something like

s1 = I am at Boston at Downtown
s2 = I am at Miami

I am interested in the words after "at", e.g. Boston, Downtown, Miami.

I have not been successful in creating a regex for that. Something like

> .*? (at \w+)+.*

gives just Boston in s1 (Downtown is missed); it only matches the first "at". Any suggestions?

+7  A: 

Try this

 at\s+(\w+)

The complete code snippet would be

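import java.util.regex.Matcher;
import java.util.regex.Pattern;

// compile() takes one int flags argument, so flags are combined with |
Pattern myPattern = Pattern.compile("at\\s+(\\w+)", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher m = myPattern.matcher(yourString);

while (m.find()) {
  String word = m.group(1); // the word that follows "at"
}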
arclight
This. You're matching all of at[whitespace][word], but only the word is inside (), so group(1) returns just the word. Of course, you're going to have problems if you say "I am at the ball field", because this will match "at the" and return "the".
glowcoder
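
To make the "at the" caveat above concrete, here is a minimal sketch using the pattern from this answer. The class and method names (AtTheCaveat, firstWordAfterAt) are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AtTheCaveat {
    // Hypothetical helper: returns the first word following "at",
    // or null when the input contains no match.
    static String firstWordAfterAt(String input) {
        Matcher m = Pattern.compile("at\\s+(\\w+)", Pattern.CASE_INSENSITIVE)
                           .matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The regex has no notion of articles, so it captures "the".
        System.out.println(firstWordAfterAt("I am at the ball field"));
    }
}
```

Whether "the" counts as a wrong answer depends on the input you expect, so treat this as a check on your own data rather than a bug in the pattern.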
+1  A: 

You seem to expect (at \w+)+ to match both at Boston and at Downtown in the first string. That doesn't work because you don't allow for the space before the second at. You would need to change it to ( at \w+)+--or better, change that to a non-capturing group and use the capturing group for the part that really interests you:

Pattern p = Pattern.compile(".*?(?: at (\\w+))+.*");
String s1 = "I am at Boston at Downtown";
Matcher m = p.matcher(s1);
if (m.matches()) {
    System.out.println(m.group(1));
}

But now it only prints Downtown. That's because you're trying to use one capturing group to capture two substrings. The first time (?: at (\\w+))+ matches, it captures Boston; the second time, it discards Boston and captures Downtown instead.

There are some regex flavors that will let you retrieve intermediate captures (Boston in this example), but Java is not one of them. Your best option is probably to use find() instead of matches(), as @arclight suggested. That makes the regex simpler, too:

Pattern p = Pattern.compile("\\bat\\s+(\\w+)");
String s1 = "I am at Boston at Downtown";
Matcher m = p.matcher(s1);
while (m.find()) {
    System.out.println(m.group(1));
}

You don't have to match the space before at any more, but you probably want to use the \b (word boundary) to avoid partial-word matches (e.g., My cat is at Boston at Downtown). And it's usually a good idea to use \s+ instead of a literal space, in case there are multiple spaces, or the space is really a TAB or something.
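
As a rough illustration of the word-boundary point, the sketch below runs the pattern with and without \b on the "cat" sentence and collects the captures into a list. The class and helper names (AtWords, wordsAfterAt) are assumptions chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AtWords {
    // Hypothetical helper: collects group(1) from every match of p in input.
    static List<String> wordsAfterAt(Pattern p, String input) {
        List<String> words = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        String s = "My cat is at Boston at Downtown";
        Pattern loose = Pattern.compile("at\\s+(\\w+)");     // no word boundary
        Pattern strict = Pattern.compile("\\bat\\s+(\\w+)"); // "at" must start a word

        System.out.println(wordsAfterAt(loose, s));  // [is, Boston, Downtown]
        System.out.println(wordsAfterAt(strict, s)); // [Boston, Downtown]
    }
}
```

Without \b, the "at" inside "cat" counts as a match and "is" shows up in the results.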

Alan Moore
+1; I really wish that Java could do "intermediate captures" (is that what they're called?). I think C# does this. And it can do infinite lookbehind too, which I've now figured out how to (ab)use in Java quite reliably.
polygenelubricants
@Alan: also, this can still miss stuff like `"I'm at at Boston"`, but that's probably ok.
polygenelubricants
@poly, Good call. But it's easy enough to fix: `"\\bat\\s+((?!at\\b)\\w+)"`. And yes, the .NET regex flavor supports intermediate captures. The only other flavor I know of that supports them is Perl.
Alan Moore
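
For anyone who wants to try the lookahead variant from the last comment, here is a small sketch comparing it with the plain pattern on the "at at" input. The class and method names (AtLookahead, wordsAfterAt) are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AtLookahead {
    // Plain pattern from the answer above.
    static final Pattern PLAIN = Pattern.compile("\\bat\\s+(\\w+)");
    // Variant from the comment: the captured word must not itself be "at".
    static final Pattern NO_AT = Pattern.compile("\\bat\\s+((?!at\\b)\\w+)");

    // Hypothetical helper: collects group(1) from every match of p in input.
    static List<String> wordsAfterAt(Pattern p, String input) {
        List<String> words = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        String s = "I'm at at Boston";
        System.out.println(wordsAfterAt(PLAIN, s)); // [at]
        System.out.println(wordsAfterAt(NO_AT, s)); // [Boston]
    }
}
```

The plain pattern consumes both "at"s in one match, so Boston is never reached. The negative lookahead rejects that first attempt, and the next find() starts at the second "at".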