I have been banging my head against this for some time now: I want to capture all [a-z]+[0-9]? character sequences, excluding strings such as sin, cos, tan, etc. Having done my regex homework, I believe the following regex should work:

(?:(?!(sin|cos|tan)))\b[a-z]+[0-9]?

As you can see, I am using a negative lookahead along with alternation. The \b after the non-capturing group's closing parenthesis is critical to avoid matching the "in" of "sin", etc. The regex makes sense, and as a matter of fact I have tried it in RegexBuddy with Java as the target implementation and get the wanted result, but it doesn't work using Java's Matcher and Pattern objects! Any thoughts?

cheers

+3  A: 

The \b is in the wrong place. It would be looking for a word boundary that didn't have sin/cos/tan before it. But a boundary just after any of those would have a letter at the end, so it would have to be an end-of-word boundary, which it can't be if the next character is a-z.

Also, the negative lookahead would (if it worked) exclude strings like cost, which I'm not sure you want if you're just filtering out keywords.

I suggest:

\b(?!sin\b|cos\b|tan\b)[a-z]+[0-9]?\b

Or, more simply, you could just match \b[a-z]+[0-9]?\b and filter out the strings in the keyword list afterwards. You don't always have to do everything in regex.
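
A minimal Java sketch of both approaches, for comparison. The class name, method names, and sample input are made up for illustration; only the two regexes come from the text above. Note that the backslashes are doubled because the patterns are written as Java string literals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordFilter {
    // Variant 1: the keyword check lives inside the regex.
    static final Pattern WITH_LOOKAHEAD =
            Pattern.compile("\\b(?!sin\\b|cos\\b|tan\\b)[a-z]+[0-9]?\\b");

    // Variant 2: match every candidate, then filter in plain Java.
    static final Pattern PLAIN = Pattern.compile("\\b[a-z]+[0-9]?\\b");
    static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("sin", "cos", "tan"));

    static List<String> viaLookahead(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = WITH_LOOKAHEAD.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    static List<String> viaFilter(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = PLAIN.matcher(input);
        while (m.find()) {
            // Drop whole-word keywords after matching.
            if (!KEYWORDS.contains(m.group())) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "sin x1 cos cost tan y cos1";
        System.out.println(viaLookahead(input)); // [x1, cost, y, cos1]
        System.out.println(viaFilter(input));    // [x1, cost, y, cos1]
    }
}
```

Both variants keep cost and cos1, since only the exact words sin, cos and tan are treated as keywords here.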

bobince
Matches `cos1` but it should not (if I understood the requirement correctly).
Tomalak
@Tomalak: No, the negative lookahead is meant to match full words, not prefixes. If there were a trig function called `cos1`, it would be listed as such: `(?!(?:sin|cos1?|tan)\b)`
Alan Moore
Yeah, the requirements aren't wholly clear, but that was my guess.
bobince
@bobince: Thanks, you were right about the positioning of `\b`. Of course, the original regex would have matched most of what I wanted (although not completely correctly according to the requirements I described) if I hadn't forgotten to escape the `\b` for Java, i.e. `\\b`. Now I think how ridiculous `\\\\` will look when you want to include a literal `\` in the regex...
nvrs
Yeah, backslashes easily get out of hand in nested escaping contexts! It's a pity Java doesn't have the ‘raw strings’ some languages use to get around the problem. (Or regex literals like in JS, though I personally find that a bit ugly.)
bobince
@nvrs, The problem is solved, then? Have you considered marking this answer "accepted"? It improves on your regex in ways other than the escaping issue you mentioned.
Alan Moore
Yes, answer accepted. I didn't know I had to mark it as such.
nvrs
The difference between the regex in this answer and the regex in the question is not the positioning of the initial word boundary but the addition of the trailing word boundaries. The regexes `(?!lookahead)\b` and `\b(?!lookahead)` yield the same matches. Both `\b` and `(?!lookahead)` are zero-width, so they're attempted at the same position regardless of their order.
Jan Goyvaerts
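
The point about ordering can be checked with a small Java sketch. The class name, helper method, and sample input below are hypothetical; the two patterns are simplified forms of the regex from the question, differing only in where the leading `\b` sits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadOrder {
    // Lookahead first, then the word boundary (as in the question).
    static final Pattern LOOKAHEAD_FIRST =
            Pattern.compile("(?!sin|cos|tan)\\b[a-z]+[0-9]?");

    // Word boundary first, then the lookahead.
    static final Pattern BOUNDARY_FIRST =
            Pattern.compile("\\b(?!sin|cos|tan)[a-z]+[0-9]?");

    static List<String> findAll(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "sin cost x1 bla99";
        System.out.println(findAll(LOOKAHEAD_FIRST, input)); // [x1, bla9]
        System.out.println(findAll(BOUNDARY_FIRST, input));  // [x1, bla9]
    }
}
```

Both orders give the same result. With no `\b` inside the lookahead, cost is rejected as a whole, and with no trailing `\b`, bla99 yields the partial match bla9.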
+1  A: 

So you want [a-z]+[0-9]? (a sequence of at least one letter, optionally followed by a digit), unless that letter sequence resembles one of sin cos tan?

\b(?!(sin|cos|tan)(?=\d|\b))[a-z]+\d?\b

results:

cos   - no match
cosy  - full match
cos1  - no match
cosy1 - full match
bla9  - full match
bla99 - no match
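
For reference, a minimal Java sketch that reproduces the table above. The class and method names are made up for illustration, and "full match" is interpreted here as `Matcher.matches()`, which is an assumption. The backslashes are doubled for the Java string literal.

```java
import java.util.regex.Pattern;

class ResultsCheck {
    static final Pattern P =
            Pattern.compile("\\b(?!(sin|cos|tan)(?=\\d|\\b))[a-z]+\\d?\\b");

    // True only if the regex matches the entire input string.
    static boolean fullMatch(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] words = {"cos", "cosy", "cos1", "cosy1", "bla9", "bla99"};
        for (String w : words) {
            System.out.println(w + " - " + (fullMatch(w) ? "full match" : "no match"));
        }
    }
}
```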
Tomalak
Hi, thanks for replying, but I still don't get any matches. I see that, based on what I said, you added matches such as cosy etc., which is correct, but using `Pattern p = Pattern.compile("\b(?!(sin|cos|tan)(?=[^a-z]|\b))[a-z]+[0-9]?\b"); Matcher m = f.matcher(stringToMatch);` I get no matches at all!
nvrs
In Java strings backslashes need to be escaped. I have shown the pure regex. Of course you need to adapt it to the string escaping rules of your programming language yourself.
Tomalak
A: 

I forgot to escape the \b for Java, so \b should be \\b, and it now works. Cheers.
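
A small sketch of why the unescaped version finds nothing: in a Java string literal, `\b` is the backspace character (U+0008), so the regex engine never sees a word-boundary token. The class name and test strings are made up for illustration.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // "\b" is a single backspace character inside a Java string literal.
    static final Pattern UNESCAPED = Pattern.compile("\bcos\b");

    // "\\b" reaches the regex engine as \b, the word-boundary assertion.
    static final Pattern ESCAPED = Pattern.compile("\\bcos\\b");

    public static void main(String[] args) {
        System.out.println("\b".length());                         // 1
        System.out.println("\\b".length());                        // 2
        System.out.println(UNESCAPED.matcher("a cos b").find());   // false
        System.out.println(ESCAPED.matcher("a cos b").find());     // true
        // The unescaped pattern only matches literal backspaces around "cos".
        System.out.println(UNESCAPED.matcher("\bcos\b").find());   // true
    }
}
```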

nvrs
When posting regex questions, it's a good idea to include the regex exactly as it appears in your source code; `\bfoo\b` looks fine, but `"\bfoo\b"` is likely to raise questions, even from people who don't speak Java and aren't sure how its string literals work.
Alan Moore
Also, did you try having RegexBuddy generate the Java source code? (That's the "Use" tab, in case you don't know.) I've never liked auto-generated source code, but I sometimes use "Use" to remind myself about the escaping rules for languages I'm not fluent in.
Alan Moore