I think the most idiosyncratic feature of Java regexes is lookbehinds. Java supports variable-length lookbehinds, but only if the maximum possible length can be determined ahead of time; it's the only flavor I know of that works this way. By comparison, the .NET and JGSoft (RegexBuddy, EditPad Pro, PowerGREP) flavors allow unlimited-length lookbehinds, while most others require them to be fixed-length. For example:
(?<=\w{3}) // all
(?<=\w+) // .NET and JGSoft only
(?<=\w{1,3}) // .NET, JGSoft and Java
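To make the difference between the fixed-length and the bounded form concrete, here is a minimal Java sketch. The class name LookbehindDemo and the helper firstMatchStart are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    // Hypothetical helper: index where the first match starts, or -1 if nothing matches.
    static int firstMatchStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String input = "ab-abc-";
        // Fixed-length: the hyphen must be preceded by exactly three word characters.
        System.out.println(firstMatchStart("(?<=\\w{3})-", input));   // prints 6
        // Bounded: one to three word characters will do, so the first hyphen qualifies.
        System.out.println(firstMatchStart("(?<=\\w{1,3})-", input)); // prints 2
    }
}
```

The first hyphen has only two characters in front of it, so the fixed-length version skips it, while the bounded version is satisfied by a single preceding word character.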
There are also the PCRE (PHP 5) and Oniguruma (Ruby 1.9) flavors, which allow alternatives of different lengths in a lookbehind, but only if each alternative is fixed-length, and only at the top level. So (?<=MILK|HONEY) is okay, but (?<=(MILK|HONEY)) isn't.
If you try to use an unlimited-length lookbehind in Java, it's supposed to throw an exception at compile time (that is, when you try to create the Pattern object). Unfortunately, it doesn't always do that, and the behavior of the resulting Pattern is undefined. A bug report has been filed against JDK 1.6, but the bug appears to have been present since JDK 1.4.
It's not as big a problem as you might think, though. For example, (?<=\w+) works as you would expect, even though it shouldn't compile. It's only when the lookbehind starts with an indeterminate component like \w+ and contains at least one other component besides (e.g., (?<=\w+:\w+)) that it fails to match valid inputs. If the initial component is determinate (e.g., (?<=:\w+)), a PatternSyntaxException is thrown at compile time, as it should be.
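If you want to see what your own JDK does with these patterns, a small probe like the one below may help. It is only a sketch: LookbehindProbe and compiles are names invented for this example, and I deliberately show no expected output for the unbounded patterns, because the result depends on the JDK version you run it on.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class LookbehindProbe {
    // Hypothetical helper: true if Pattern.compile accepts the regex, false if it throws.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "(?<=\\w{1,3})",   // bounded, documented as supported
            "(?<=\\w+)",       // unbounded, single indeterminate component
            "(?<=\\w+:\\w+)",  // unbounded, indeterminate component first
            "(?<=:\\w+)"       // unbounded, determinate component first
        };
        for (String regex : candidates) {
            System.out.println(regex + " -> " + (compiles(regex) ? "compiled" : "rejected"));
        }
    }
}
```

Note that "compiled" only tells you the Pattern object was created; for the unbounded cases it says nothing about whether the pattern then matches correctly.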