I am working on some regex and I wonder why this regex

"(?<=(.*?id(( *)=)\\s[\"\']))g"

does not match the string

<input id = "g" />

in Java?

Thanks

+6  A: 

java.util.regex does not support unbounded lookbehind, as described in the RegexBuddy tutorial:

The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.

To add a little clarification from the documentation:

Therefore, many regex flavors, including those used by Perl and Python, only allow fixed-length strings. You can use any regex of which the length of the match can be predetermined. This means you can use literal text and character classes. You cannot use repetition or optional items. You can use alternation, but only if all options in the alternation have the same length.

Some regex flavors, like PCRE and Java support the above, plus alternation with strings of different lengths. Each part of the alternation must still have a finite maximum length. This means you can still not use the star or plus, but you can use the question mark and the curly braces with the max parameter specified. These regex flavors recognize the fact that finite repetition can be rewritten as an alternation of strings with different, but fixed lengths. Unfortunately, the JDK 1.4 and 1.5 have some bugs when you use alternation inside lookbehind. These were fixed in JDK 1.6.
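To make the "finite maximum length" point concrete, here is a minimal sketch of a lookbehind that java.util.regex should accept, because every repetition inside it has an upper bound. The bound of 10 whitespace characters is an arbitrary assumption, and the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedLookbehindDemo {
    // \s{0,10} has a fixed maximum, unlike \s* or .*?, so the engine can
    // work out how far back the lookbehind may need to reach.
    // The limit of 10 is an assumption chosen only for this example.
    private static final Pattern ID_VALUE =
            Pattern.compile("(?<=\\bid\\s{0,10}=\\s{0,10}[\"'])[^\"']*");

    // Returns the text right after id="..." or id='...', or null if absent.
    static String extractId(String source) {
        Matcher m = ID_VALUE.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extractId("<input id = \"g\" />"));
    }
}
```

Replacing either {0,10} with * brings back the unbounded case that this answer describes as unsupported.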

Mike
That text is from an older version of the tutorial, and it's very poorly worded. The updated version at his website is much clearer: http://www.regular-expressions.info/lookaround.html PCRE accepts alternatives in which every alternative is fixed-length but not necessarily all the *same* length. Everything else in that paragraph applies to Java alone.
Alan Moore
A: 

java.util.regex doesn't support infinite repetition inside lookbehind.

highlycaffeinated
+2  A: 

So a couple of people have explained why your regexp is not working (and it's fatal really; Java regular expressions can't do what you need). However, you might be wondering how you should now parse this ...

It looks like the string you're trying to parse is XML. Regex is really not a good approach to parsing XML; there is a mismatch between what can be encoded in XML and what can be matched using regular expressions. So if this is part of some XML text, maybe consider slurping it into an XML parser that you can then query for the different elements.

For a calm and reasonable discussion of this issue, see this classic Stack Overflow post: RegEx match open tags except XHTML self-contained tags.
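As a rough sketch of the parser route, assuming the input really is well-formed XML (which the single element in the question happens to be), the JDK's built-in DOM parser can read the attribute directly. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XmlAttributeDemo {
    // Parses the string as XML and returns the root element's "id" attribute.
    // getAttribute returns an empty string when the attribute is missing.
    static String readId(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("id");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readId("<input id = \"g\" />"));
    }
}
```

Real-world HTML is often not well-formed XML, so for that case a lenient HTML parser would be the better fit; this sketch only covers the strict XML case.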

Hope this helps!

Irish Buffer
+2  A: 

Not only does Java not allow unbounded lookbehind, it's supposed to throw an exception if you try. The fact that you're not seeing that exception is itself a bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6695369

You shouldn't be using lookbehind for that anyway. If you want to match the value of a certain attribute, the easiest, least troublesome approach is to match the whole attribute and use a capturing group to extract the value. For example:

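import java.util.regex.Matcher;
import java.util.regex.Pattern;

String source = "<input id = \"g\" />";
// Match the whole attribute; group 1 captures the value between the quotes.
Pattern p = Pattern.compile("\\bid\\s*=\\s*\"([^\"]*)\"");
Matcher m = p.matcher(source);
if (m.find())
{
  System.out.printf("Found 'id' attribute '%s' at position %d%n",
                    m.group(1), m.start());
}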

output:

Found 'id' attribute 'g' at position 7
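The pattern in the question also allowed a single-quoted value. One way to cover both quote styles, offered as a sketch rather than as part of the approach above, is to capture the opening quote and require the same character to close the value with a backreference. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedAttributeDemo {
    // Group 1 is the opening quote, group 2 the value;
    // \1 requires the same quote character to close the value.
    private static final Pattern ID_ATTR =
            Pattern.compile("\\bid\\s*=\\s*([\"'])(.*?)\\1");

    // Returns the value of the first id attribute, or null if none is found.
    static String extractId(String source) {
        Matcher m = ID_ATTR.matcher(source);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractId("<input id = \"g\" />"));
        System.out.println(extractId("<input id = 'g' />"));
    }
}
```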

Do yourself a favor and forget about lookbehinds for a while. They're tricky even when they're not buggy, and they're really not as useful as you might expect.

Alan Moore