views:

203

answers:

6

I have a few regular expressions which are run against very long strings. However, the only part of the string which concerns the RE is near the beginning. Most of the REs are similar to:

\\s+?(\\w+?).*

The REs capture a few groups near the start, and don't care what the rest of the string is. For performance reasons, is there a way to have the RE engine avoid looking at all the characters consumed by the terminating .*?

Note: The application with the REs is written using the java.regex classes.

Edit: For example I have the following RE:

.*?id="number"[^>]*?>([^<]+?).*

Which is run against large HTML files which are stored as StringBuilders. The tag with id="number" is always near the start of the HTML file.

+2  A: 

Why not just take out .*, you dont need it.

^\\s+?(\\w+?)
Michael
That does not work for me. Removing the `.*` causing the matching to fail.
Ben S
if \\s+?(\\w+?).* the whole regexp? or is it part of a bigger expression?
Michael
Ben, what are you trying to match and on what input does taking the .* out fail, using what specific java code?
Peter Recore
i guess since you've chosen greg's answer that you were using matches, which would explain why michael's answer didn't work for you.
Peter Recore
Yup, once I switched over to `find()` removing the `.*` was possible.
Ben S
+6  A: 

When using the java.util.regex classes, there are a number of ways to match against a given string. Matcher.matches always matches against the whole input string. Matcher.find looks for something matching your regular expression somewhere within the input string. Finally, Matcher.lookingAt matches your regular expression against the beginning of your input string.

If you are using Matcher.matches you may require the .* at the end to match the whole string. However, you might be better off using one of the other methods instead, which would allow you to leave off the .*. It sounds like Matcher.lookingAt may be appropriate for your purposes.

Greg Hewgill
I was using `Matcher.matches` when I wanted the behavior of `Matcher.find`, thanks!
Ben S
A: 

If you are dealing with HTML, regular expressions are not the right tool to do the analysis unless you have 100% control over the data files. It will eventually break.

It looks to me that you need the content of the tag which has id="number" and apparently more too. Lenient parsers exist allowing XSLT transformation on HTML input, which might be exactly what you need. I will look it up if you are interested.

Thorbjørn Ravn Andersen
Unfortunately, unless you have control over your data files, there is NO right tool for parsing HTML. The world is full of HTML files that break the various spec, and will trip up any of the regular Java HTML parsers. For example, an XSLT transform will fail if the HTML is not readable as (well-formed) XML.
Stephen C
Using regular expressions was a design decision we made. The regular expressions allow us to be a little fuzzy in how we find the data. Whereas HTML-specific solutions typically expect the markup to have specific layout with exact IDs and a known element hierarchy.
Ben S
@Stephen C, when I say "Lenient parsers exists allowing XSLT with HTML input" I mean that HTML parsers actually exists which either builds a DOM tree or a SAX stream. I recall having read that the HTML parser in Swing can be tweaked to do this. If I was referring to well-formed XHTML I would have said so :-)
Thorbjørn Ravn Andersen
@Ben S, XPath expressions allow you to easily select the text of the node which has an attribute named "id" with the value "number".
Thorbjørn Ravn Andersen
A: 

There's a great library for dealing with HTML files--including badly formed, real-world ones: BeautifulSoup http://www.crummy.com/software/BeautifulSoup/

It would be very easy to find your id= tag with this library

ראובן
A: 

In this case particular case, the simple answer was to use 'find' rather than 'matches'. But if that doesn't work for you, the Java Pattern class supports regexes with so-called possessive quantifiers that can be used to prevent backtracking.

Possessive quantifiers are the 3rd alternative to greedy and reluctant quantifiers. The syntax is in Java is 'X?+' or 'X*+' or 'X++'. Possessive quantifiers match as many characters as possible (like greedy quantifiers), but if the rest of the pattern does not match the possessive quantifier fails instead of backing off. (Sort of like a "cut" in Prolog.)

But beware that using a possessive quantifier instead of a greedy or reluctant one will change the meaning of your pattern.

There is tutorial information about possessive quantifiers at this page.

Stephen C
+1  A: 
.*?id="number"[^>]*?>([^<]+?).*

Is that really the regex you're using? The reason I ask is because ([^<]+?) will always match exactly one character, as if you had written ([^<]) instead. The + quantifier has to match at least once, but because it's reluctant it immediately hands off to the next part - .* - which always succeeds. Removing the .* and switching to find() or lookingAt() won't change that behavior, either (although it will probably be a little quicker to get the same result). If you want to match all the text up to the next angle bracket, you should get rid of the question mark: ([^<]+) .

[^>]*?> doesn't make much sense, either. You have to consume as many non-brackets as there are before you can match the bracket, so what's the point of making that quantifier reluctant? In fact, there's no point making it greedy either; if [^>]* matches as much as it can and the next character isn't '>', you know backtracking won't do any good. You might as well use a possessive quantifier - [^>]*+> - or an atomic group - (?>[^>]*+)> - if your regex flavor supports them.

The first quantified portion - .*? - is the only one that's used correctly (if not optimally). Putting that at the beginning of a regex simulates the behavior of find() when you're using lookingAt() or (with a .* at the end) matches(). However, leaving it off and using find() is more efficient, as you've discovered.

Reluctant quantifiers are very handy, but lately it seems like they've been getting overexposed. With increasing frequency I see people giving the advice "Use reluctant quantifiers" with no explanation or qualification--just another silver bullet. And I believe regexes like the one in this question are the result. Of the three reluctant quantifiers, one should have been greedy, one should have been possessive, and the other shouldn't have been there at all.

EDIT: Here's an example to illustrate some of what I'm talking about, and to address Stephen C's comment. Given this string:

<div id="number" class="whatever">abc123</div>

...the dynamic parts of the regex match like this:

.*?         => '<div '

[^>]*?      => ' class="whatever"'

([^<]+?)    => 'a'

.*          => 'bc123</div>'

Changing all the reluctant quantifiers to greedy doesn't change the overall match (the whole string), and it doesn't change what gets matched by the first two dynamic portions. But the last two get reapportioned:

([^<]+)     => 'abc123'

.*          => '</div>'

Looking at the original regex, I thought this must be the desired result; why use such a complicated subexpression inside a capturing group if not to capture the whole content, 'abc123'? That's what leads me to believe the reluctant quantifiers were used blindly, as a panacea.

One other thing: looking back over the thread, I see the OP didn't actually say he had removed the .*? from the front of the regex when he switched to the find() method. @Ben, if you haven't done that, you should; it's just slowing things down now. That would leave you with this regex:

id="number"[^>]*+>([^<]+)

I don't want anyone to think I'm contesting the accepted answer, either. I'm just scratching this itch I have about the overuse/inappropriate use of reluctant quantifiers.

Alan Moore
Use of greedy versus reluctant quantifiers only affects the order in which the alternatives are tried. If the pattern does not match, the regex engine has to try all alternatives anyway.
Stephen C
@Stephen: yes, but it can change which *parts* of the regex match which parts of the input. I added an example to my answer; it's too complicated for a comment (even with all these nice new formatting options).
Alan Moore