Are there common tasks that people often use regex for (Java flavor) where the usual pattern is:

  • incorrect due to various corner cases (but works "most of the time")
  • correct but very slow
  • etc...

Also, more generally, what other tips are there for people who use regexes in Java? I find that I often struggle due to:

  • lack of named capturing groups
  • poor readability due to gratuitous escaping
  • etc...
+2  A: 

I'm often tempted to use a 'while not equals' type of pattern, i.e. `"([^"]*)"`, to grab everything within quotes, but I often think this is inefficient and too 'capture-all' in design.

Sorry, this isn't an answer, but I'm interested in thoughts on this, and it is pretty closely related.

pstanton
No, this is precisely the kind of discussion that I'm looking for. I haven't had extensive experience with regexes to know what is and isn't efficient, what are the dos/don'ts, and in particular, the best practices for using it within Java.
polygenelubricants
What's too "capture-all" about that? `[^"]*` matches everything you want and nothing else, with no backtracking. The only way it could be too greedy is if the closing quote is missing, and you can deal with that by making the `*` possessive: `"([^"]*+)"`
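
For illustration, here is a minimal sketch of the possessive version in Java. The class and method names (`QuotedStrings`, `extract`) are made up for this example, and it assumes quoted strings contain no escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedStrings {
    // "([^"]*+)" as a Java string literal. The possessive *+ never gives
    // characters back, so a missing closing quote fails without backtracking.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*+)\"");

    /** Returns the contents of every double-quoted run in the input. */
    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = QUOTED.matcher(input);
        while (m.find()) {
            result.add(m.group(1)); // group 1 is the text between the quotes
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("say \"hello\" and \"world\"")); // [hello, world]
        System.out.println(extract("missing \"closing quote"));      // []
    }
}
```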
Alan Moore
Thanks Alan, I'll read up on that and test it out soon.
pstanton
+2  A: 

Not using precompiled patterns, more on this here.
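
A rough sketch of what this usually means in practice (the class and method names are invented for the example, and this is my reading of the point, not necessarily what the linked discussion shows): `String.matches` compiles its regex on every call, whereas a `Pattern` held in a constant is compiled once and reused.

```java
import java.util.regex.Pattern;

class PrecompiledPattern {
    // Compiled once; Pattern instances are immutable and safe to share between threads.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Recompiles the regex on every call, because String.matches
    // delegates to Pattern.matches(regex, input).
    static boolean isNumberRecompiling(String s) {
        return s.matches("\\d+");
    }

    // Reuses the precompiled pattern; only a cheap Matcher is created per call.
    static boolean isNumberPrecompiled(String s) {
        return DIGITS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNumberRecompiling("12345")); // true
        System.out.println(isNumberPrecompiled("12345")); // true
        System.out.println(isNumberPrecompiled("12a45")); // false
    }
}
```

The two methods return the same results; the difference only matters when the check runs many times, e.g. inside a loop.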

ponzao
+1  A: 

Counting the number of backslashes required to escape the String and the regular expression correctly.
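
To make the double escaping concrete, here is a small sketch (class and method names are hypothetical). Matching one literal backslash takes the regex `\\`, which in turn needs four backslashes in a Java string literal. When the text to match is meant literally, `Pattern.quote` avoids the counting altogether.

```java
import java.util.regex.Pattern;

class BackslashCounting {
    // Regex \\ matches one literal backslash; as a Java literal that is "\\\\".
    private static final Pattern BACKSLASH = Pattern.compile("\\\\");

    static boolean containsBackslash(String s) {
        return BACKSLASH.matcher(s).find();
    }

    // Pattern.quote wraps the text so that no character in it is treated
    // as a regex metacharacter.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsBackslash("C:\\temp"));         // true (the string holds one backslash)
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));     // false: + is a quantifier here
        System.out.println(matchesLiterally("1+1=2", "1+1=2"));    // true
    }
}
```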

ScArcher2
+3  A: 

Using Java regexes to analyse structured text when you should really be using a proper parser; e.g.

  • quoted strings with escape sequences
  • balanced brackets
  • XML and HTML
  • any programming language.

In most cases, you can find an existing Java parser.
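
As a small illustration of the balanced-brackets case, here is a hand-rolled stack-based check. It is only a sketch with made-up names (`BalancedBrackets`, `isBalanced`), not one of the existing parsers mentioned above, and it deliberately ignores brackets inside quoted strings or comments.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BalancedBrackets {
    /** Returns true if every (, [ and { is closed by the matching bracket in the right order. */
    static boolean isBalanced(String text) {
        Deque<Character> open = new ArrayDeque<>();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '(': case '[': case '{':
                    open.push(c); // remember the opener until its partner shows up
                    break;
                case ')': case ']': case '}':
                    if (open.isEmpty() || !pairs(open.pop(), c)) {
                        return false; // closer with no opener, or the wrong kind
                    }
                    break;
                default:
                    break; // any other character is irrelevant to nesting
            }
        }
        return open.isEmpty(); // leftover openers mean the input is unbalanced
    }

    private static boolean pairs(char opening, char closing) {
        return (opening == '(' && closing == ')')
            || (opening == '[' && closing == ']')
            || (opening == '{' && closing == '}');
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("(a[b]{c})")); // true
        System.out.println(isBalanced("(a[b)]"));    // false
    }
}
```

The stack is what a regular expression lacks: it can track nesting of arbitrary depth.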

EDIT - in response to the comments about using regexes to handle malformed input.

Firstly, let us be clear that we are not talking about browsers rejecting webpages. We are talking about Java software using regexes to "parse" things.

My point is that if you use a regex "parser" to deal with malformed XML or HTML, you have no guarantees that it is actually going to deal with it correctly. For example, if the next file is malformed in a different way ... or is well-formed but different to what your regexes were coded to expect, then your regex is liable to give you garbage.

By contrast, if you use a strict parser, you will know if the input is well-formed, and will have an opportunity to take appropriate steps to fix the problem.
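
A minimal sketch of that idea using the JDK's built-in DOM parser (the class and method names are invented for the example; a real application would report the error details and not just return a boolean):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StrictXmlCheck {
    /** Returns true if the strict parser accepts the input, false if it rejects it. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // the parser told us the input is malformed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e); // not expected for in-memory input
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b/></a>"));  // true
        System.out.println(isWellFormed("<a><b></a>"));   // false: <b> is never closed
    }
}
```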

And if you use a permissive / error correcting parser, then you will at least get the mistakes handled / fixed in a consistent manner. This applies particularly to the HTML case.

Finally, if you are frequently having to deal with broken XML, you should be 1) "pushing back" to get the software that produced it fixed, and/or 2) considering writing a custom parser to convert the broken stuff into well-formed XML.

Stephen C
**This is the best advice yet!** Learn to recognize the differences between a "regular expression" and a "context free grammar" - It will save you many hours of frustration. Too many people think they have found ways to parse text described using a context free grammar with regex. It will not work - it cannot be done no matter how clever you think you are.
NealB
Yeah, it's good *general* advice, but none of that is specific to Java.
Alan Moore
@NealB: in addition to having not much to do with the question as commented by Alan Moore, you have the issue of malformed inputs. Browsers are notorious for being able to correctly display malformed inputs which is quite a feat and which ain't achieved as you think it is.
Webinator
@Wizard - surely a regex is not going to help you deal with malformed input. Indeed, it might even make things worse. The best way to deal with malformed input is to either reject it outright, or write / use a parser (like HTMLTidy) that will understand it.
Stephen C
@Stephen C: of course regexps help with malformed inputs. I've even seen posts here about how regexps can save the day when parsing truncated/malformed XML files :) Rejecting malformed input is not always possible. In the case of malformed webpages it is downright unacceptable to refuse to render them (or people will start using another browser because too many pages are malformed), which is why all web browsers "do their best" with malformed input. Other valid uses that have been pointed out here are to save "what can be saved" once the sh!t has hit the fan and the only working copy is garbled.
Webinator
+1  A: 

I think the most idiosyncratic feature of Java regexes is lookbehinds. It's the only flavor I know of that supports variable-length lookbehinds, but only if the maximum possible length can be determined ahead of time. By comparison, the .NET and JGSoft (RegexBuddy, EditPadPro, PowerGrep) flavors allow unlimited-length lookbehinds, while most others require them to be fixed-length. For example:

(?<=\w{3})     // all
(?<=\w+)       // .NET and JGSoft only
(?<=\w{1,3})   // .NET, JGSoft and Java

There are also the PCRE (PHP 5) and Oniguruma (Ruby 1.9) flavors, which allow different-length alternatives in a lookbehind, but only if each alternative is fixed-length, and only at the top level. So (?<=MILK|HONEY) is okay, but (?<=(MILK|HONEY)) isn't.

If you try to use an unlimited-length lookbehind in Java, it's supposed to throw an exception at compile time (that is, when you try to create the Pattern object). Unfortunately it doesn't always do that, and the behavior of the resulting Pattern is undefined. A bug report has been filed against JDK 1.6, but the bug appears to have been present since JDK 1.4.

It's not as big a problem as you might think, though. For example, (?<=\w+) works as you would expect, even though it shouldn't compile. It's only when the lookbehind starts with an indeterminate component like \w+ and contains at least one other besides (e.g., (?<=\w+:\w+)) that it fails to match valid inputs. If the initial component is determinate (e.g., (?<=:\w+)) a PatternSyntaxException is thrown at compile time as it should be.
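
If you want to see how your own JDK treats these patterns, something like the following sketch will do. The names `LookbehindCheck`, `compiles` and `firstNumberAfterId` are made up for the example. The outcome for the unbounded patterns may differ from what is described above depending on the JDK version, so the program only reports it and does not assume a result; the bounded lookbehind is the only form relied on for actual matching.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class LookbehindCheck {
    /** Reports whether this JDK accepts the regex at compile time. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Bounded variable-length lookbehind: "id" followed by one to three
    // whitespace characters must precede the digits.
    private static final Pattern NUMBER_AFTER_ID = Pattern.compile("(?<=id\\s{1,3})\\d+");

    /** Returns the first number preceded by "id" and 1-3 whitespace characters, or null. */
    static String firstNumberAfterId(String input) {
        Matcher m = NUMBER_AFTER_ID.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] candidates = {
            "(?<=\\w{3})",     // fixed length
            "(?<=\\w{1,3})",   // bounded length
            "(?<=\\w+)",       // unbounded: result depends on the JDK
            "(?<=:\\w+)",      // unbounded: result depends on the JDK
            "(?<=\\w+:\\w+)"   // unbounded: result depends on the JDK
        };
        for (String regex : candidates) {
            System.out.println(regex + " compiles: " + compiles(regex));
        }
        System.out.println(firstNumberAfterId("id  42")); // 42
    }
}
```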

Alan Moore
Thanks for a very detailed answer on a very subtle quirk of Sun's regex.
polygenelubricants