tags:

views:

894

answers:

13

Please don't answer the obvious, but what are the limit signs that tell us a problem should not be solved using reg exprs.

For example: Why is a complete email validation too complex for a regular expression?

A: 

My limit is a Regex pattern that's about 30-50 characters long (varying depending on how much is fixed text and how much is regex commands)

James Curran
A: 

A problem is too complex for regular expressions when constraints of the problem can change after the solution is written. So, in your example, how can you be sure an email address is valid when you do not have access to the target mail system to verify that the email address is attached to a valid user? You can't.

BoltBait
Validating wether or not an email adheres to the RFC 2822 standard is not the same problem of determining if that email address is actually in use.
Simucal
True. But the original question was "When is a problem too complex for a regular expression?" and I gave an example of a problem that is too complex for regular expressions. So, to whoever down-voted me, "you're a lamer!" :P
BoltBait
+3  A: 

Solve the problem with a regex, then give it to somebody else conversant in regexes. If they can't tell you what it does (or at least say with confidence that they understand) in about 10 minutes, it's too complex.

Coderer
+7  A: 

When you need to parse an expression that's not defined by a regular language.

Adam Rosenfield
Well, there are Perl extensions. They bring regexps out of class of regular languages.
ADEpt
I would like to see a more pragmatical approach, but this is the right answer so far.
Null303
Following the link... "it can be described by a formal regular expression." Your definition is circular. :P
BoltBait
That is just one of the 10 equivalent definitions.
Adam Rosenfield
SO is so hopelessly biased towards extremely minimalist, virtually useless answers.
temp2290
+4  A: 

Here's a good quote from Raymond Chen:

Don't make regular expressions do what they're not good at. If you want to match a simple pattern, then match a simple pattern. If you want to do math, then do math. As commenter Maurits put it, "The trick is not to spend time developing a combination hammer/screwdriver, but just use a hammer and a screwdriver.

Source

Niniki
+2  A: 

Sure sign to stop using regexps is this: if you have many grouping braces '()' and many alternatives '|' then it is a sure sign that you try to do a (complex) parsing with regular expressions.

Add to the mix Perl extensions, backreferences, etc and soon you have yourself a parser that is hard to read, hard to modify, and hard to reason about it's properties (e.g. is there an input on which this parser will work in a exponential time).

This is a time to stop regexing and start parsing (with hand-made parser, parser generators or parser combinators).

ADEpt
+2  A: 
sergdev
I am very interested in the Mathematics behind the topic. Please, can you say where you got your expression. It is hard for me to understand things without Mathematics.
Masi
I cannot understand the real meaning of limitations here. You cannot make conditions like: 'if (n = k) print "a" seven times;' You cannot write a if-sentence in regex?
Masi
If have understood right, you cannot do an impliction in Regex(or let say it is not its main purpose). Hence, you cannot do things like parenthesis matching or "write regexp for a word described by n chars a". Regex is only about matching.
Masi
you can write the word "n chars a", but you can't write word "n times a, n times b for any n". Note for limited n you can do so. The math can be find at http://en.wikipedia.org/wiki/Regular_language.
sergdev
+10  A: 

Regular expressions are a textual representation of finite-state automata. That is to say, they are limited to only non-recursive matching. This means that you can't have any concept of "scope" or "sub-match" in your regexp. Consider the following problem:

(())()

Are all the open parens matched with a close paren?

Obviously, when we look at this as human beings, we can easily see that the answer is "yes". However, no regular expression will be able to reliably answer this question. In order to do this sort of processing, you will need a full pushdown automaton (like a DFA with a stack). This is most commonly found in the guise of a parser such as those generated by ANTLR or Bison.

Daniel Spiewak
.NET regex flavor is extended to be able to solve the parentheses matching problem (http://msdn.microsoft.com/en-us/library/bs2twtah.aspx). It seems they cheated in Microsoft. :)
Alexander Prokofyev
+1  A: 

Whenever you can't be sure it really solves the problem, for example:

  • HTML parsing
  • Email validation
  • Language parsers

Especially so when there already exist tools that solve the problem in a totally understandable way.

Regex can be used in the domains I mentioned, but only as a subset of the whole problem and for specific, simple cases.

This goes beyond the technical limitations of regexes (regular languages + extensions), the maintainability and readability limit is surpassed a lot earlier than the technical limit in most cases.

Vinko Vrsalovic
A: 

This may sound stupid but I often lament not being able to do database type of queries using regular expression. Now especially more then before because I am entering those types of search string all the time on search engines. its very difficult, if not impossible to search for +complex AND +"regular expression"

For example, how do I search in emacs for commands that have both Buffer and Window in their name? I need to search separately for .*Buffer.*Window and .*Window.*Buffer

Hemal Pandya
+7  A: 

What it comes down to is using common sense. If what you are trying to match becomes an unmanageable, monster regular expression then you either need to break it up into small, logical sub-regular expressions or you need to start re-thinking your solution.

Take email addresses (as per your example). This simple regular expression (taken from RegEx buddy) matches 99% of all emails out there:

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b

It is short and to the point and you will rarely run into issues with it. However, as the author of RegEx buddy points out, if your email address is in the rare top-level domain ".museum" it will not be accepted.

To truely match all email addresses you need to adhere to the standard known as RFC 2822. It outlines the multitude of ways email addresses can be formatted and it is extremely complex.

Here is a sample regular expression attempting to adhere to RFC 2822:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x
0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]
(?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)
{3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08
\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

This obviously becomes a problem of diminishing returns. It is better to use the easily maintained implementation that matches 99% of email addresses vs the monsterous one that accepts 99.9% of them.

Regular expressions are a great tool to have in your programmers toolbox but they aren't a solution to all your parsing problems. If you find your RegEx solution starting to become extremely complex you need to either attempt to logically break it up into smaller regular expressions to match portions of your text or you need to start looking at other methods to solve your problem. Similarly, there are simply problems that Regular Expressions, due to their nature, can't solve (as one poster said, not adhering to Regular Language).

Simucal
are you sure you don't mean this on the first example?\b[A-Z0-9._%+-]+@[A-Z0-9-]+\.[A-Z]{2,4}\b
Keng
Can you highlight the difference between what you just posted and what I have? I dont see it.
Simucal
The difference is the dot in the second set of brackets (Simucal's would match any character [besides newline depending on arguments] before the literal dot), and neither of those is really great (they don't allow for subdomains, for example).
eyelidlessness
+7  A: 

A few things to look out for:

  1. beginning and ending tag detection -- matched pairing
  2. recursion
  3. needing to go backwards (though you can reverse the string, but that's a hack)

regexes, as much as I love them, aren't good at those three things. And remember, keep it simple! If you're trying to build a regex that does "everything", then you're probably doing it wrong.

Jeff Atwood
+3  A: 

Regular expressions are suited for tokenizing, finding or identifying individual bits of text, e.g. finding keywords, strings, comments, etc. in source code.

Regular expressions are not suited for determining the relationship between multiple bits of text, e.g. finding a block of source code with properly paired braces. You need a parser for that. The parser can use regular expressions for tokenizing the input, while the parser itself determines how the different regex matches fit together.

Essentially, you're going to far with your regular expressions if you start thinking about "balancing groups" (.NET's capture group subtraction feature) or "recursion" (Perl 5.10 and PCRE).

Jan Goyvaerts