views:

1087

answers:

7

There are clearly lots of problems that look as if a simple regex will solve them, but which prove to be very hard to solve with regex.

So how does someone who is not an expert in regex know whether he/she should be learning regex to solve a given problem?

(See "Regex to parse C# source code to find all strings" for why I am asking this question.)

This seems to sum it up well:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems...

(I have just changed the title of the question to make it more specific, as some of the problems with Regex in C# are solved in Perl and JScript, for example the fact that the two levels of quoting make a Regex so unreadable.)

+12  A: 

Don't try to use regex to parse hierarchical text like program source (or nested XML): regular expressions are provably not powerful enough for that. For example, given a string of parens, they can't figure out whether the parens are balanced or not.

Use parser generators (or similar technologies) for that.
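To illustrate the balanced-parens point: the check needs a counter that can grow without bound, which is exactly what a finite automaton (and so a classic regex) lacks. A minimal sketch in Python, chosen only for illustration; the function name `is_balanced` is made up for this example:

```python
def is_balanced(text: str) -> bool:
    """Return True if every '(' in text is closed by a later ')'."""
    depth = 0  # number of currently open parens; can grow without bound
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its matching '('
                return False
    return depth == 0  # everything that was opened must be closed


print(is_balanced("(()(()))"))  # True
print(is_balanced("(()"))       # False
print(is_balanced("())("))      # False
```

A few lines of ordinary code do what a textbook regular expression cannot, and they are easy to unit test.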

Also, I'd not recommend using regex to validate data with strict formal standards, like e-mail addresses. They're harder than you'd expect, and you'll end up with either an inaccurate regex or a very long one.

alamar
I'd just like to note that certain regular expression engines do support recursive patterns which allow you to match XML and other nested structures.
Blixt
Like which? I'm talking about PCRE mainly.
alamar
Even if you do have support for recursive regular expression patterns, performance-wise you're better off with recursive descent parsers anyway.
Dean Michael
These days, for most code, “performance-wise” means how long it takes to write or maintain the code. Processing very large files/datasets is an exception. However, I think recursive regular expression patterns will add to the maintenance cost.
Ian Ringrose
Tim Pietzcker
+1  A: 

I'm a beginner when it comes to regex, but IMHO it is worthwhile to spend some time learning basic regex. You'll realise that many, many problems you've solved differently could (and maybe should) be solved using regex.

For a particular problem, try to find a solution at a site like regexlib, and see if you can understand the solution.

As indicated above, regex might not be sufficient to solve a specific problem, but browsing a site like regexlib will certainly tell you if regex is the right solution to your problem.

Martijn
I don't think using "magic" strings rather than well-designed objects to control command despatching is elegant. Code should be clear, testable and easy to read.
Ian Ringrose
Of course, you shouldn't use 'magic' regular expressions, but browsing through a library can be very helpful when learning a 'language' (at least, it is for me).
Martijn
A: 

You should always learn regular expressions; only then can you judge when to use them. They normally become problematic when you need very good performance. But often it is a lot easier to use a regex than to write a big switch statement.

Have a look at this question - which shows you the elegance of a regex in contrast to the similar if() construct ...

tanascius
A: 

Use regular expressions for recognizing (regular) patterns in text. Don't use them for parsing text into data structures. Don't use regular expressions when the expression becomes very large.

Often it's not clear when not to use a regular expression. For example, you shouldn't use regular expressions for proper email address verification. At first it may seem easy, but the specification for valid email addresses isn't as regular as you might think. You could use a regular expression for an initial search for email address candidates. But you need a parser to actually verify whether an address candidate conforms to the given standard.
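The two-stage idea could be sketched like this (Python used only for illustration). The pattern `CANDIDATE` and the function `looks_valid` are made up for this example, and the checks are a deliberately simplified stand-in for a real address parser, not an implementation of the email standard. Requiring at least two domain labels is an assumption of the sketch:

```python
import re

# Stage 1: a deliberately loose pattern that only finds address-like text.
CANDIDATE = re.compile(r"[\w.+-]+@[\w.-]+")


def looks_valid(candidate: str) -> bool:
    """Stage 2: structural checks in plain code (simplified, NOT the full standard)."""
    local, _, domain = candidate.rpartition("@")
    if not local or local.startswith(".") or local.endswith(".") or ".." in local:
        return False
    labels = domain.split(".")
    if len(labels) < 2:  # assumption of this sketch: require a dotted domain
        return False
    for label in labels:
        if not label or label.startswith("-") or label.endswith("-"):
            return False
        if not all(c.isalnum() or c == "-" for c in label):
            return False
    return True


text = "Mail alice@example.com, bob@-bad-.org or carol@example..com today"
candidates = CANDIDATE.findall(text)
valid = [c for c in candidates if looks_valid(c)]
print(candidates)  # ['alice@example.com', 'bob@-bad-.org', 'carol@example..com']
print(valid)       # ['alice@example.com']
```

The regex stays short and readable because it is not asked to be exact; the rules live in code where each one can be tested separately.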

elmuerte
+3  A: 

There are two aspects to consider:

  • Capability: Is the language you are trying to recognize a Type-3 language (a regular one)? If so, then you might use regex; if not, you need a more powerful tool.

  • Maintainability: If it takes more time to write, test and understand a regular expression than its programmatic counterpart, then it's not appropriate. Checking this is complicated. I'd recommend peer review with your colleagues (if they say "what the ..." when they see it, then it's too complicated), or just leave it undocumented for a few days, then take another look yourself and measure how long it takes you to understand it.

fortran
A: 

At the very least, I'd say learn regular expressions just so that you understand them fully and are able to apply them in situations where they would work. Off the top of my head I'd use regular expressions for:

  • Identifying parts of a string.
  • Checking whether a string conforms to a certain format or construction.
  • Finding substrings that match a certain pattern.
  • Transforming strings that fit a certain pattern into a different form (search-replace, capitalization, etc.).
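Each of the four uses above maps onto one call in a typical regex library. A small sketch with Python's `re` module (Python is used only for illustration; the log line and the patterns are invented sample data):

```python
import re

line = "2009-06-15 ERROR disk full on /dev/sda1"

# 1. Identify parts of a string (named groups).
m = re.match(r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<msg>.*)", line)
date, level = m.group("date"), m.group("level")

# 2. Check whether a whole string conforms to a format.
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", date) is not None

# 3. Find all substrings that match a pattern.
devices = re.findall(r"/dev/\w+", line)

# 4. Transform matching text into a different form (search-replace).
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", line)

print(date, level)  # 2009-06-15 ERROR
print(is_date)      # True
print(devices)      # ['/dev/sda1']
print(swapped)      # 15/06/2009 ERROR disk full on /dev/sda1
```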

At a theoretical level, regular expressions correspond to finite state machines: in computer science these are the Deterministic Finite Automata (DFA) and Non-deterministic Finite Automata (NFA). You can use regular expressions to enforce some kind of validation on inputs: regular expression engines simply interpret or convert regular expression patterns/strings into actual runtime operations.

Once you know that the string (or data) you want to validate can be tested by a DFA, you have a choice between implementing that DFA yourself in your own code and using a regular expression engine. You'll find that knowing about regular expressions will actually enhance your toolbox and your understanding of how complex string processing can get.
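As a rough illustration of that choice, here is the same small language (optionally signed decimal integers) recognised both ways. This is only a sketch in Python; `SIGNED_INT` and `dfa_accepts` are names made up for the example:

```python
import re

# Option 1: let the regex engine build and run the automaton.
SIGNED_INT = re.compile(r"[+-]?[0-9]+")


# Option 2: write the DFA by hand.
def dfa_accepts(text: str) -> bool:
    # States: "start" (nothing read yet), "sign" (a sign was read),
    # "digits" (at least one digit read; the only accepting state).
    state = "start"
    for ch in text:
        if state == "start" and ch in "+-":
            state = "sign"
        elif ch in "0123456789":
            state = "digits"
        else:
            return False  # no transition defined: reject
    return state == "digits"


for sample in ["42", "-7", "+", "", "1+2", "--3", "007"]:
    by_regex = SIGNED_INT.fullmatch(sample) is not None
    print(repr(sample), by_regex, dfa_accepts(sample))
```

Both columns agree for every sample. The regex is one line, while the hand-written version makes each state and transition explicit.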

Based on simple regular expressions you can then look into learning about parsers and how parsers work. At the lowest level you're looking at lexical analysis (where regular expressions work) and at a higher level a grammar and semantic actions. These are the foundations that compilers and interpreters, as well as protocol parser implementations and document rendering/transformation applications, rely on.
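A toy example of that layering, sketched in Python for illustration: a regex does the lexical analysis, and a small hand-written recursive descent parser applies the grammar and evaluates as it goes (the semantic actions). The grammar, the token names and every function here are invented for the example:

```python
import re

# Lexical level: one regex alternative per token kind.
TOKEN = re.compile(r"(?P<NUM>\d+)|(?P<OP>[-+*()])|(?P<WS>\s+)|(?P<BAD>.)")


def tokenize(text):
    tokens = []
    for m in TOKEN.finditer(text):
        kind = m.lastgroup
        if kind == "WS":
            continue  # whitespace only separates tokens
        if kind == "BAD":
            raise ValueError(f"unexpected character {m.group()!r}")
        tokens.append((kind, m.group()))
    return tokens


# Grammar level: nesting is handled by recursion, not by the regex.
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return (None, None)  # end of input

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):  # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek()[1] in ("+", "-"):
            op = self.take()[1]
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):  # term := factor ('*' factor)*
        value = self.factor()
        while self.peek()[1] == "*":
            self.take()
            value *= self.factor()
        return value

    def factor(self):  # factor := NUM | '(' expr ')'
        kind, text = self.take()
        if kind == "NUM":
            return int(text)
        if text == "(":
            value = self.expr()
            if self.take()[1] != ")":
                raise ValueError("expected ')'")
            return value
        raise ValueError(f"unexpected token {text!r}")


def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek()[0] is not None:
        raise ValueError("trailing input")
    return value


print(evaluate("2 * (3 + 4) - 5"))  # 9
```

The regex only has to recognise flat tokens, which is the part it is good at; the parenthesised nesting is left to the parser.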

Dean Michael
+1  A: 

The main concern here is maintainability.

It is obvious to me that any programmer worth his salt must know regular expressions. Not knowing them is like, say, not knowing what abstraction and encapsulation are, only, probably, worse. So this is out of the question.

On the other hand, one should consider that maintaining regex-driven code (written in any language) can be a nightmare even for someone who is really good at regexes. So, in my opinion, the correct approach here is to use them only when it is inevitable and when the code using regexes will be more readable than its non-regex variant. And, of course, as has already been indicated, do not use them for something that they are not meant to do (like XML). And no email address validation either (one of my pet peeves :P)!

But seriously, doesn't it feel wrong when you use all those substrs for something that can be solved with a handful of characters that look like line noise? I know it did for me.

shylent
At least with "all those substrs" I can split up the problem and have unit tests for each bit. However, I think part of the problem is that in over 15 years as a programmer, I haven't had to do string processing (apart from combining strings for output to a UI) more than 1 or 2 times a year. Abstraction and encapsulation I have to do most days…
Ian Ringrose
I don't think it makes sense to use regex only when it's inevitable (it's never inevitable). Use it whenever the benefits of its terseness and power outweigh the cost of its complexity.
John M Gant