Any code I've seen that uses Regexes tends to use them as a black box:

  1. Put in string
  2. Magic Regex
  3. Get out string

This doesn't seem like a particularly good idea for production code, as even a small change can often result in a completely different regex.

Apart from cases where the standard is permanent and unchanging, are regexes the way to do things, or is it better to try different methods?

+14  A: 

Obligatory.

It really comes down to the regex. If it's this huge monolithic expression, then yes, it's a maintainability problem. If you can express them succinctly (perhaps by breaking them up), or if you have good comments and tools to help you understand them, then they can be a powerful tool.

Joel Coehoorn
Nothing beats a good comment, even with a simple regular expression. Not all members of my team understand them so a good comment explaining what it is doing (sometimes with a key) is invaluable for maintenance.
Jeff Yates
+3  A: 

Regexes aren't the ONLY way to do something. Anything a regular expression does can also be done with ordinary code. Regular expressions are just:

  1. Fast
  2. Tested and Proven
  3. Powerful
Darren Kopp
Regular expressions are actually quite expensive... but you are right, they are powerful.
camflan
Regular expressions are expensive until you start using them multiple times. Anything dealing with strings is expensive, but a regular expression will probably work better than looping through each string, seeing if it contains text, then doing the next thing you want to match on.
Darren Kopp
expensive how? If you only need a match/no match response then they're O(N), otherwise they can be exponential, but so would the equivalent non-RE way of searching for the same thing: http://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times
tloach
+5  A: 

Complex regexes are fire-and-forget for me. Write it, test it, and when it works, write a comment what it does and we're fine.

In many cases, however, you can break regular expressions down into smaller parts, and maybe write some well-documented code that combines these regexes. But if you find a multi-line regex in your code, you had better not be the one who must maintain it :)

Sounds familiar? That's more or less true of any code. You don't want to have very long methods, you don't want to have very long classes, and you don't want to have very long regular expressions, though methods and classes are by far easier to refactor. But in essence, it's the same concept.

OregonGhost
+2  A: 

A famous quote about regexes:

"Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems." -- Jamie Zawinski

When I do use regexes, I find them to be maintainable, but they are used in special cases. There is usually a better, non-regex method for doing almost everything.

camflan
I like the quote :)
OregonGhost
+6  A: 

It only seems like magic if you don't understand the regex. Any number of small changes in production code can cause major problems, so that is not a good reason, in my opinion, to avoid regexes. Thorough testing should point out any problems.

DMKing
>> It only seems like magic if you don't understand the regex. << I think that's the point of Rich's question. Complex regex strings can be very opaque and difficult to understand, not to mention debug.
Michael Burr
Agree with Mike B. Some coding god may immediately understand a page-long regex, but the power of the regex comes with a price for most regular developers :)
OregonGhost
I think the underlying point that you're getting at here is most programmers don't understand regexes very well. They're an extremely important tool. This is an educational deficiency, not a coding deficiency.
rmeador
@Mike: The same can be said of any complex code. The difference is that developers are trained to understand the code. They also need to be trained to understand regexes; it's a similar skill, so it shouldn't be too difficult.
tloach
+2  A: 

When used consciously, regular expressions are a powerful mechanism that spares you from writing lines and lines of text-parsing code. They should of course be documented properly and tracked, so that you can verify whether the initial assumptions are still valid and update them if not. Regarding maintenance, IMHO it is better to change a single line of code (the regular expression pattern) than to understand lines and lines of parsing code that serve the same purpose.

smink
+22  A: 

If regexes are long and impenetrable, and therefore hard to maintain, then they should be commented.

A lot of regex implementations allow you to pad regexes with whitespace and comments.
See http://www.regular-expressions.info/comments.html
and Coding Horror: Regular Expressions: Now You Have Two Problems

"Any code I've seen that uses Regexes tends to use them as a black box:"

If by black box you mean abstraction, that's what all programming is, trying to abstract away the difficult part (parsing strings) so that you can concentrate on the problem domain (what kind of strings do I want to match).

"even a small change can often result in a completely different regex."

That's true of any code. As long as you are testing your regex to make sure it matches the strings you expect, ideally with unit tests, then you should be confident about changing it.
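A minimal sketch of what such tests could look like, in Python. The `ISO_DATE` pattern and the sample strings are made up for illustration; the test functions are plain asserts that a runner such as pytest would pick up.

```python
import re

# Hypothetical pattern for illustration: a date such as 2008-10-01.
ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def test_matches_expected_strings():
    # Strings the regex is supposed to accept.
    for text in ("2008-10-01", "1999-12-31"):
        assert ISO_DATE.fullmatch(text) is not None, text


def test_rejects_unexpected_strings():
    # Near misses are the interesting cases to pin down.
    for text in ("2008-10-1", "08-10-01", "2008/10/01", ""):
        assert ISO_DATE.fullmatch(text) is None, text


def test_captures_the_parts():
    match = ISO_DATE.fullmatch("2008-10-01")
    assert match.groups() == ("2008", "10", "01")
```

If someone later "improves" the pattern, the near-miss cases are what will catch an accidental change in behaviour.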

Edit: please also read Jeff's comment to this answer about production code.

Sam Hasler
Indeed! Re: unit tests. They are living, breathing documentation of intent.
Michael Easter
Changing them in production code should NEVER make you feel comfortable. Rather, they should be changed on your test server (which should always be identical to your production server, except where your test code is different), tested, and pushed to your prod server. Changing prod code: Bad, Mkay?
Jeff
+6  A: 

Small changes to any code in any language can result in completely different results. Some of them even prevent compilation.

Substitute regex with "C" or "C#" or "Java" or "Python" or "Perl" or "SQL" or "Ruby" or "awk" or ... anything, really, and you get the same question.

Regex is just another language, Huffman coded to be efficient at string matching. Just like Java, Perl, PHP, or especially SQL, each language has strengths and weaknesses, and you need to know the language you're writing in when you're writing it (or maintaining it) to have any hope of being productive.

Edit: Mike, regexes are Huffman coded in that common things to do are shorter than rarer things. A literal match of text is generally a single character (the one you want to match). Special characters exist - the common ones are short. Special constructs, such as (?:), are longer. These are not the same things that would be common in general-purpose languages like Perl, C++, etc., so the Huffman coding was targeted at this specialisation.

Tanktalus
Exactly what I would have written... so "just" an up-vote.
Anheledir
What does Huffman coding have to do with regular expressions?
Michael Burr
A: 

I use them in my apps but I keep the actual regEx expression in the configuration file so if the source text I'm parsing (an email for example) changes format for some reason I can quickly update the config to handle the change without re-building the app.
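A rough sketch of that idea in Python. The section name, key name and pattern are invented for the example, and the config text is inlined as a string so the snippet is self-contained; in a real app it would be read from a file.

```python
import configparser
import re

# Stand-in for the contents of a hypothetical settings.ini file.
CONFIG_TEXT = r"""
[parser]
order_id_pattern = Order\s+#(\d+)
"""


def load_pattern(config_text, section, key):
    """Read a regex from config text and compile it, failing loudly if it is invalid."""
    # interpolation=None keeps characters such as % in the pattern untouched.
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(config_text)
    raw = config.get(section, key)
    try:
        return re.compile(raw)
    except re.error as exc:
        raise ValueError(f"bad regex in [{section}] {key}: {exc}") from exc


ORDER_ID = load_pattern(CONFIG_TEXT, "parser", "order_id_pattern")
match = ORDER_ID.search("Subject: Order #4821 has shipped")
print(match.group(1))  # prints 4821
```

Compiling at load time means a typo in the config shows up at startup rather than on the first email that happens to hit that pattern.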

Ron

Ron Savage
+8  A: 

I don't know which language you're using, but Perl - for example - supports the x flag, so spaces are ignored in regexes unless escaped, so you can break it into several lines and comment everything inline:

$foo =~ m{
    (some-thing)          # matches something
    \s*                   # matches any amount of spaces
    (match another thing) # matches something else
}x;

This helps make long regexes more readable.

jkramer
Python supports this too, via the re.VERBOSE flag :)
camflan
To someone that knows regexes, those comments are equivalent to "i++; // Adds one to i"
Zan Lynx
I doubt jkramer was suggesting those as the exact comments, but merely pointing out the ability to do that. (taking examples too literally)
Tanktalus
Well yes, but I do see code, Perl especially, commented in this way. Instead of using comments explaining regex basics or using "unless" after a keyword, or using short-circuit evaluation of "or", people need to learn Perl syntax.
Zan Lynx
Perl 6 does/will do this implicitly.
Brad Gilbert
A: 

Regex has been referred to as a "write only" programming language, for sure. However, I don't think that means you should avoid them. I just think you should comment the hell out of their intent. I'm usually not a big fan of comments that explain what a line does (I can read the code for that), but regexes are the exception. Comment everything!

WaldenL
+2  A: 

Are regexes the way to do things? It depends on the task.

As with all things programming, there isn't a hard and fast right or wrong answer.

If a regexp solves a particular task quickly and simply, then it's possibly better than a more verbose solution.

If a regexp is trying to achieve a complicated task, then something more verbose might be simpler to understand and therefore maintain.

SpoonMeiser
+1  A: 

I have a policy of thoroughly commenting non-trivial regexes. That means describing and justifying each atom that doesn't match itself. Some languages (Python, for one) offer "verbose" regexes that ignore whitespace and allow comments; use this whenever possible. Otherwise, go atom by atom in a comment above the regex.
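For illustration, a small sketch of a "verbose" regex in Python using `re.VERBOSE`. The 24-hour time pattern is just an invented example; the point is the layout, with one atom per line and a comment that says why it is there.

```python
import re

# Hypothetical example: a 24-hour time such as 09:30 or 23:59:07.
TIME_24H = re.compile(
    r"""
    (?P<hour>[01]\d|2[0-3])         # hour: 00-19 or 20-23
    :                               # literal separator
    (?P<minute>[0-5]\d)             # minute: 00-59
    (?: : (?P<second>[0-5]\d) )?    # seconds are optional
    """,
    re.VERBOSE,
)

match = TIME_24H.fullmatch("23:59:07")
print(match.group("hour"), match.group("minute"), match.group("second"))
```

Note that in verbose mode a literal space has to be escaped or put in a character class, since unescaped whitespace in the pattern is ignored.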

skymt
A: 

I usually go to the extent of writing a scanner specification file. A scanner, as produced by a "scanner generator", is essentially an optimized text parser. Since I usually work with Java, my preferred tool is JFlex (http://www.jflex.de), but there are also Lex (commonly paired with the parser generator YACC) and several others.

Scanners work on regular expressions that you can define as macros. Then you implement callbacks when the regular expressions match part of the text.

When it comes to the code, I have a specification file containing all the parsing logic. I run it through the scanner generator tool of choice to generate the source code in the language of choice. Then I just wrap all that into a parser function or class of some sort. This abstraction makes it easy to manage all the regular expression logic, and it performs very well. Of course, it is overkill if you are working with just one or two regexps, and it easily takes at least 2-3 days to learn what the hell is going on. But if you ever work with, say, 5 or 6 or 30 of them, it becomes a really nice feature: implementing parsing logic starts to take only minutes, and the rules stay easy to maintain and easy to document.
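This is not JFlex, but the same shape can be sketched in a few lines of Python with the standard `re` module: a table of named rules, combined into one scanner, with a small dispatch on whichever rule matched. The token names and the toy arithmetic language are assumptions made up for the example.

```python
import re

# Hypothetical token rules for a tiny arithmetic language.
# Order matters: earlier rules win, and ERROR catches anything left over.
TOKEN_RULES = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
    ("ERROR", r"."),
]
SCANNER = re.compile("|".join(f"(?P<{name}>{rule})" for name, rule in TOKEN_RULES))


def tokenize(text):
    """Return a list of (kind, text) pairs, skipping whitespace."""
    tokens = []
    for match in SCANNER.finditer(text):
        kind = match.lastgroup  # name of the rule that matched
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise ValueError(f"unexpected character {match.group()!r} at {match.start()}")
        tokens.append((kind, match.group()))
    return tokens


print(tokenize("total = price * 1.2"))
```

Each rule stays small enough to read on its own, which is most of the maintainability win; a real scanner generator adds speed and better error reporting on top.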

Josh
+1  A: 

The problem is not with the regexes themselves, but rather with their treatment as a black box. As with any programming language, maintainability has more to do with the person who wrote it and the person who reads it than with the language itself.

There's also a lot to be said for using the right tool for the job. In the example you mentioned in your comment to the original post, a regex is the wrong tool to use for parsing HTML, as is mentioned rather frequently over on PerlMonks. If you try to parse HTML in anything resembling a general manner using only a regex, then you're going to end up either doing it in an incorrect and fragile manner, writing a horrendous and unmaintainable monstrosity of a regex, or (most likely) both.
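As a sketch of "the right tool" in Python terms, here is link extraction done with the standard library's `html.parser` instead of a regex. The `LinkCollector` class and the sample markup are invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over tag and attribute names already lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = '<p>See <a class="x" href="/docs">the docs</a> or <A HREF=\'/faq\'>FAQ</A>.</p>'
print(extract_links(sample))  # prints ['/docs', '/faq']
```

Attribute order, quoting style and tag case are all handled by the parser, which are exactly the things a hand-written regex tends to get wrong.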

Dave Sherohman
+2  A: 

There are a lot of ways to make RegEx more maintainable. In the end it's just a technique a (good?) programmer has to learn when it comes to major (or sometimes even minor) changes. If they didn't have some really good advantages, no one would bother with them because of their complex syntax. But they are fast, compact and very flexible in doing their job.

For .NET people, the "Linq to RegEx" library or the "Readable Regular Expressions Library" could be worth a look. They make regexes easier to maintain and also easier to write. I used both of them in my own projects, because I knew the HTML source code I analysed with them could change at any time.

But trust me: once you cotton on to them, they can even be fun to write and read. :)

Anheledir
A: 

I've always approached this issue as a building-block problem.

You don't just write some 3000 character regex and hope for the best. You write a bunch of small chunks that you add together.

For example, to match a URI, you have the protocol, authority, subdomain, domain, tld, path, arguments (at least). And some of these are optional!

I'm sure you could write one monster to handle it, but it's easier to write chunks and add them together.
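A rough Python sketch of the chunk approach. Every piece here is deliberately simplified and the names are my own; real URIs allow far more than this, so treat it as an illustration of the structure rather than a URI validator.

```python
import re

# Each chunk is small enough to read and test on its own.
SCHEME = r"(?P<scheme>[a-z][a-z0-9+.-]*)"
HOST = r"(?P<host>[a-z0-9-]+(?:\.[a-z0-9-]+)*)"
PORT = r"(?::(?P<port>\d+))?"            # optional
PATH = r"(?P<path>/[^?#\s]*)?"           # optional
QUERY = r"(?:\?(?P<query>[^#\s]*))?"     # optional

# The final pattern is just the chunks added together.
URI = re.compile(f"{SCHEME}://{HOST}{PORT}{PATH}{QUERY}", re.IGNORECASE)

match = URI.fullmatch("http://www.example.com:8080/search?q=regex")
print(match.groupdict())
```

When the format changes, you adjust one chunk and rerun its tests instead of re-deriving the whole expression.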

warren
+2  A: 

RegExs can be very maintainable, if you utilize new features introduced by Perl 5.10. The features I refer to are back-ported features from Perl 6.

Example copied directly from perlretut.

Defining named patterns

Some regular expressions use identical subpatterns in several places. Starting with Perl 5.10, it is possible to define named subpatterns in a section of the pattern so that they can be called up by name anywhere in the pattern. This syntactic pattern for this definition group is (?(DEFINE)(?<name>pattern)...). An insertion of a named pattern is written as (?&name).

The example below illustrates this feature using the pattern for floating point numbers that was presented earlier on. The three subpatterns that are used more than once are the optional sign, the digit sequence for an integer and the decimal fraction. The DEFINE group at the end of the pattern contains their definition. Notice that the decimal fraction pattern is the first place where we can reuse the integer pattern.

/^
  (?&osg)\ * ( (?&int)(?&dec)? | (?&dec) )
        (?: [eE](?&osg)(?&int) )?
 $
 (?(DEFINE)
     (?<osg>[-+]?)         # optional sign
     (?<int>\d++)          # integer
     (?<dec>\.(?&int))     # decimal fraction
 )
/x
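For comparison, Python's built-in `re` module has no `(?(DEFINE)...)` or `(?&name)`, but a similar effect can be approximated by naming the subpatterns as ordinary strings and interpolating them. This is only a sketch of the same floating point pattern: the variable names are mine, and it uses a plain `\d+` where the Perl version uses the possessive `\d++`.

```python
import re

# The same three building blocks as the Perl example, as ordinary strings.
OSG = r"[-+]?"        # optional sign
INT = r"\d+"          # integer
DEC = rf"\.{INT}"     # decimal fraction, reusing the integer pattern

FLOAT = re.compile(
    rf"""
    ^
    {OSG} \ * (?: {INT} (?:{DEC})? | {DEC} )   # sign, spaces, mantissa
    (?: [eE] {OSG} {INT} )?                    # optional exponent
    $
    """,
    re.VERBOSE,
)

for text in ("-12.5e+3", ".5", "12.", "abc"):
    print(repr(text), bool(FLOAT.match(text)))
```

The trade-off is that the reuse happens in the host language rather than inside the regex, so recursive patterns are out of reach this way.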
Brad Gilbert
+1  A: 

I commonly split up the regex into pieces with comments, then put them all together for the final push. Pieces can be either substrings or array elements.

Two PHP PCRE examples (specifics of the particular use are not important):

1)
  $dktpat = '/^[^a-z0-9]*'. // skip any leading non-alphanumeric characters
    '([a-z0-9]:)?'. // division within the district
    '(\d+)'. // year
    '((-)|-?([a-z][a-z])-?)'. // type of court if any - cv, bk, etc.
    '(\d+)'. // docket sequence number
    '[^0-9]*$/i'; // ignore anything after the sequence number
  if (preg_match($dktpat,$DocketID,$m)) {
    // ... work with the captured groups in $m
  }

2)
    $pat= array (
      'Row'        => '\s*(\d*)',
      'Parties'    => '(.*)',
      'CourtID'    => '<a[^>]*>([a-z]*)</a>',
      'CaseNo'     => '<a[^>]*>([a-z0-9:\-]*)</a>',
      'FirstFiled' => '([0-9\/]*)',
      'NOS'        => '(\d*)',
      'CaseClosed' => '([0-9\/]*)',
      'CaseTitle'  => '(.*)',
    );
    // wrap terms in table syntax
    $pat = '#<tr>(<td[^>]*>'.
      implode('</td>)(</tr><tr>)?(<td[^>]*>',$pat).
      '</td>)</tr>#iUx';
    if (preg_match_all ($pat,$this->DocketText,$matches, PREG_PATTERN_ORDER))
+1  A: 

Your question doesn’t seem to pertain to regular expressions themselves, but only the syntax generally used to express regular expressions. Among many hardcore coders, this syntax has come to be accepted as pretty succinct and powerful, but for longer regular expressions it is actually really unreadable and unmaintainable.

Some people have already mentioned the “x” flag in Perl, which helps a bit, but not much.

I like regular expressions a lot, but not the syntax. It would be nice to be able to construct a regular expression from readable, meaningful method names. For example, instead of this C# code:

foreach (Match match in Regex.Matches(input, @"-?(?<number>\d+)"))
{
    Console.WriteLine(match.Groups["number"].Value);
}

you could have something much more verbose but much more readable and maintainable:

int number = 0;
Regex r = Regex.Char('-').Optional().Then(
    Regex.Digit().OneOrMore().Capture(c => number = int.Parse(c))
);
foreach (var match in r.Matches(input))
{
    Console.WriteLine(number);
}

This is just a quick idea; I know there are other, unrelated maintainability issues with this (although I would argue they are fewer and more minor). An extra benefit of this is compile-time verification.

Of course, if you think this is over the top and too verbose, you can still have a regular expression syntax that is somewhere in between, perhaps...

instead of:   -?(?<number>\d+)
could have:   ("-" or "") + (number = digit * [1..])

This is still a million times more readable and only twice as long. Such a syntax can easily be made to have the same expressive power as normal regular expressions, and it can certainly be integrated into a programming language’s compiler for static analysis.

I don’t really know why there is so much opposition to rethinking the syntax for regular expressions even when entire programming languages are rethought (e.g. Perl 6, or when C# was new). Furthermore, the above very-verbose idea is not even incompatible with “old” regular expressions; the API could easily be implemented as one that constructs an old-style regular expression under the hood.
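To make the "under the hood" point concrete, here is a small Python sketch of such a builder. The `Pattern` class and its method names are hypothetical, loosely mirroring the C# idea above; all it does is assemble an ordinary regex string and hand it to `re`.

```python
import re


class Pattern:
    """Tiny, hypothetical builder that emits ordinary regex syntax."""

    def __init__(self, source):
        self.source = source

    @classmethod
    def char(cls, literal):
        # re.escape guards against the literal being a regex metacharacter.
        return cls(re.escape(literal))

    @classmethod
    def digit(cls):
        return cls(r"\d")

    def optional(self):
        return Pattern(f"(?:{self.source})?")

    def one_or_more(self):
        return Pattern(f"(?:{self.source})+")

    def capture(self, name):
        return Pattern(f"(?P<{name}>{self.source})")

    def then(self, other):
        return Pattern(self.source + other.source)

    def compile(self):
        return re.compile(self.source)


signed_number = Pattern.char("-").optional().then(
    Pattern.digit().one_or_more().capture("number")
)
regex = signed_number.compile()
print(signed_number.source)
print([m.group("number") for m in regex.finditer("3 apples, -12 pears")])
```

Because every step returns a new `Pattern`, intermediate pieces can be named, reused and tested separately, and the generated string is still available for anyone who prefers the old syntax.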

Timwi