It strikes me that regular expressions are not understood well by the majority of developers. It also strikes me that for a lot of problems where regular expressions are used, a lump of code could be used instead. Granted, it might be slower and run to 20 lines for something like email validation, but if the performance of the code is not desperately important, is it reasonable to assume that not using regular expressions might be better practice?

I'm thinking in terms of maintenance of the code rather than straight line execution time.

+41  A: 

Maintaining one regular expression is a lot less effort than maintaining 20 lines of code. And you underestimate the amount of code needed - for a regex of any complexity, the replacement code could easily be 200 rather than 20 lines.

anon
Don't forget one has to decipher what the regex is doing. If one can read and understand those 20 lines as easily as reading an essay, then it wouldn't be a problem.
ghostdog74
+1 for 200 LOC. Regexes can abstract away *extremely* complex state machines.
Rex M
But regular expressions can also get extremely complex. Use the right tool for the right job: http://blogs.msdn.com/oldnewthing/archive/2006/05/22/603788.aspx
Romulo A. Ceccon
ghostdog74: this applies to any language - the problem is people are too inclined to shove complex regexes into single lines with no comments/spacing - you don't do that with other languages, so don't do it with regex!
Peter Boughton
@Peter Boughton: I think part 1 of the reason for that is the same as for code golfing - trying to be cool. Part 2 of the reason is that regex knowledge comes in a wide range. Things that are obvious and easy-peasy for one guy can look like an incomprehensible mess to somebody else.
Tomalak
Yeah I know - it's just a pity that people's only exposure tends to be from the trying to be cool crowd. Part 2 would be no worse than regular code if more programmers were taught regex* as a basic skill, instead of it being treated as esoteric cryptic gobbledygook.(*and other 'tool' languages/technologies)
Peter Boughton
+1  A: 

Due to the type of apps I build, the only RegEx's I regularly use are for email validation, html stripping, and character stripping to remove the garbage around phone numbers.

It's rare that I need to do very much string manipulation other than concatenation.

Incidentally, the apps are typically CRM's.

So the hassle for me is limited to googling for a regex in the event I find myself in need. ;)

Chris Lively
+4  A: 

You raise a very good point with regard to maintainability. Regular expressions can require some deciphering to understand, but I doubt the code which would replace them would be easier to maintain. Regular expressions are VERY powerful and a valuable tool. Use them, but use them carefully, and think about how to make the intent of each regular expression clear.

Regards

Howard May
+17  A: 

Whenever I use a regex I always try to leave a comment explaining exactly how it's structured, because I agree with you that not all developers understand them, and going back to a regex, even one you've written yourself, can be a headache to understand again.

That said, they definitely have their uses. Try stripping out all html elements from a box of text without it!

Fermin
Problem is - it's hard to find the sweet spot of explaining enough without being too verbose. Complex regex can beat your ability to fully explain them, PLUS when you change the regex you'll have to adapt a lengthy explanation. Apart from that, people without any regex understanding will not be able to follow you no matter how hard you try to explain.
Tomalak
The secret is explaining WHAT the regex matches (this part matches the prefix... now this matches the name... we can repeat prefix + name three to five times...) and not HOW it matches. I can write really long regexes and still be satisfied with their readability and maintainability.
Massa
This is only doable as long as your regex is of the "matches some stuff/quite straight-forward" type. As soon as it gets dirty with lots of look-ahead/look-behind, nesting, lazy/possessive quantifiers, different evaluation paths and so on, explaining WHAT it matches will not be enough anymore.
Tomalak
You are forgetting about "Verbose regexes". They are supported by most languages and are a real boon (see for example http://www.diveintopython.org/regular_expressions/verbose.html)
exhuma
If you mean "HTML tags", yes, that's probably doable (unless there are escaped `<` or `>` inside some tags). Elements---use a parser.
Svante
+1  A: 

I see regex as a fast, readable and preferable way to perform pattern matching on string data. So many languages support regex for this reason. If you wanted to write string manipulation code to match say, a Canadian zip code, be my guest, but the regex equivalent is so much more succinct. Definitely worth it.

Ash Machine
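
To make the comparison in the answer above concrete, here is a minimal sketch of both approaches for a Canadian postal code. It is an illustration, not code from the thread. It assumes only the general shape (letter, digit, letter, optional space, digit, letter, digit) and does not exclude the letters that real postal codes never use. The function names are hypothetical.

```python
import re
import string

# Assumption: shape check only, e.g. "K1A 0B1" or "K1A0B1".
POSTAL_CODE = re.compile(r"[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]")


def is_postal_code(text):
    """Regex version: one pattern, one call."""
    return POSTAL_CODE.fullmatch(text) is not None


def is_postal_code_by_hand(text):
    """The same check written as plain string manipulation."""
    # Drop the single optional space between the two halves.
    if len(text) == 7 and text[3] == " ":
        text = text[:3] + text[4:]
    if len(text) != 6:
        return False
    # Even positions must be ASCII letters, odd positions ASCII digits.
    for index, char in enumerate(text):
        expected = string.ascii_letters if index % 2 == 0 else string.digits
        if char not in expected:
            return False
    return True
```

Both functions should agree on every input; the difference is how much code has to be read to see what is being matched.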
+9  A: 

I'm thinking in terms of maintenance of the code rather than straight line execution time.

Code size is the single most important factor in reducing maintainability.

And while Regexps can be very hard to decipher, so are 50 line string processing methods - and the latter are more likely to contain bugs in rare corner cases.

The thing is: any non-trivial regexp must be commented just as thoroughly as you'd comment a 50 line method.

Michael Borgwardt
In terms of maintenance, unit testing is a great boon where complex regexes are concerned. Rather than embed complex regexes in the middle of your program logic, extract them into separate class/functions and add unit tests. That reduces the maintenance overhead considerably for other developers and also provides useful documentation as to what the regex actually does and where the corner cases are.
the_mandrill
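
A minimal sketch of what the comment above describes: the regex is pulled out into its own function, and the unit tests double as documentation of the corner cases. The pattern and the names `DECIMAL`, `is_decimal` and `IsDecimalTest` are hypothetical, chosen only for illustration.

```python
import re
import unittest

# Hypothetical example pattern: a decimal number such as "3.14" or "-2.".
DECIMAL = re.compile(r"-?[0-9]+\.[0-9]*")


def is_decimal(text):
    """Return True when the whole string is a decimal number."""
    return DECIMAL.fullmatch(text) is not None


class IsDecimalTest(unittest.TestCase):
    def test_accepts_typical_values(self):
        for value in ("3.14", "-0.5", "10."):
            self.assertTrue(is_decimal(value), value)

    def test_rejects_corner_cases(self):
        # Each rejected value records a decision a maintainer can rely on.
        for value in ("", ".5", "3", "3.1.4", "abc"):
            self.assertFalse(is_decimal(value), value)
```

If this were saved in a module, the tests could be run with `python -m unittest`. Anyone changing the pattern later gets immediate feedback on which cases they broke.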
+31  A: 

Professional developers should be familiar with basic syntax

At the very least. In all the years I've been a professional developer I haven't come across a developer who didn't know what regular expressions are. It's true, not everybody likes using them or is very good at their syntax, but that doesn't mean one shouldn't use them. Developers should learn the syntax and regular expressions should be used.

It's like: "Ok. We have Lambda expressions, but who cares, I can still do it the old fashioned way."

Not learning key aspects of professional development is pure laziness and shouldn't be tolerated for too long.

Robert Koritnik
Absolutely. I don't think there's any excuse to not be familiar with the basic syntax.
Skilldrick
Too long? Shouldn't be tolerated, full stop.
Kirk Broadhurst
+2  A: 

With great power comes great responsibility!

Regular expressions are great, but there can be a tendency to over-use them! They are not suitable in all cases!

Stevo3000
+1 - also: "People fear what they do not understand." Most challenging things in modern society should be approached with a combination of the Spiderman- and X-Men-principles.
David Berger
A: 

Read the section under "Using Benchmarks" at JavaWorld.

Sure, regular expressions are a very helpful tool, but I agree that they are overused and overcomplicate what can easily be a simple solution.

That being said, you should use regular expressions whenever the situation calls for it. Some things, such as searching for text in a string, can just as easily be done with an iterative search (or using the API searches), but for more complex situations you need regular expressions.

amischiefr
Searching for text in a string is exactly what regular expressions are good for (in some situations, something like Knuth-Morris-Pratt or Boyer-Moore may be better).
Svante
A: 

Surely all code needs to be optimized where possible!

In the context where code need not be optimized, and the logic will need to be maintained then it is down to the skill set of the team.

If the bulk of the team responsible for the code is regEX savvy then do it with a regEX. Else write it in the way the team is likely to be most comfortable with.

abe
I would argue that your team members need to learn regular expressions. They're *that* essential.
Stuart Branham
No, code needs to be optimized only where you have proven that it matters. If this were not the case we would all be coding in assembler.
Chas. Owens
I hope a day will come when my code will never need refactoring and every member of my team knows and understands every aspect of programming ;)
abe
+1  A: 

In .NET regexes you can have comments, break them up into multiple lines, use indenting etc. (I don't know about other dialects...)

Use the "ignore pattern whitespace" setting (RegexOptions.IgnorePatternWhitespace), and either # for commenting out the rest of the line, or "(?#comment)" in your pattern...

So if you wanted to, you can actually make them sort of readable/maintainable...

Arjan Einbu
A: 

I just ran into this issue. I built a regular expression to pull out groups of data from a long string of numbers and some other noise. The regex was quite long, though concise, and it got even bigger when I tried to add it to the C# app I was writing. In total the regex was 3 lines of code.

However, it was painful to look at after I escaped it for C#, and the other developers I work with don't understand regular expressions. I ended up stripping out most of the noise characters and splitting on space to get the groups of data. Very simple code and only 5 lines.

Which is better? My ego says Regular Expressions. Any new hire would say character stripping.

JustSmith
A: 

VB.net is best, No, C# is, No F# is the best. It's really more a matter of what the people maintaining the code will be better suited to handle, in my opinion. That's more a flame question than something that is absolutely answerable.

Personally I'd choose regex whenever there's complex string validation (phone numbers, emails, SS#, IP addresses) where there are well-known regexes out there. Get it from regex.org, give attribution with a comment and/or get the author's permission, whichever is appropriate, and be done with it.

Also, for extracting pieces of a string, or complex splitting of strings, regex can be a great time saver.

But if you're writing your own, rather than using someone else's, using something like RegexBuddy or Sells Brothers' RegexDesigner is a must for testing and validation.

gjutras
A: 

I would never wish for fewer options in programming. Regular expressions can be very powerful, but do require skill. I like problems that can be solved in a few lines of code. It is really cool how many elements of validation can be accomplished. As long as the code is commented on what the expression checks for, I do not see a problem. I also have never seen a professional programmer not know what a regex was. It is another tool in the tool box.

Troggy
+2  A: 

In my opinion, it might make more sense to enforce better practices when using regular expressions rather than forgoing them altogether.

  • Always comment your regular expressions. You might know what it does now, but someone else might not and even you might not remember in two weeks. Moreover, descriptive comments should be used, stating exactly what the regular expression is meant to do.
  • Use unit testing. Create unit tests for your regular expressions, so you can have a degree of assurance as to the reliability and correctness of your regular expression statement. And if the regex is being maintained, the tests ensure that code changes do not break existing functionality.

Using regular expression has some advantages:

  • Time. You don't have to write your own code to do exactly what is built in.
  • Maintainability. You have to maintain only a couple of lines as opposed to 30 or 300
  • Performance. The code is optimized
  • Reliability. If your regex statement is correct, it should function correctly.
  • Flexibility. Regex gives you a lot of power which is very useful if used properly
Aidan
A: 

Regex is one tool among many. But as many craftsmen will attest, the more tools you have at your disposal, and the more skilled you are at using them, the more likely you will become a Master Craftsman.

Is Regex worth the hassle to you? Dunno. Depends how seriously you take what you do.

MikeB
+1  A: 

I would just like to add that unit testing is the ideal way to make your regular expressions maintainable. I consider Regex an essential developer skill that is always a practical alternative to writing many lines of string manipulation code.

apathetic
documentation through unit-tests!
Michael Paulukonis
+6  A: 

Regular expressions are a domain-specific language: no generic programming language is quite as expressive or quite as efficient at doing what regular expressions do with string matching. The sheer size of the lump of code you will have to write in a standard programming language (even one with a good string library) will make it harder to maintain. It is also a good separation-of-concerns to make sure that the regular expression only does the matching. Having a code blob that basically does matching, but does something else in-between can produce some surprising bugs.

Also note that there are mechanisms to make regular expressions more readable. In Python you can enable verbose mode, which allows you to write things like this:

import re

a = re.compile(r"""\d +  # the integral part
               \.    # the decimal point
               \d *  # some fractional digits""", re.X)

Another possibility is to build the regular expression up from adjacent string literals, one per line, and comment each line, like this:

a = re.compile(r"\d+"  # the integral part
               r"\."   # the decimal point
               r"\d*"  # some fractional digits
               )

This is possible in different ways in most programming languages. My advice is to keep using regular expressions where appropriate, but treat them like you do other code. Write them as clearly as possible, comment them and test them.

Joakim Lundborg
+1  A: 

It's a lot easier to see at first glance that a regex is probably correct. Why would I write a long state machine in code (probably containing bugs at first) when I could write a simple one line regex?

Regexes may be considered "write only", but I think that is sometimes a benefit. When writing a relatively simple regex from scratch, it's pretty easy to get it right.

Zifre
+1  A: 

True, learning to decipher regexes is difficult -- but so is learning to decipher the hosting program code in the first place. But is that so difficult, that we would rather write out manual instruction for a person to perform? No -- because that would be ridiculously longer and complicated. Same thing for not using a properly-formed regex.

Michael Paulukonis
+1  A: 

I've found with reg ex it's easier to maintain, but fine tuning someone else's reg ex is a bit of a pain. I think you underestimate the developers by saying most people don't understand it. Usually what I found is that over time, requirements adjust, and the regex that used to validate something is no longer effective and attempting to remove portions that are no longer valid is harder than to just rewrite the entire thing.

Also, imagine if you were validating phone numbers, and you decided to use code instead of reg ex. So it amounts to let's say 20 lines. Over time, your company decides to expand to other regions where now the phone validation is no longer totally true. So you have to adjust it to fit the other requirements. It could be possible that the code would be harder to maintain because you have to adjust over 20 lines of code rather than simply removing the old reg ex, and replacing it with a new one.

However, I think code can be used in certain cases along with regex. For example, let's say you want to validate US phone numbers. In every case the number has 10 digits, but there are literally a ton of ways to write it out: (xxx) xxx-xxxx, or xxx-xxx-xxxx, or xxx xxx xxxx, etc. So if you write a single regex, you'd have to account for each of those cases. However, if you just strip all non-numeric characters (including spaces) with a regex replace, then make a second pass and check that 10 digits remain, you'd find it easier than accounting for each and every possible way to write a phone number.

Daniel
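
The two-pass approach from the answer above can be sketched as follows. This is an illustration under stated assumptions, not the answerer's code: the function name `normalize_us_phone` is hypothetical, and a leading country code such as "1" is deliberately not handled.

```python
import re


def normalize_us_phone(raw):
    """Return the 10 digits of a US phone number, or None if invalid."""
    # First pass: a regex replace drops everything that is not a digit.
    digits = re.sub(r"[^0-9]", "", raw)
    # Second pass: plain code checks that exactly 10 digits remain.
    return digits if len(digits) == 10 else None
```

The regex stays trivial because it no longer has to describe every punctuation style; the only rule left to maintain is the length check. The trade-off is that any punctuation at all is accepted.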
+2  A: 

Think of regular expressions as the lingua franca of string processing. You simply need to know them if you are going to code in a professional capacity. Unless you just write SQL maybe.

Stuart Branham
A: 

No, regexes are not worth the hassle. As David Hasselhoff says, don't hassle the hoff!

Kinopiko
Would you care to explain why you think regexes aren't worth it?
Neil Aitken
+1  A: 

The most hassle I see is when people try to parse non-regular languages with regular expressions (yes, that includes all programming and many markup languages, yes, also HTML). I sometimes wish all coders had to demonstrate that they have understood at least the difference between context-free and regular languages before they are allowed to use regular expressions. Alternatively, they could get their regex license revoked when they are caught trying to parse non-regular languages with them. Yes, I'm joking, but only half.

The next problem arises when people try to do more than character matching in a regular expression, for example, checking for a valid date, perhaps even including leap year considerations (this could also lead to regex license revocation).
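
As a minimal sketch of that division of labour (an illustration, with the hypothetical names `DATE_SHAPE` and `parse_iso_date` and an assumed `YYYY-MM-DD` input format): the regex checks only the character pattern, and the calendar rules are left to the standard library.

```python
import re
from datetime import date

# The regex only checks the shape: four digits, two digits, two digits.
DATE_SHAPE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def parse_iso_date(text):
    """Return a date for 'YYYY-MM-DD' text, or None if it is not a real date."""
    match = DATE_SHAPE.fullmatch(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        # date() knows month lengths and leap years, so the regex need not.
        return date(year, month, day)
    except ValueError:
        return None
```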

Regular expressions really are just a convenient shorthand for a finite state automaton (You know what that is, don't you? Where is your regex license, please?). The problems come from people expecting some kind of magic from them, not from the regular expressions themselves.

Svante
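
To illustrate the "shorthand for a finite state automaton" point, here is a minimal sketch that writes out by hand the state machine for the decimal-number pattern used in an earlier answer (digits, a decimal point, optional digits). The state names and the function `matches_by_automaton` are hypothetical, and the digit classes are restricted to ASCII for simplicity.

```python
import re

# One line of regex...
DECIMAL = re.compile(r"[0-9]+\.[0-9]*")


def matches_by_automaton(text):
    """...and a hand-written state machine that accepts the same strings."""
    state = "start"
    for char in text:
        is_digit = "0" <= char <= "9"
        if state == "start" and is_digit:
            state = "integral"
        elif state == "integral" and is_digit:
            state = "integral"
        elif state == "integral" and char == ".":
            state = "fraction"
        elif state == "fraction" and is_digit:
            state = "fraction"
        else:
            return False  # no transition defined for this input: reject
    return state == "fraction"  # the only accepting state
```

Even for a three-state machine the explicit version is an order of magnitude longer than the pattern, which is the size argument made at the top of the thread. It also shows the limit: there is no stack or counter here, so nested structures such as HTML elements are out of reach.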