views: 694

answers: 16

My personal experience is that regexs solve problems that can't be efficiently solved any other way, and are so frequently required in a world where strings are as important as they are that not having a firm grasp of the subject would be sufficient reason for me to consider not hiring you as a senior programmer (a junior is always allowed the leeway of training).

However.

A number of responses on the recurrent "What's the regex for this?" type-questions suggest that a great many coders find them somewhere between unintelligible and opaque.

This is not about whether a simple indexOf or substring is a better solution. That's a technical matter: sometimes the simple way is correct, sometimes a regex is, and sometimes neither (looking at you, HTML parser questions).

This is about how important it is to understand Regexs and whether the anti-Regex opinion (that trite "...now they have two problems" thing) is merited or FUD.

Should a programmer be expected to understand Regexs? Is this a required skill?


edit: just in case it isn't clear, I'm not asking whether I need to learn them (I'm a defender of the faith) but whether the anti-camp are an evolutionary dead end or whether it's an unnecessary niche skill like InstallShield.

+23  A: 

REs let you solve relatively complex problems that would otherwise require you to code up full parsers with backtracking and all that messy sort of stuff. I liken the use of REs to using chainsaws to chop down a tree instead of trying to do it with a piece of celery.

Once you've learned how to use the chainsaw safely, you'll never go back. People who continue to spout anti-RE propaganda will never be as productive as those of us who have learned to love them.

So yes, you should know how to use REs, even if you understand only the basic constructs. They're a tool just like any other.

paxdiablo
I guess you're the one guy/team who does understand them ... fully
MrTelly
I just got sick of trying to cut the tree down with celery :-)
paxdiablo
On the other hand there are plenty of people who use REs in the same way as using a chainsaw to cut down *celery*. I have seen someone *seriously* suggest using a regex to check whether or not a string was of length 3 (in C#). They then got the regex wrong.
Jon Skeet
"^...$" vs. "len(str) == 3" - yes, they can be abused somewhat :-) I love the fact they got it wrong.
paxdiablo
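A minimal Python sketch of why the regex version of the length check is easy to get wrong. The original case was C#, and the helper names here are made up for illustration. In Python's `re`, `.` does not match a newline by default and `$` also matches just before a trailing newline, so `^...$` and a plain length check disagree on some inputs:

```python
import re

# The pattern quoted in the comment above: three "any" characters between anchors.
THREE_CHARS = re.compile(r"^...$")


def is_len3_regex(s: str) -> bool:
    """Length check done with the regex (hypothetical helper)."""
    return THREE_CHARS.match(s) is not None


def is_len3_plain(s: str) -> bool:
    """Length check done the obvious way."""
    return len(s) == 3


for sample in ["abc", "abcd", "abc\n", "a\nc"]:
    # "abc\n" has length 4 but the regex accepts it ($ matches before a final newline).
    # "a\nc" has length 3 but the regex rejects it (. does not match a newline).
    print(repr(sample), is_len3_regex(sample), is_len3_plain(sample))
```

The two helpers agree on the first two samples and disagree on the last two, which is the kind of subtlety that a plain `len(s) == 3` never raises.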
It’s important to know what’s possible and what’s impossible.
Gumbo
celery - that's a funny one! Seriously though, Jon Skeet is right, but I definitely agree it's very very important for situations where you manipulate a lot of text.
Ray Hidayat
see clarification - knowing *when* is not the question, knowing *at all* is the question
annakata
Good answer, I'd love to see some examples of problems that are solved so much quicker with regexes.
jandersson
@jandersson: Find every number in a string. With regex: "/\d+/g" vs. no regex: "god knows what you'll come up with". After you came up with something: Tweak the condition a little: Find every number with at least three digits in a string. With regex: "/\d{3,}/g" vs. "back to the drawing board"...
Tomalak
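The comment above uses JavaScript-style `/.../g` literals. A rough Python equivalent, with a made-up sample string, might look like the sketch below. A hand-rolled version using `itertools.groupby` is included for contrast; it is only one of many ways to do it without a regex.

```python
import re
from itertools import groupby

text = "Order 7 shipped 120 units on 2009-02-06 to bay 42"  # sample input (assumption)

# With a regex: every run of digits, then every run of at least three digits.
all_numbers = re.findall(r"\d+", text)
long_numbers = re.findall(r"\d{3,}", text)

# Without a regex: group consecutive characters by "is this a digit?".
runs = ["".join(chars) for is_digit, chars in groupby(text, key=str.isdigit) if is_digit]
long_runs = [run for run in runs if len(run) >= 3]

print(all_numbers)   # ['7', '120', '2009', '02', '06', '42']
print(long_numbers)  # ['120', '2009']
```

Changing the requirement means editing one quantifier in the regex version, whereas the hand-written version needs an extra filtering step.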
+3  A: 

In Steve Yegge's article, Five Essential Phone Screen Questions, you should read the section "Area Number Three: Scripting and Regular Expressions".

Steve Yegge has some interesting points. He gives real world problems he has encountered with clients having to parse 50,000 files for a particular pattern of a phone number. The applicants who know regular expressions tear through the problem in a few minutes while those who don't write monster multi-hundred line programs that are very unwieldy. This article convinced me I should learn regular expressions.

artknish
If you want to revert my edits you can. I just added a little more info to your post. Good article and thanks for the link!
Simucal
agreed, that's an excellent article
annakata
@Simucal: No problem, in fact, thanks to you! The answer looks better now, I wish there was some way to up-vote good edits :)
artknish
This just points to Jon Skeet's [answer](http://stackoverflow.com/questions/519929/how-important-is-knowing-regexs/519993#519993) where he says "think of using a regular expression when an actual *pattern* is involved". Finding phone numbers is typical pattern matching. Basic examples of RegEx use often pick phone number patterns.
awe
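For a feel of the phone-number task discussed in this answer, here is a small Python sketch. The two accepted formats, `(xxx) xxx-xxxx` and `xxx-xxx-xxxx`, and the sample lines are assumptions made for illustration, not the exact problem statement from the article.

```python
import re

# Hypothetical formats: "(425) 555-0100" or "206-555-0199".
PHONE = re.compile(r"(?:\(\d{3}\) |\b\d{3}-)\d{3}-\d{4}\b")

# Stand-in for the contents of many files (made-up data).
lines = [
    "Call (425) 555-0100 or 206-555-0199 before noon",
    "Part number 12345-678-9012x is not a phone number",
    "No digits here at all",
]

for line_no, line in enumerate(lines, start=1):
    for number in PHONE.findall(line):
        print(f"line {line_no}: {number}")
```

The word boundaries keep the pattern from matching inside a longer run of digits or letters, which is why the part number on the second line is skipped.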
A: 

A developer thought he had one problem and tried to solve it using regex. Now he has 2 problems.

And when he solves the first one, he'll solve 'em both THEN be much more productive afterwards. Yes, there are things REs are no good for but, given the right problem, they'll beat the living daylights out of any other solution.
paxdiablo
congratulations for being the first person to wheel out that hackneyed expression I already mentioned disparagingly in the question...
annakata
+2  A: 

Not a brilliant answer but everywhere I've worked the following holds true

0 < Number of people who (fully) understand regex < 1

If I knew how to do it I'd write that previous expression as a regex, but I can't. The best I could come up with on the fly is s/fully/a little/g - that's my limit (and that's probably not a regex).

A more serious answer is that the right regex will solve all kinds of problems, with one(ish) line of code. But you'll have real problems debugging it if it goes wrong. Therefore IMHO a complex regex, however 'clean/clever', is a liability. If it takes ten lines of code to replicate it, why's that a problem? Is memory/disk space suddenly expensive again?

BTW I'd love to know if regexs are fast compared to the equivalent code.

MrTelly
"^0{0,1}\.0*[1-9][0-9]*$"
paxdiablo
regexs are *typically* slower for a simple test, and about equal for a complex replace (they can avoid string concats)
annakata
My point entirely - at the start of the line, for zero or one times escape a fullstop aarrrgghh my head's exploding ....
MrTelly
Memory is not expensive, but time is. If a search takes more than 2 seconds, it will feel slow to the user - even a complex search through a large dataset.
awe
+5  A: 

As a developer you should know the pros and cons of as many tools as possible that could provide pre-made solutions for your problems. Every developer should know how to work with regular expressions and have a feeling for when they should be used and when it is better to use simple string functions to achieve a goal.

Rejecting them outright because they are hard to read is not an option in my opinion. A developer who thinks so strips himself of a valuable tool for searching and validating complex string patterns.

Sebastian Dietz
thank you, this is the kind of answer I was looking for
annakata
YES! I think regex is cryptic, but if I recognize a situation where it is better to use it, I seek help to implement it instead of trying to work around it.
awe
+2  A: 

It is not clear what kind of answer you are expecting.

I can imagine roughly three kinds of answer to this question:

  1. Regexen are essential to the education of professional programmers. They enable the use of the powerful unix shell tools, and regex-based search-replace can dramatically cut down on text-munging handiwork that is a part of a programmer's life. Programmers that do not know regexen are just intellectually lazy, which is a very bad trait for a programmer.

  2. Regexps are kinda useful depending on the application domain. Surely, knowing how to write regexps is a valuable tool in a programmer's chest, but most of the time you can do fine without using them. Also, regexps tend to be very hard to read, so abuse must be strongly discouraged.

  3. Some nutcases like to put regexs in everything (I'm looking at you, the perl guy who implemented a regex-based tetris in perl). But really, they are just a bit of computer science trivia whose only practical use is in writing parsers. They are widely taught because they make a good teaching topic on which to evaluate students, and like most such topics it can be forgotten the second you step out of the exam room.

You will notice the careful use of the plural forms "regexen" (pro), "regexps" (careful neutral) and "regexs" (con).

Personally, I am of the first kind. Good programmers like to learn new languages, and they hate repetitive handiwork.

ddaa
"implemented a regex-based tetris in perl"? I don't even know where I'd begin with that one.
paxdiablo
I'm wondering whether your plural forms are some kind of standard?
annakata
People used to call clusters of VAX machines "VAXen" - not sure where that originated.
paxdiablo
"whose only practical use is in writing parsers" - that's a joke i hope?
Henk Holterman
None of the points above reflect my opinion, which is a nuanced mix of the three. They are exaggerated for the sake of making a point.
ddaa
@Pax: probably from 'oxen', plural of 'ox'.
DisgruntledGoat
+15  A: 

There are some tasks where regular expressions are the best tool to use.
There are some tasks where regular expressions are pointlessly obscure.
There are some tasks where they're reasonably appropriate, but a different approach may be more readable.

In general, I think of using a regular expression when an actual pattern is involved. If you're just looking for a specific string, I wouldn't generally use a regex. As an example of a grey area, someone once asked on a newsgroup the best way to check whether one string contained any of a number of other strings. The two ways which came up were:

  • Build a regex with alternatives and perform a single match.
  • Test each string in turn with string.Contains.

Personally I think the latter way is much simpler - it doesn't require any thought about escaping the strings you're looking for, or any other knowledge of regular expressions (and their different flavours across different platforms).
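The two approaches can be sketched in Python as follows. The answer refers to C#'s `string.Contains`; the function names and the sample search strings here are made up for illustration, and the list of search strings is assumed to be non-empty.

```python
import re

NEEDLES = ["a.b", "1+1"]  # sample search strings containing regex metacharacters


def contains_any_plain(haystack: str, needles: list[str]) -> bool:
    """Test each string in turn with a plain substring check."""
    return any(needle in haystack for needle in needles)


def contains_any_regex(haystack: str, needles: list[str]) -> bool:
    """Build one regex with alternatives; each needle must be escaped first."""
    pattern = "|".join(re.escape(needle) for needle in needles)
    return re.search(pattern, haystack) is not None


def contains_any_unescaped(haystack: str, needles: list[str]) -> bool:
    """The same idea with the escaping forgotten (deliberately wrong)."""
    return re.search("|".join(needles), haystack) is not None


print(contains_any_plain("x = a.b", NEEDLES))   # True
print(contains_any_regex("x = a.b", NEEDLES))   # True
print(contains_any_unescaped("axb", NEEDLES))   # True, a false positive: "." matched "x"
```

The plain version has no escaping step to forget, which is the point made in the paragraph above.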

As an example of somewhere that regular expressions are quite clearly the wrong choice, someone seriously proposed using a regular expression to test whether or not a string was three characters long. Their regular expression didn't even work, despite them claiming that the reason they thought of regular expressions first is because they'd been using them for so long, and that they naturally sort of "thought" in regular expressions.

There are, however, plenty of examples where regular expressions really do make life easier - as I say, when you're actually matching patterns: "I want one letter, then three digits, then another letter" or whatever. I don't find myself using regular expressions very often, but when I do use them, they save a lot of work.

In short, I believe it's good to know regular expressions - but equally to be careful about when to use them. It's easy to end up with write-only code which could be made simpler to understand by rewriting with simple string operations, even if the resulting code is slightly longer.

EDIT: In response to the edit of the question...

I don't think it's a good idea to be evangelical about them - in my experience, that tends to lead to using them where an alternative would be simpler, and that just makes you look bad. On the other hand, if you come across someone writing complicated code to avoid using a regular expression, it's fine to point out that a regex would make the code simpler.

Personally I like to comment my regular expressions in quite a detailed way, splitting them up onto several lines with a comment between each line. That way they're easier to maintain, and it doesn't look like you're just trying to be "hard core" geeky (which can be the impression, even if it's not the actual intended aim).
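As a sketch of that commenting style, here is the "one letter, then three digits, then another letter" pattern written with Python's `re.VERBOSE` flag, which ignores whitespace in the pattern and allows a comment on each line. The answer does not say which language or pattern was used, so the name `CODE` and the ASCII-only character classes are assumptions.

```python
import re

CODE = re.compile(
    r"""
    [A-Za-z]    # one letter
    [0-9]{3}    # then three digits
    [A-Za-z]    # then another letter
    """,
    re.VERBOSE,
)

for candidate in ["A123b", "A12b", "AB123c"]:
    # fullmatch requires the whole string to fit the pattern.
    print(candidate, CODE.fullmatch(candidate) is not None)
```

Only the first candidate matches; the other two have too few digits or an extra leading letter.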

I think the most important thing is to remember that short != readable. Never claim that using a regex is better because it requires less code - claim that it's better when it's genuinely simpler and easier to understand (or where there's a significant performance benefit, of course).

Jon Skeet
I was going to clarify, but I think I'll re-edit into the OP
annakata
Thank you for your balanced response
Brad
The downside with a long good answer with multiple good points, is that I tend to want to up-vote the answer more than once...
awe
+3  A: 

I have really mixed feelings. I have used them and know the bones of the syntax and something in me loves their conciseness. However they are not commonly understood and are a highly obfuscated form of code. I too would like to see performance comparisons against similar operations in plain code. There is no question that the exploded code will be more maintainable and more easily and widely understood, which is a serious consideration in any commercial software project.

Even if they turn out to be more performant, the argument for them taken to its logical conclusion would see us all embedding assembler into our code for important loops - perhaps we should. Neat and concise and very fast, but almost un-maintainable.

On balance I think that until the regex syntax becomes mainstream they probably cause more trouble than they solve and should be used only very carefully.

Simon
I love REs but I have to agree that the syntax is quite awkward, and in order to be understood by developers who do not know them you have to put a lot of effort in documenting them. Maybe it's time for some sort of replacement ^^.
Helper Method
**@Helper Method:** I agree that it's awkward, but introducing a replacement would be hard to get "out there". RE is widely used, and a replacement that is *equally powerful* would probably just be awkward in a different way.
awe
A: 

What does the following do?

"([A-Za-z][A-Za-z0-9+.-]{1,120}:A-Za-z0-9/|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$.+!,;/?:@&~=%-]{0,1000}))?)"

How long did it take you to figure out? to debug?

Regexs are awesome for single-use throwaway programs, but long hairy regexps are not the best choice for programs that other people will need to maintain over the years.

Jose M Vidal
it took me about a minute to read and I'm not sure I can describe what it does in any clearer terms - the pattern it describes is not something that makes any obvious connection with recognisable data to me
annakata
This is a good example for using comments. But debugging would be a mess. By the way: I think you forgot a pair of “[]”.
Gumbo
Writing and reading regular expressions is just like reading/writing programs. So if you write a complex regex, you should document it. This will help others read the expression
f3lix
Gumbo's right, you only have to document what you're trying to achieve with it. Show me the same code in procedural style (your language choice) and see how much of a behemoth it is.
paxdiablo
And I bet it would take me just as long to understand it if you didn't put comments in.
paxdiablo
I think you completely miss the point of regular expressions. You can write huge unmaintainable code in any language. There is no more concise way to parse out relevant data than Regular Expressions. If they get unwieldy you can split them into sub-expressions.
Simucal
If not subexpressions, you can also document it by joining pattern parts (substrings) into a meaningful whole... I do the same with hairy SQL expressions.
Andrew Vit
Splitting the regexp up and documenting parts of it is the way to go.
Helper Method
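One way to read the "split it up and name the parts" advice from these comments, sketched in Python. The date pattern is a made-up stand-in, not the regex from the answer, and the part names are hypothetical.

```python
import re

# Each part gets a name that documents what it is meant to match.
YEAR = r"(?P<year>\d{4})"
MONTH = r"(?P<month>0[1-9]|1[0-2])"       # 01 to 12
DAY = r"(?P<day>0[1-9]|[12]\d|3[01])"     # 01 to 31 (no per-month check)

# The whole pattern reads as a sentence built from the named parts.
ISO_DATE = re.compile(f"{YEAR}-{MONTH}-{DAY}")

match = ISO_DATE.fullmatch("2009-02-06")
if match:
    print(match.group("year"), match.group("month"), match.group("day"))

print(ISO_DATE.fullmatch("2009-13-01"))  # None: 13 is not a valid month
```

Each part can be tested on its own, and a reader who does not know the syntax can still follow the final line.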
+2  A: 

When you have to parse something (ranging from simple date strings to programming languages) you should know your tools and regular expressions are one of them.

But you should also know what you can do with regexes and what not. At this point it comes in handy if you know the Chomsky hierarchy. Otherwise you end up trying to use regular expressions to parse context-sensitive languages and wonder why you can't get your regex right.

f3lix
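A small illustration of that limit, assuming Python's standard `re` module, which has no recursion construct. Balanced parentheses are a classic language above the regular level (context-free rather than context-sensitive, but the same lesson applies): a regex can be written for any fixed nesting depth, while a few lines with a counter handle every depth. The names below are made up for the sketch.

```python
import re

# Handles nesting only up to depth 2, e.g. "()", "(())", "(()())".
UP_TO_DEPTH_2 = re.compile(r"(?:\((?:\(\))*\))*")


def balanced(s: str) -> bool:
    """Check parentheses at any depth with a simple counter."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" appeared before its "("
                return False
    return depth == 0


for sample in ["()", "(()())", "((()))", "(()"]:
    print(sample, UP_TO_DEPTH_2.fullmatch(sample) is not None, balanced(sample))
```

The regex and the counter agree until the nesting reaches depth 3, where the regex gives up on a string that is in fact balanced.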
+1  A: 

The fact that all languages support regexs should mean something !

Learning
+1  A: 

I think knowing regex is quite an important skill. While the usage of regex in a programming environment/language is a question of maintainable code, I find the knowledge of regex to be useful with some commands (say egrep) and editors (vim, emacs etc.). Using a regex to do a find and replace in vim is very handy when you have a text file and you want to do some formatting once in a while.

sateesh
+1  A: 

I find it very useful to know regular expressions. They are a very powerful tool, and in my opinion there are problems that you simply can't solve without these.

I would however not take regular expressions as a killing criterion for "hiring you as a senior programmer". They are like the wealth of other tools in the world. You should really know them in a problem domain where you need them, but you cannot presume that someone already knows all of these.

"a junior is always allowed the leeway of training"

If a senior isn't, then I would not hire him!

To the ones that argue how complex and unreadable a regular expression is: If the regexp solution to a problem is complex and unreadable, then probably the problem itself is! Good luck in solving it another way...

chiccodoro
wrt the killing criterion: see Srikanth's answer
annakata
A: 

I find that regex's can be very helpful depending on the type of programming that you do. However I probably write less than one regex a month, and because of this long interval between requiring regex's I forget a lot about how they work.

I should probably go through Mastering Regular Expressions or something similar someday.

JSmyth
A: 

I agree with pretty much everything said here, and just need to include the mandatory quip:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

(attributed to Jamie Zawinski)

Like most jokes, it contains a kernel of truth.

Evan
see User's answer and comments
annakata
A: 

Knowing when to use a regexp and the basics of how they work and what their limitations are is important. But filling your head with a lot of syntax rules that you probably won't need very often is just a pointless academic exercise.

A regexp crib sheet can be written on one sheet of A4 paper or a couple of pages in a textbook - no need to know this stuff by heart. If you use it every day it will stick. If you don't use it very often then the brain cells are probably better used for something else.

Noel Walters