views: 849

answers: 22

Someone I know has been telling me that RegEx should be avoided because it is heavyweight or involves heavy processing. Is this true? It has been ringing in my ears ever since.

I don't know why he told me that. Could it have been from experience or merely 3rd-hand information (you know what I mean...)?

So, stated plainly, why should I avoid regular expressions?

I'd like the masters in the SO community to share their ideas with me. Thanks guys!

+12  A: 

Overhyped? No. They're extremely powerful and flexible.

Overused? Absolutely. Particularly when it comes to parsing HTML (which frequently comes up here).

This is another of those "right tool for the job" scenarios. Some go too far and try to use it for everything.

You are right though in that you can do many things with substring and/or split. You will often reach a point with those where what you're doing will get so complicated that you have to change method or you just end up writing too much fragile code. Regexes are (relatively) easy to expand.

But hand-written code will nearly always be faster. A good example of this is "Putting char into a java string for each N characters". The regex solution is terser but has some issues that a hand-written loop doesn't, and it is much slower.

cletus
Or really in any kind of activity that can be called "parsing".
Greg Hewgill
A compiled (well-written) regex actually tends to be extremely fast. It's just a state-machine. A lot of the speed issues I think can be chalked up to people not understanding that there can be a fairly sizeable penalty for transforming the string representation of a regex to a compiled regex.
Actually the Perl Regex engine is faster than if you wrote the routine yourself, for all but the simplest of cases. This of course assumes that the Regex was well designed to begin with.
Brad Gilbert
@cletus: I recovered this answer by merging from a deleted question. You may want to adjust your wording slightly to fit this question.
Bill the Lizard
Many thanks, Bill.
cletus
+3  A: 

If more people knew how to use a decent parser generator, there would be fewer people using regular expressions.

+4  A: 

Overhyped? No

Under-Utilized Properly? Yes

cpjolicoeur
I recovered this answer from a deleted question that was the same, but worded slightly differently. You may want to adjust the wording of your answer to match.
Bill the Lizard
+5  A: 

I think that if you learn programming in a language that speaks regular expressions natively, you'll gravitate toward them because they just solve so many problems. I.e., you may never learn to use split because regexec() can solve a wider set of problems, and once you get used to it, why look anywhere else?

On the other hand, I bet C and C++ programmers will for the most part look at other options first, since it's not built into the language.

dicroce
+6  A: 

"When you have a hammer, everything looks like a nail."

Regular expressions are a very useful tool; but I agree that they're not necessary for every single place they're used. One positive factor to them is that because they tend to be complex and very heavily used where they are, the algorithms to apply regular expressions tend to be rather well optimized. That said, the overhead involved in learning the regular expressions can be... high. Very high.

Are regular expressions the best tool to use for every applicable situation? Probably not, but on the other hand, if you work with string validation and search all the time, you probably use regular expressions a lot; and once you do, you already have the knowledge necessary to use the tool probably more efficiently and quickly than any other tool. But if you don't have that experience, learning it is effectively a drag on your productivity for that implementation. So I think it depends on the amount of time you're willing to put into learning a new paradigm, and the level of rush involved in your project. Overall, I think regular expressions are very worth learning, but at the same time, that learning process can, frankly, suck.

McWafflestix
+3  A: 

In my belief, they are overused by people quite a bit (I've had this discussion a number of times on SO).

But they are a very useful construct because they deliver a lot of expressive power in a very small piece of code.

You only have to look at an example such as a Western Australian car registration number. The RE would be

re.match("[1-9] [A-Z]{3} [0-9]{3}")

whilst the code to check this would be substantially longer, in either a simple 9-if-statement or slightly better looping version.
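
To make the comparison concrete, here is a rough Python sketch of both versions. The function names are made up for illustration, and the plate layout (digit, space, three capitals, space, three digits) is simply read off the pattern above; I'm assuming the whole string has to match, hence fullmatch.

```python
import re

# Compiled once at import time; the pattern is the one quoted above.
PLATE_RE = re.compile(r"[1-9] [A-Z]{3} [0-9]{3}")

def is_plate_regex(text):
    """Regex version: the entire string must fit the pattern."""
    return PLATE_RE.fullmatch(text) is not None

def is_plate_manual(text):
    """Hand-written version: the same checks spelled out one by one."""
    if len(text) != 9:
        return False
    if not ("1" <= text[0] <= "9"):
        return False
    if text[1] != " " or text[5] != " ":
        return False
    if not all("A" <= c <= "Z" for c in text[2:5]):
        return False
    return all("0" <= c <= "9" for c in text[6:9])
```

Both functions should agree on any input; the regex states the rule in one line, while the manual version trades length for checks you can step through and comment individually.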

I hardly ever use complex REs in my code because:

  • I know how the RE engines work and I can use domain knowledge to code up faster solutions (that 9-if variant would almost certainly be faster than a one-shot RE compile/execute cycle); and
  • I find code more readable if it's broken up logically and commented. This isn't easy with most REs (although I have seen one that allows inline comments).

I have seen people suggest the use of REs for extracting a fixed-size substring at a fixed location. Why these people don't just use substring() is beyond me. My personal thought is that they're just trying to show how clever they are (but it rarely works).
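
A small Python illustration of that last point. The record layout here is hypothetical (a 10-character date, a space, then a 5-character level field); the only thing it is meant to show is that a slice and a regex return the same thing when the position and width are fixed.

```python
import re

record = "2009-06-15 ERROR disk full"  # hypothetical fixed-layout record

# Slicing: characters 11 through 15 hold the level field.
level_slice = record[11:16]

# Regex doing the same job: skip 11 characters, capture the next 5.
level_regex = re.match(r".{11}(.{5})", record).group(1)
```

The slice says exactly what it does; the regex makes the reader decode a pattern to arrive at the same two numbers.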

paxdiablo
The substring() example is quite true, I also don't understand why some people insist on using regex's all the time.
Alix Axel
+1  A: 

Regular expressions are one of the most useful things a programmer can learn. They let you speed up and shrink your code if you know how to handle them.

Alix Axel
+2  A: 

There is a very good reason to use regular expressions in scripting languages (such as Ruby, Python, Perl, JavaScript and Lua): parsing a string with a carefully optimized regular expression executes faster than the equivalent custom while loop which scans the string character by character. For compiled languages (such as C and C++, and also C# and Java most of the time) usually the opposite is true: the custom while loop executes faster.

One more reason why regular expressions are so popular: they express the programmer's intention in an extremely compact way: a single-line regexp can do as much as a 10- or 20-line while loop.

pts
+1  A: 

Regular expressions are often easier to understand than the non-regex equivalent, especially in a language with native regular expressions, especially in a code section where other things that need to be done with regexes are present.

That doesn't mean they're not overused. The only time string.match(/\?/) is better than string.contains('?') is if it's significantly more readable with the surrounding code, or if you know that .contains is implemented with regexes anyway.

singpolyma
+1  A: 

I often use regex in my IDE to quick fix code. Try to do the following without regex.

glVector3f( -1.0f, 1.0f, 1.0f ); -> glVector3f( center.x - 1.0f, center.y + 1.0f, center.z + 1.0f );

Without regex, it's a pain, but WITH regex...

s/glVector3f\((.*?),(.*?),(.*?)\)/glVector3f(center.x+$1,center.y+$2,center.z+$3)/g

Awesome.
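
For anyone who wants the same rewrite outside an IDE, here is a hedged Python sketch using re.sub with a replacement function. The helper names (offset, recenter) are invented for this example, and it assumes the three arguments are simple literals with no nested calls or commas inside them. Folding a leading minus sign into the operator is my own addition so that the output matches the target line shown above.

```python
import re

# Three lazily captured arguments, with surrounding whitespace ignored.
CALL_RE = re.compile(r"glVector3f\(\s*(.*?)\s*,\s*(.*?)\s*,\s*(.*?)\s*\)")

def offset(axis, value):
    """Format 'center.<axis> +/- value', folding a leading minus into the operator."""
    if value.startswith("-"):
        return f"center.{axis} - {value[1:]}"
    return f"center.{axis} + {value}"

def recenter(line):
    """Rewrite every glVector3f(...) call on the line relative to 'center'."""
    def repl(match):
        x, y, z = match.groups()
        return f"glVector3f( {offset('x', x)}, {offset('y', y)}, {offset('z', z)} )"
    return CALL_RE.sub(repl, line)
```

Usage: recenter("glVector3f( -1.0f, 1.0f, 1.0f );") gives the target line from the example; lines without a glVector3f call pass through unchanged.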

Stefan Kendall
+2  A: 

Overhyped? No. If you have ever taken a parsing or compiler course, you would understand that this is like saying addition and multiplication are overhyped for math problems.

It is a system for solving parsing problems.

Some problems are simpler and don't require regular expressions; some are harder and require better tools.

Unknown
@Unknown: I recovered this answer by merging from a deleted question. You may want to adjust your wording slightly to fit this question.
Bill the Lizard
+1  A: 

I'd agree that regular expressions are sometimes used inappropriately. Certainly for very simple cases like what you're describing, but also for cases where a more powerful parser is needed.

One consideration is that sometimes you have a condition that needs to do something simple like test for presence of a question mark character. But it's often true that the condition becomes more complex. For example, to find a question mark character that isn't preceded by a space or beginning-of-line, and isn't followed by an alphanumeric character. Or the character may be either a question mark or the Spanish "¿" (which may appear at the start of a word). You get the idea.

If conditions are expected to evolve into something that's less simple to do with a plain call to String.contains("?"), then it could be easier to code it using a very simple regular expression from the start.
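
A rough Python sketch of how that evolution might look. The names are hypothetical, and "alphanumeric" is assumed to mean ASCII letters and digits; adjust the character class if your data needs more.

```python
import re

def has_question_mark(text):
    """Simple case: a plain substring test is all you need."""
    return "?" in text

# Evolved case 1: a '?' that is preceded by a non-space character
# (so not at the start of a line and not after a space) and is not
# followed by an ASCII letter or digit.
QUALIFIED_Q = re.compile(r"(?<=\S)\?(?![A-Za-z0-9])")

# Evolved case 2: either '?' or the Spanish opening mark.
ANY_Q = re.compile(r"[?¿]")

def has_qualified_question_mark(text):
    return QUALIFIED_Q.search(text) is not None

def has_any_question_mark(text):
    return ANY_Q.search(text) is not None
```

The substring test cannot grow into either of the other two without being rewritten, which is the argument for starting with the regex when you expect the rule to change.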

Bill Karwin
+1  A: 

It comes down to the right tool for the job.

I usually hear two arguments against regular expressions: 1) They're computationally inefficient, and 2) They're hard to understand.

Honestly, I can't understand how either is a legitimate claim.

1) This may be true in an academic sense. A complex expression can double back on itself many times over. Does it really matter though? How many millions of computations a second can a server processor do these days? I've dealt with some crazy expressions, and I've never seen a regexp be the bottleneck. By far it's DB interaction, followed by bandwidth.

2) Hard for about a week. The most complicated regexp is no more complex than HTML - it's just a familiarity problem. If you needed HTML once every 3 months, would you get it 100% each time? Work with them on a daily basis and they're just as clear as any other language syntax.

I write validation software. Regexps are second nature. Every fifth line of code has a regexp, and for the life of me I can't understand why people make a big deal about them. I've never seen a regexp slow down processing, and I've seen even the most dull 'programmers' pick up the syntax.

Regexps are powerful, efficient, and useful. Why avoid them?

rooskie
+25  A: 

Don't avoid them. They're an excellent tool, and when used appropriately can save you a lot of time and effort. Moreover, a good implementation used carefully should not be particularly CPU-intensive.

Shog9
personally, i like RegEx, saves you a lot of code (and time) when validating text inputs. it might be wiser to sacrifice CPU time for regex than shelling out code (which is bug prone)...
jerbersoft
Right. If you've spent the last twenty years writing parsers, to where you can now write a flawless "long-hand" equivalent to any regex in minutes (with one arm tied behind your back, while blindfolded...) Then by all means, don't bother with them. But for most of us, writing a regular expression is faster than writing the equivalent parsing code, even if we have to look up the syntax while doing so! And even a moderately complicated expression is easier to understand than two pages of nested switch statements...
Shog9
@Shog9: Thanks for the heads up on the duplicate that was deleted. I think the wording of that question was its downfall. The answers are definitely worth salvaging, so I merged them in.
Bill the Lizard
Thanks much, Bill!
Shog9
+3  A: 

Don't avoid them, but ask yourself if they're the best tool for the task you have to solve. Regexes can sometimes be difficult to use or debug, but they're really useful in some situations. The point is to use the appropriate tool for each task, and which one that is usually isn't obvious.

Jonathan
+15  A: 

If you can easily do the same thing with common string operations, then you should avoid using a regular expression.

In most situations, though, regular expressions are used where the same operation would require a substantial number of common string operations. In that case there is of course no point in avoiding regular expressions.

Guffa
Sounds like common sense but people seem to forget this.
xenon
What is the rationale? Why would a precompiled re on a good compiler be much slower than a string operation?
ilya n.
"Common sense is not so common" - Voltaire ;)
Guffa
The cache for compiled regexes is limited, so the more you have of them, the more often they need recompiling. Even when the regex is already compiled, there is still some overhead.
Guffa
+8  A: 

As a basic parser or validator, use a regular expression unless the parsing or validation code you would otherwise write would be easier to read.

For complex parsers (i.e. recursive descent parsers) use regex only to validate lexical elements, not to find them.

The bottom line is, the best regex engines are well tuned for validation work, and in some cases may be more efficient than the code you yourself could write, and in others your code would perform better. Write your code using handwritten state machines or regex as you see fit, but change from regex to handwritten code if performance tests show you that regex is significantly inefficient.
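
One possible reading of "validate, not find", sketched in Python. Everything here (the token classes, the operator set, the function names) is an assumption made for illustration: a hand-written scanner decides where each lexeme starts and ends, and regexes are only used afterwards to check what kind of lexeme it is.

```python
import re

NUMBER = re.compile(r"[0-9]+(\.[0-9]+)?")
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
OPERATORS = set("+-*/()")

def scan(source):
    """Hand-written scanner: finds lexeme boundaries without any regex."""
    lexemes, current = [], ""
    for ch in source:
        if ch.isspace() or ch in OPERATORS:
            if current:
                lexemes.append(current)
                current = ""
            if ch in OPERATORS:
                lexemes.append(ch)
        else:
            current += ch
    if current:
        lexemes.append(current)
    return lexemes

def classify(lexeme):
    """Regex is used only here, to validate a lexeme that was already found."""
    if lexeme in OPERATORS:
        return "operator"
    if NUMBER.fullmatch(lexeme):
        return "number"
    if IDENT.fullmatch(lexeme):
        return "identifier"
    raise ValueError(f"invalid lexeme: {lexeme!r}")

def tokenize(source):
    return [(classify(lexeme), lexeme) for lexeme in scan(source)]
```

If profiling later shows classify to be a hot spot, it is small enough to swap for hand-written checks without touching the scanner.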

+1 for pointing out that regex is often not the right solution for complex parsers
John M Gant
+5  A: 

You know, given the fact that I'm what many people call "young", I've heard too much criticism about RegEx. You know, "he had a problem and tried to use regex, now he has two problems".

Seriously, I don't get it. It is a tool like any other. If you need a simple website with some text, you don't need PHP/ASP.NET/STG44. Still, no discussion on whether any of those should be avoided. How odd.

In my experience, RegEx is probably the most useful tool I've ever encountered as a developer. It's the most useful tool when it comes to the #1 security issue: parsing user input. It has saved me hours if not days of coding and creating potentially buggy (read: crappy) code.

With modern CPUs, I don't see what's the performance issue here. I'm quite willing to sacrifice some cycles for some quality and security. (It's not always the case, though, but I think those cases are rare.)

Still, RegEx is very powerful. With great power comes great responsibility. That doesn't mean you should use it whenever you can. Use it only where its power is worth it.

As someone mentioned above, HTML parsing with RegEx is like a Russian roulette with a fully loaded gun. Don't overdo anything, RegEx included.

dr Hannibal Lecter
+1 on this post. informative.
jerbersoft
+1 and amen to that. Of course you don't use regex where a simple string substitution will do, but any programmer who can't get their heads around regex isn't in the right profession, they're not easy, but they are simply *not* *that* hard.
Cruachan
+9  A: 

You can substitute "regex" in your question with pretty much any technology and you'll find people who poorly understand the technology, or are too lazy to learn it, making such claims.

There is nothing heavy-weight about regular expressions. The most common way that programmers get themselves into trouble using regular expressions is that they try to do too much with a single regular expression. If you use regular expressions for what they're intended (simple pattern matching), you'll be hard-pressed to write procedural code that's more efficient than the equivalent regular expression. Given decent proficiency with regular expressions, the regular expression takes much less time to write, is easier to read, and can be pasted into tools such as RegexBuddy for visualization.

Jan Goyvaerts
The people at the other end of the spectrum--equally ignorant, but enthusiastic anyway--don't help matters either. The ones who bug me most are those who respond to string-manipulation questions with the pithy advice, "use regex". Excuse me? If the OP knew anything about regexes, don't you think he would have thought of them on his own? Often as not, regexes are the wrong tool for the job anyway. (I'm not talking about this site, by the way; I mostly see it in Sun's Java forums.)
Alan Moore
@Alan: Right. Although the regex pushers exist, this site is more of a "have you tried jQuery?" place. Of course, jQuery is a fantastic little library and no one in their right mind should avoid it... but it's not the tool for every job. (Specifically: sometimes you should use regex instead of jQuery)
Shog9
+4  A: 

You should also avoid floating-point numbers at all costs. That is, when you're programming in an embedded environment.

Seriously: if you're in normal software development you should actually use regex if you need to do something that can't be achieved with simpler string operations. I'd say that any normal programmer won't be able to implement something that's best done using regexps in a way that is faster than the corresponding regular expression. Once compiled, a regular expression works as a state machine that is optimized to near perfection.
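
To get the benefit of the "once compiled" part, the usual habit is to compile the pattern a single time and reuse the object. A minimal Python sketch, with a made-up pattern and function name:

```python
import re

# Compile once, at import time, and reuse the pattern object everywhere.
WORD_RE = re.compile(r"[A-Za-z]+")

def count_words(lines):
    """Count alphabetic words across many lines using the precompiled pattern."""
    return sum(len(WORD_RE.findall(line)) for line in lines)
```

Python's re module also keeps an internal cache of recently compiled patterns, so calling re.findall with a string pattern in a loop is often not much slower in practice; holding on to the compiled object just removes the dependency on that cache.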

marvesmith
+2  A: 

I've seen so many people argue about whether a given regex is correct or not that I'm starting to think the best way to write one is to ask how to do it on StackOverflow and then let the regex gurus fight it out.


I think they're especially useful in JavaScript. JavaScript is transmitted (so should be small) and interpreted from text (although this is changing in the new browsers with V8 and JIT compilation), so a good internal regex engine has a chance to be faster than an algorithm.

I'd say if there is a clear and easy way to do it with string operations, use the string operations. But if you can do a nice regex instead of writing your own state machine interpreter, use the regex.

Nosredna
++for the JS case, although some of the same logic applies to other interpreted languages as well.
Shog9
+1  A: 

I wouldn't say avoid them entirely, as they are QUITE handy at times. However, it is important to understand the fundamental mechanisms underneath. Depending on your implementation, a search can take up to exponential run-time; and even when the search is bounded by some constant number of backtracking steps, you can end up with the slowest linear run-time you ever saw.
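
A small Python experiment that shows the effect on a backtracking engine (Python's re is one). The two patterns are textbook examples rather than anything from a real project, and the input length is kept short on purpose so the script finishes quickly; no specific timings are claimed here, since they depend on the machine and the engine.

```python
import re
import time

def time_match(pattern, text):
    """Return (matched, seconds) for a single re.match call."""
    start = time.perf_counter()
    matched = re.match(pattern, text) is not None
    return matched, time.perf_counter() - start

# Both patterns accept "one or more 'a' up to the end of the string".
NESTED = r"(a+)+$"   # nested quantifiers: many ways to split the run of a's
FLAT = r"a+$"        # same language, only one way to match

text = "a" * 18 + "b"   # the trailing 'b' forces every attempt to fail

for pattern in (NESTED, FLAT):
    matched, seconds = time_match(pattern, text)
    print(f"{pattern!r}: matched={matched}, {seconds:.4f}s")
```

Neither pattern matches, but the nested one has to try every way of dividing the a's between the inner and outer repetition before giving up, so its time grows rapidly as you add characters, while the flat one stays roughly linear.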

If you want the best answer, you'll have to examine your particular implementation as well as the data you intend to search on.

From memory, Wikipedia has a decent article on regular expressions and the underlying algorithms.

San Jacinto