It seems that the choice between string parsing and regular expressions comes up on a regular basis for me, any time I need part of a string, information about said string, etc.

The reason that this comes up is that we're evaluating a SOAP header's action, after it has been parsed into something manageable via the OperationContext object for WCF, and then making decisions based on that. Right now, the simple solution seems to be basic substring'ing to keep the implementation simple, but part of me wonders if RegEx would be better or more robust. The other part of me wonders if it'd be like using a shotgun to kill a fly in our particular scenario.
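For concreteness, here is a minimal sketch of the two options being weighed, written in Java to match the snippets further down the thread (the C# equivalents are close). The action URI, the class name `ActionName`, and the "take the last path segment" rule are all assumptions made for illustration, not details from the actual service.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ActionName {
    // Hypothetical rule: the operation name is whatever follows the final '/'.
    private static final Pattern LAST_SEGMENT = Pattern.compile("/([^/]*)$");

    // Plain string handling: find the last '/' and take the rest.
    static String bySubstring(String action) {
        int slash = action.lastIndexOf('/');
        return slash < 0 ? action : action.substring(slash + 1);
    }

    // Regex version of the same rule, using one capture group.
    static String byRegex(String action) {
        Matcher m = LAST_SEGMENT.matcher(action);
        return m.find() ? m.group(1) : action;
    }
}
```

For a rule this small, both versions are about the same size, which is roughly the situation the question describes.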

So I have to ask: what's the typical threshold that people use when deciding to use RegEx over typical string parsing? Note that I'm not very strong in regular expressions, and because of this I shy away from them unless it's absolutely vital, to avoid introducing more complication than I need.

If you couldn't tell by my choice of abbreviations, this is in .NET land (C#), but I believe that doesn't have much bearing on the question.


EDIT: It seems as per my typical Raybell charm, I've been too wordy or misleading in my question. I want to apologize. I was giving some background to help give clues as to what I was doing, not mislead people.

I'm basically looking for a guideline as to when to use substring, and variations thereof, over regular expressions and vice versa. And while some of the answers may have missed this (and again, my fault), I've genuinely appreciated them and up-voted accordingly.

I hope this helps some.

+5  A: 

The regex can be

  • easier to understand
  • express more clearly the intent
  • much shorter
  • easier to change/adapt

In some situations all of those advantages would be achieved by using a regex, in others only some are achieved (the regex is not really easy to understand for example) and in yet other situations the regex is harder to understand, obfuscates the intent, longer and hard to change.

The more of those (and possibly other) advantages I gain from the regex, the more likely I am to use them.

Possible rule of thumb: if understanding the regex would take minutes for someone who is somewhat familiar with regular expressions, then you don't want to use it (unless the "normal" code is even more convoluted ;-).

Hm ... still no simple rule-of-thumb, sorry.

Joachim Sauer
+1  A: 

When your required transformation isn't basic -- but is still conceptually simple.

There's no reason to pull out Regex if you're doing a straight string replacement, for example... it's easier to just use string.Replace.

On the other hand, a complex rule with many conditionals or special cases that would take more than 50 characters of regex can be a nightmare to maintain later on if you don't explicitly write it out.
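A small sketch of that dividing line, in Java rather than C# (the class and method names are made up for the example). A literal swap needs no pattern at all, while "any run of whitespace" is conceptually simple but awkward to write out by hand:

```java
final class ReplaceDemo {
    // Straight replacement: String.replace treats both arguments as literal text.
    static String literal(String s) {
        return s.replace("colour", "color");
    }

    // Not basic, but conceptually simple: collapse each run of whitespace
    // into a single space. String.replaceAll takes a regex as its first argument.
    static String collapseWhitespace(String s) {
        return s.replaceAll("\\s+", " ");
    }
}
```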

Jimmy
A: 

I would always use a regex unless it's something very simple such as splitting a comma-separated string. If I think there's a chance the strings might one day get more complicated, I'll probably start with a regex.

I don't subscribe to the view that regexes are hard or complicated. It's one tool that every developer should learn and learn well. They have a myriad of uses, and once learned, this is exactly the sort of thing you never have to worry about ever again.

Regexes are rarely overkill - if the match is simple, so is the regex.
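As a rough illustration of "if the match is simple, so is the regex": in Java, `String.split` already takes a regular expression, so going from a bare comma to "comma with optional surrounding spaces" is a small change. The class and method names here are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class CommaList {
    // The simplest case: the pattern is just a comma.
    static List<String> split(String s) {
        return Arrays.asList(s.split(","));
    }

    // Slightly more complicated input: tolerate spaces around each comma
    // and at both ends of the string.
    static List<String> splitTrimmed(String s) {
        return Arrays.asList(s.trim().split("\\s*,\\s*"));
    }
}
```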

Draemon
Even something like a CSV parser is deceptively complex to write, given the quotation rules. (Newline characters and commas can both occur within a single field, as long as the field is enclosed in quotes.) Don't underestimate the humble CSV!!! Even with a regex, it's really hard to parse correctly :o)
benjismith
I said a comma-separated string, not a CSV file. I would never recommend anything but a dedicated library or parser for a CSV file. I've actually written a C++ CSV parser which coped with all of the above, but my father was a DFA
Draemon
+2  A: 

[W]e're evaluating a soap header's action and making decisions on that

Never use regular expressions or basic string parsing to process XML. Every language in common usage right now has perfectly good XML support. XML is a deceptively complex standard and it's unlikely your code will be correct in the sense that it will properly parse all well-formed XML input, and even if it does, you're wasting your time because (as just mentioned) every language in common usage has XML support. It is unprofessional to use regular expressions to parse XML.
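To make that concrete, here is a minimal sketch of reading a header value with a real XML parser, using the DOM API that ships with Java. It assumes the action lives in a WS-Addressing `Action` element; the class name `SoapAction` and the sample namespace choice are illustrative assumptions, not something taken from the question.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SoapAction {
    // WS-Addressing 1.0 namespace, assumed here for the example.
    static final String WSA_NS = "http://www.w3.org/2005/08/addressing";

    // Returns the text of the first wsa:Action element, or null if there is none.
    static String readAction(String envelopeXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the namespace-based lookup below
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(envelopeXml)));
            NodeList actions = doc.getElementsByTagNameNS(WSA_NS, "Action");
            return actions.getLength() == 0 ? null : actions.item(0).getTextContent().trim();
        } catch (Exception e) {
            // Malformed XML, parser configuration problems, or I/O failures.
            throw new IllegalArgumentException("Could not parse envelope", e);
        }
    }
}
```

The lookup is by namespace and local name, so it keeps working whatever prefix the sender happens to use, which is exactly the kind of detail string matching tends to get wrong.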

To answer your question, in general the usage of regular expressions should be minimized as they're not very readable. Oftentimes you can combine string parsing and regular expressions (perhaps in a loop) to create a much simpler solution than regular expressions alone.

Tmdean
I was kind of misleading here, and I apologize. The reality is that by the time we're mucking with this, it's been parsed for us via the OperationContext. I thank you for pointing this out, though!
Steven Raybell
I've updated the question a bit to improve clarity, but it seems to me that it's still confusing. I'll recraft it a bit more when I have more time. I apologize.
Steven Raybell
Sorry. I probably could have been more polite, but this is something that just drives me nuts every time I see it.
Tmdean
Oh no worries! I'm right there with ya. There's the right tool for the right job. No need in me recreating the wheel, or parser, as it may be.
Steven Raybell
+9  A: 

My main guideline is to use regular expressions for throwaway code, and for user-input validation. Or when I'm trying to find a specific pattern within a big glob of text. For most other purposes, I'll write a grammar and implement a simple parser.
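A quick sketch of the user-input validation case. The "order ID" format here is entirely made up for the example (three uppercase letters, a dash, four to six digits), as are the class and method names:

```java
import java.util.regex.Pattern;

final class OrderIdValidator {
    // Hypothetical format: three uppercase letters, a dash, then 4 to 6 digits.
    private static final Pattern ORDER_ID = Pattern.compile("[A-Z]{3}-[0-9]{4,6}");

    // matches() requires the whole input to fit the pattern, not just part of it.
    static boolean isValid(String input) {
        return input != null && ORDER_ID.matcher(input).matches();
    }
}
```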

One important guideline (that's really hard to sidestep, though I see people try all the time) is to always use a parser in cases where the target language's grammar is recursive.

For example, consider a tiny "expression language" for evaluating parenthesized arithmetic expressions. Examples of "programs" in this language would look like this:

1 + 2
5 * (10 - 6)
((1 + 1) / (2 + 2)) / 3

A grammar is easy to write, and looks something like this:

DIGIT := ["0"-"9"]
NUMBER := (DIGIT)+
OPERATOR := ("+" | "-" | "*" | "/" )
EXPRESSION := (NUMBER | GROUP) (OPERATOR EXPRESSION)?
GROUP := "(" EXPRESSION ")"

With that grammar, you can build a recursive descent parser in a jiffy.
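Here is one way such a parser might look in Java: a sketch, not the only way to do it. Each grammar rule becomes one method. The class name `ExprParser` and the choice to evaluate as `double` are assumptions for the example. Note that the grammar as written has no operator precedence and is right-recursive, so `10 - 2 - 3` evaluates as `10 - (2 - 3)`.

```java
final class ExprParser {
    private final String src;
    private int pos = 0;

    private ExprParser(String src) { this.src = src; }

    // Parses and evaluates the whole input; rejects trailing garbage.
    static double evaluate(String text) {
        ExprParser p = new ExprParser(text);
        double value = p.expression();
        p.skipSpaces();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + p.pos);
        }
        return value;
    }

    // EXPRESSION := (NUMBER | GROUP) (OPERATOR EXPRESSION)?
    private double expression() {
        skipSpaces();
        double left = (peek() == '(') ? group() : number();
        skipSpaces();
        char op = peek();
        if (op == '+' || op == '-' || op == '*' || op == '/') {
            pos++; // consume the operator
            double right = expression();
            switch (op) {
                case '+': return left + right;
                case '-': return left - right;
                case '*': return left * right;
                default:  return left / right;
            }
        }
        return left;
    }

    // GROUP := "(" EXPRESSION ")"
    private double group() {
        pos++; // consume '('
        double value = expression();
        skipSpaces();
        if (peek() != ')') {
            throw new IllegalArgumentException("Expected ')' at position " + pos);
        }
        pos++; // consume ')'
        return value;
    }

    // NUMBER := (DIGIT)+
    private double number() {
        int start = pos;
        while (peek() >= '0' && peek() <= '9') pos++;
        if (start == pos) {
            throw new IllegalArgumentException("Expected a number at position " + pos);
        }
        return Double.parseDouble(src.substring(start, pos));
    }

    // Returns the current character, or '\0' at end of input.
    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    private void skipSpaces() { while (peek() == ' ') pos++; }
}
```

Calling `ExprParser.evaluate("5 * (10 - 6)")` walks `expression`, then `group`, then `expression` again, which is the recursion that a plain regular expression has no direct way to express.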

An equivalent regular expression is REALLY hard to write, because regular expressions don't usually have very good support for recursion.

Another good example is JSON ingestion. I've seen people try to consume JSON with regular expressions, and it's INSANE. JSON objects are recursive, so they're just begging for context-free grammars and recursive descent parsers.


Hmmmmmmm... Looking at other people's responses, I think I may have answered the wrong question.

I interpreted it as "when should you use a simple regex, rather than a full-blown parser?" whereas most people seem to have interpreted the question as "when should you roll your own clumsy ad-hoc character-by-character validation scheme, rather than using a regular expression?"

Given that interpretation, my answer is: never.


Okay.... one more edit.

I'll be a little more forgiving of the roll-your-own scheme. Just... don't call it "parsing" :o)

I think a good rule of thumb is that you should only use string-matching primitives if you can implement ALL of your logic using a single predicate. Like this:

if (str.equals("DooWahDiddy")) // No problemo.

if (str.contains("destroy the earth")) // Okay.

if (str.indexOf(";") < str.length() / 2) // Not bad.

Once your conditions contain multiple predicates, then you've started inventing your own ad hoc string validation language, and you should probably just man up and study some regular expressions.

if (str.startsWith("I") && str.endsWith("Widget") &&
    (!str.contains("Monkey") || !str.contains("Pox")))  // Madness.

Regular expressions really aren't that hard to learn. Compared to a huuuuge full-featured language like C# with dozens of keywords, primitive types, and operators, and a standard library with thousands of classes, regular expressions are absolutely dirt simple. Most regex implementations support about a dozen or so operations (give or take).

Here's a great reference:

http://www.regular-expressions.info/

PS: As a bonus, if you ever do want to learn about writing your own parsers (with lex/yacc, ANTLR, JavaCC, or other similar tools), learning regular expressions is a great preparation, because parser-generator tools use many of the same principles.

benjismith
I was under the impression that "basic string parsing" implied things like one .indexOf() call and two .substring() calls, or something similar. For things as complex as this, I'd definitely go with the parser route as well.
Joachim Sauer
I'm not necessarily doing a character-by-character validation. I'm simply wanting to grab a substring, and then act on that. In general, I'm looking for what's the general guideline to choose substring'ing over regex. I believe I may not have been very clear in my question...
Steven Raybell
So, out of all of them, followed up with your recent edit, this is basically what I was looking for. Thanks!
Steven Raybell
Glad I could be of (eventual) assistance!
benjismith
With you on the whole "real parser" thing - why are people so scared of grammars?
Draemon
Good question. I think most developers are more comfortable learning a new technology that comes with an instruction manual (like "spring" or "javascript") rather than learning a new set of abstract concepts (like "parsing" or "machine learning").
benjismith
(...continued...) For me, it's the opposite. I get bored reading endless API docs from enormous enterprise frameworks, but I get really jazzed about solving tricky problems with new concepts, algorithms, and mathematical tricks. I think of myself as more of a "CS guy" than a "software engineer".
benjismith
As for parsers, would you recommend one of those tools as a good starting point? I've done some basic stuff one would suppose, but never really dove into anything involved.
Steven Raybell
In Java, my favorite tool is JavaCC. It's pretty easy to learn (if you have a regex background) and it's pretty powerful too. But for other platforms (or for multi-platform support) you can't beat ANTLR. It's somewhat more complex and difficult to learn, but it's **really** powerful.
benjismith
+1  A: 

I would agree with what benjismith said, but want to elaborate just a bit. For very simple syntaxes, basic string parsing can work well, but so can regexes. I wouldn't call them overkill. If it works, it works - go with what you find simplest. And for moderate to intermediate string parsing, a regex is usually the way to go.

As soon as you find yourself needing to define a grammar, however (i.e. complex string parsing), get back to using some sort of finite state machine or the like as quickly as you can. Regexes simply don't scale well, to use the term loosely. They get complex, hard to interpret, and eventually incapable of expressing what you need.

I've seen at least one project where the use of regexes kept growing and growing and soon they had trouble inserting new functionality. When it finally came time to do a new major release, they dumped all the regexes and went the route of a grammar parser.

dave mankoff
In one case here, I've seen a regular expression backtrack almost endlessly with just the right input. It spiked server CPUs and allowed a DoS to take place. So needless to say, I'm quite cautious when I see them come up as a solution, for this very reason.
Steven Raybell
A: 

I would think the easiest way to know when to use regular expressions and when not to is this: when your string search requires an IF/THEN statement, or anything resembling this-or-that logic, you need something better than a simple string comparison, and that is where regex shines.

TravisO