tags:

views:

361

answers:

9

I have had problems in Regexes to divide a code up into functional components. They can break or it can take a long time for them to finish. The experience raises a question:

"When should I use a parser?"

+6  A: 

You should use a parser when you are interested in the lexical or semantic meaning of text, when patterns can vary. Parsers are generally overkill when you are simply looking to match or replace patterns of characters, regardless of their functional meaning.

In your case, you seem to be interested in the meaning behind the text ("functional components" of code), so a parser would be the better choice. Parsers can, however, internally make use of regex, so they should not be regarded as mutually exclusive.


A "parser" does not automatically mean it has to be complicated, however. For example, if you are interested in C code blocks, you could simply parse nested groups of { and }. This parser would only be interested in two tokens ('{' and '}') and the blocks of text between them.

However, a simple regex comparison is not sufficient here because of the nested semantics. Take the following code:

void Foo(bool Bar)
{
    if(Bar)
    {
        f();
    }
    else
    {
        g();
    }
}

A parser will understand the overall scope of Foo, as well as each inner scope contained within Foo (the if and else blocks). As it encounters each '{' token, it "understands" their meaning. A simple search, however does not understand the meaning behind the text and may interpret the following to be a block, which we of course know is not correct:

{
    if(Bar)
    {
        f();
    }
lc
+2  A: 

Not exactly sure if it's a duplicate -- but check the following posts out:

dirkgently
A: 

Your question is a bit vague, but I guess my opinion is that when your regex becomes complicated or takes too long, and you have a reasonably defined "language" to deal with, a parser will be easier.

I don't think you can set a line in the sand and say that anything on one side can be done by regex, and on the other side you need a parser. It depends on the situation.

Epcylon
+1  A: 

You need to use a parser as soon as you have a problem regular expressions is not meant to, (or simply can't) solve. Matching (un)balanced parenthesis (recursively) for instance is one of those problems. Eventhough some flavours, like PCRE, get you very far they don't win over a hand written parser.

Martijn Laarman
+2  A: 

There are a few compelling use cases for parsers over regular expressions. You should use a parser instead of a regular expression:

  • Whenever the kinds of expressions you'd like to work with are more complex than few semantic entities (tags, variables, phone numbers, etc.).
  • Whenever you need to know the semantic meaning of text instead of merely matching a pattern. For example, if you're trying to match all possible ways of writing a phone number, a parser is probably better than a regex. If you're trying to match a specific pattern that happens to correspond to a phone number, a regex is probably fine.
  • Whenever input can't be guaranteed to be well-formed.
  • If you're working entirely within the structure of a well-defined language that has a syntax specification (C#, XML, C++, Ruby, etc.), there's already going to be a parser, so you have some work done for you.
John Feminella
+1 for the concrete examples.
Masi
+1  A: 

Here are some use cases, courtesy of Steve Yegge: Rich Programmer Food.

Yuval F
+1 for the personal blog post. I bought three books about compilers, recurrence and such related things after reading =)
Masi
Thanks. In that case, take a look at:http://stackoverflow.com/questions/725372/which-programming-languages-text
Yuval F
+2  A: 

you need a parser when:

  1. language is not regular (wikipedia)
  2. you need a parse tree (more generally when you need to execute actions contextually)
  3. when the resulting regular expression is too obscure/complex

My 2 cents.

dfa
+2  A: 

The Dragon Book has a small section about what you can't use Regular Expressions for:

  • They can't detect repetition of a string, meaning you can't match constructs like 'wcw', where w is the same succesion of symbols
  • You can only detect a fixed number of repetition or an unspecified number of repetitions, which is to say you can't use an already parsed token to determine the number of repetitions, something like: 'n s1 s2 ... sn'
  • "Regular Expressions can't be used to describe balanced or nested constructs, [like] the set of strings of all balanced parentheses"


For 1 and 2, there's a simple explanation, you can't capture a substring so you can match it later. If you would, than you would be using a parser. Just think of how you would be using regular expressions for those cases, and you will intuitively come to the conclusion you can't. :)

For 3, it's the same as the problem in K&R for parsing string literals. You can't just say a string literal is between the first ' " ' and the second ' " ', but what happens when there's an escaped quote(\")?

As for the relation to Russel's paradox, I think you're hunch is right, because the problem is regex's limited introspection capabilities. The book has references to the proofs. If you want to, I can look them up for you.

Andrei Vajna II
What are the premises for each argument?1. no inference about itself 2. since memory is limited, tokens must be finite 3. all -- I don't know why but when I read the writing I started to think about Russell's paradox. Can you reduce their proofs to it?
Masi
I've updated my answer.
Andrei Vajna II
@Asdrei Vajna II Please, have a try at "%s@\(h\(el\)lo\)@the string is \1 and the substring is \2 @", when you have only a line with a word "hello".
Masi
When I thought about the paradox, I think the definitions for "string" and "substring" are different, what I presumed in my comment. When I matched a substring, I did not match the whole string "\(h\(el\)lo\)". When I match a whole string "/\(h\(el\)lo/\)", I would not match a substring "el".
Masi
And when you try to prove each case, you need to reduce the problem to similar situations. When I think about a parser, I spontaneously start to think the Russel thing. It hard for me to think what each writer is meaning by words like string or substring.
Masi
@Asdrei Vajno II: Can you post the page numbers? I ordered the Dragon book a some time ago. Hopefully, it precisely defines terms.
Masi
+1 for the explanations :)
Masi
It's page 98, Chapter 3.3 - "Specification of Tokens", the last section of the chapter, called "Nonregular Sets".
Andrei Vajna II
A: 

There are things that regex cannot do while parser can do.
For example:

Start ::= (Inner);
Inner ::= Start | x;

Regular expression wouldn't be able to do that because regex can't track if there are same number of open and close parenthesis. That is why when you are trying to tokenize and parse a large file, parser is expected to be used, while regex can simply find special pattern(s) inside the file.

bLee