ansaurus

Question

Answer 1

+6 A:

You should use a parser when you are interested in the lexical or semantic meaning of text, when patterns can vary. Parsers are generally overkill when you are simply looking to match or replace patterns of characters, regardless of their functional meaning.

In your case, you seem to be interested in the meaning behind the text ("functional components" of code), so a parser would be the better choice. Parsers can, however, internally make use of regex, so they should not be regarded as mutually exclusive.

A "parser" does not automatically mean it has to be complicated, however. For example, if you are interested in C code blocks, you could simply parse nested groups of { and }. This parser would only be interested in two tokens ('{' and '}') and the blocks of text between them.

However, a simple regex comparison is not sufficient here because of the nested semantics. Take the following code:

void Foo(bool Bar)
{
    if(Bar)
    {
        f();
    }
    else
    {
        g();
    }
}

A parser will understand the overall scope of Foo, as well as each inner scope contained within Foo (the if and else blocks). As it encounters each '{' token, it "understands" their meaning. A simple search, however does not understand the meaning behind the text and may interpret the following to be a block, which we of course know is not correct:

{
    if(Bar)
    {
        f();
    }

lc 2009-04-11 12:35:19

Answer 2

+2 A:

Not exactly sure if it's a duplicate -- but check the following posts out:

dirkgently 2009-04-11 12:35:47

Answer 3

A:

Your question is a bit vague, but I guess my opinion is that when your regex becomes complicated or takes too long, and you have a reasonably defined "language" to deal with, a parser will be easier.

I don't think you can set a line in the sand and say that anything on one side can be done by regex, and on the other side you need a parser. It depends on the situation.

Epcylon 2009-04-11 12:36:34

Answer 4

+1 A:

You need to use a parser as soon as you have a problem regular expressions is not meant to, (or simply can't) solve. Matching (un)balanced parenthesis (recursively) for instance is one of those problems. Eventhough some flavours, like PCRE, get you very far they don't win over a hand written parser.

Martijn Laarman 2009-04-11 12:36:56

Answer 5

+2 A:

There are a few compelling use cases for parsers over regular expressions. You should use a parser instead of a regular expression:

Whenever the kinds of expressions you'd like to work with are more complex than few semantic entities (tags, variables, phone numbers, etc.).
Whenever you need to know the semantic meaning of text instead of merely matching a pattern. For example, if you're trying to match all possible ways of writing a phone number, a parser is probably better than a regex. If you're trying to match a specific pattern that happens to correspond to a phone number, a regex is probably fine.
Whenever input can't be guaranteed to be well-formed.
If you're working entirely within the structure of a well-defined language that has a syntax specification (C#, XML, C++, Ruby, etc.), there's already going to be a parser, so you have some work done for you.

John Feminella 2009-04-11 12:39:33

+1 for the concrete examples.

Masi 2009-04-11 13:50:17

Answer 6

+1 A:

Here are some use cases, courtesy of Steve Yegge: Rich Programmer Food.

Yuval F 2009-04-11 13:09:23

+1 for the personal blog post. I bought three books about compilers, recurrence and such related things after reading =)

Masi 2009-04-11 20:52:40

Thanks. In that case, take a look at:http://stackoverflow.com/questions/725372/which-programming-languages-text

Yuval F 2009-04-12 03:50:44

Answer 7

+2 A:

you need a parser when:

language is not regular (wikipedia)
you need a parse tree (more generally when you need to execute actions contextually)
when the resulting regular expression is too obscure/complex

My 2 cents.

dfa 2009-04-11 14:10:41

Answer 8

+2 A:

The Dragon Book has a small section about what you can't use Regular Expressions for:

They can't detect repetition of a string, meaning you can't match constructs like 'wcw', where w is the same succesion of symbols
You can only detect a fixed number of repetition or an unspecified number of repetitions, which is to say you can't use an already parsed token to determine the number of repetitions, something like: 'n s1 s2 ... sn'
"Regular Expressions can't be used to describe balanced or nested constructs, [like] the set of strings of all balanced parentheses"

For 1 and 2, there's a simple explanation, you can't capture a substring so you can match it later. If you would, than you would be using a parser. Just think of how you would be using regular expressions for those cases, and you will intuitively come to the conclusion you can't. :)

For 3, it's the same as the problem in K&R for parsing string literals. You can't just say a string literal is between the first ' " ' and the second ' " ', but what happens when there's an escaped quote(\")?

As for the relation to Russel's paradox, I think you're hunch is right, because the problem is regex's limited introspection capabilities. The book has references to the proofs. If you want to, I can look them up for you.

Andrei Vajna II 2009-04-12 23:09:04

What are the premises for each argument?1. no inference about itself 2. since memory is limited, tokens must be finite 3. all -- I don't know why but when I read the writing I started to think about Russell's paradox. Can you reduce their proofs to it?

Masi 2009-04-13 01:03:36

I've updated my answer.

Andrei Vajna II 2009-04-13 09:51:59

@Asdrei Vajna II Please, have a try at "%s@\(h\(el\)lo\)@the string is \1 and the substring is \2 @", when you have only a line with a word "hello".

Masi 2009-04-13 11:32:24

When I thought about the paradox, I think the definitions for "string" and "substring" are different, what I presumed in my comment. When I matched a substring, I did not match the whole string "\(h\(el\)lo\)". When I match a whole string "/\(h\(el\)lo/\)", I would not match a substring "el".

Masi 2009-04-13 11:38:29

And when you try to prove each case, you need to reduce the problem to similar situations. When I think about a parser, I spontaneously start to think the Russel thing. It hard for me to think what each writer is meaning by words like string or substring.

Masi 2009-04-13 11:42:51

@Asdrei Vajno II: Can you post the page numbers? I ordered the Dragon book a some time ago. Hopefully, it precisely defines terms.

Masi 2009-04-13 11:44:44

+1 for the explanations :)

Masi 2009-04-13 11:45:49

It's page 98, Chapter 3.3 - "Specification of Tokens", the last section of the chapter, called "Nonregular Sets".

Andrei Vajna II 2009-04-13 14:59:16

Answer 9

A:

There are things that regex cannot do while parser can do.
For example:

Start ::= (Inner);
Inner ::= Start | x;

Regular expression wouldn't be able to do that because regex can't track if there are same number of open and close parenthesis. That is why when you are trying to tokenize and parse a large file, parser is expected to be used, while regex can simply find special pattern(s) inside the file.

bLee 2009-04-12 23:19:12

ansaurus

tags:

views:

answers:

When should I use a parser?

related questions