views: 247
answers: 7

This is just a question out of curiosity, since I've been needing to get more and more into parsing and regex lately. It seems that for the parsing questions I come across in my searches, whenever regex comes up, someone always ends up saying "regex isn't good for that, use such-and-such parser instead". As I've come to understand regex better, I think most things are possible, just rather complex and time-consuming, since you have to account for many different possibilities, and of course it has to be combined with conditional statements and loops to build any sort of parser. So I'm wondering: is regex what is used to build most parsers, or is there some other method? I ask because I may need to build some fairly complex custom parsers soon, where there isn't necessarily an existing one to use.

Thanks for any info; I can't seem to find a direct answer to this.

+2  A: 

A 'regex' as you know it is a particular notation for describing deterministic finite automata. A DFA is a recognizing device, and in that sense regexes do parse: when you use a regex to match something, you are parsing a string to align it with the pattern, and when you use capturing parentheses to chop a string into pieces, you are parsing.

DFAs are formally defined as recognizers for a particular class of languages called 'regular languages' (thanks to Gumbo for reminding me). Many important parsing tasks involve languages that are not regular.

Thus, DFAs are not a good approach to many parsing problems. The most famous examples around here are XML and HTML. There are many reasons, but I'll give just one: these formats are fundamentally tree structures. To parse them, a program has to maintain state as it descends the tree, and regexes can't do that.

Parsers defined by a grammar (such as LR(k) or LL(k) parsers) do that, and so do hand-coded top-down parsers.
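
To make that concrete, here is a minimal sketch in Python (illustrative only, the names are invented): a hand-coded top-down parser that keeps nesting state on the call stack, which is exactly what a plain regex cannot do:

 # Illustrative sketch: a hand-coded top-down parser that keeps nesting
 # state on the call stack, which a plain regex cannot do.
 def parse_group(s, i=0):
     """Parse a balanced '(...)' group starting at s[i]; return the index
     just past the matching close paren."""
     if i >= len(s) or s[i] != '(':
         raise ValueError("expected '(' at position %d" % i)
     i += 1
     while i < len(s) and s[i] != ')':
         if s[i] == '(':
             i = parse_group(s, i)   # recurse for each nested group
         else:
             i += 1                  # ordinary character
     if i >= len(s):
         raise ValueError("unbalanced: missing ')'")
     return i + 1                    # consume the matching close paren

 print(parse_group("(a(b)(c(d)))"))  # prints 12: the whole string is one group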

There are books and books on the various alternative parsing technologies that are commonly applied to parsing things like C++ or XML.

bmargulies
Both regular expressions and deterministic finite automata are ways to describe regular languages.
Gumbo
+2  A: 

(Most) parsers are created for recursive languages, i.e. languages that have recursive features. Regexes can't handle recursion, so they aren't used for parser construction (without extra hacks, à la Perl Markdown). However, regexes are used for building lexers, since they make life much easier there.
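
As an aside, a sketch of what such a hack looks like (illustrative; this uses the third-party regex module for Python, which, like modern PCRE, supports pattern recursion):

 # The third-party `regex` module (like Perl/PCRE) supports the recursion
 # "hack": (?R) re-enters the whole pattern. The built-in `re` cannot do this.
 import regex

 balanced = regex.compile(r"\((?:[^()]|(?R))*\)")
 print(bool(balanced.fullmatch("(a (b (c)))")))  # True
 print(bool(balanced.fullmatch("(a (b (c))")))   # False: unbalanced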

Aviral Dasgupta
+2  A: 

Well, building a parser is pretty complex, and you can use regex, but that's not the only thing you use. I suggest reading the Dragon Book.

These days, in my opinion, you should use a parser generator, because while you can do it from scratch, it's neither simple nor quick. Generally speaking, you have to consider regexes and finite state automata for lexical analysis; context-free grammars, LL parsers, bottom-up parsers, and LR parsers for syntax analysis; and so on.

dierre
thanks, sounds good, will check out parser generators
Rick
Why do people always recommend the Dragon Book even for beginner questions about parsing? That book is for people who really want to study parsing in an academic fashion. There are better, more practical books out there...
milan1612
Well, I've used it; to me it's just like the bible. Why don't you suggest something else? I'm actually curious.
dierre
I really like 'Parsing Techniques - A Practical Guide' (http://www.cs.vu.nl/~dick/PTAPG.html)
milan1612
@milan1612 - Just looked at Parsing Techniques... looks like a very nice book indeed. It doesn't seem like a better recommendation for a parsing beginner than Aho to me, though. It's just as technical; perhaps more so, since the entire book is focused on parsing, while Aho mercifully stops after essentially one chapter.
Ira Baxter
+1  A: 

Typically, you use some sort of pattern-matching (not necessarily regular expressions) in a lexer, to turn your stream of characters into a stream of tokens, and have your parser look at those tokens instead of the raw character input.
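
A rough sketch of that split, in Python (token names and patterns invented for illustration):

 # Rough sketch of the character-stream to token-stream split; the token
 # names and patterns here are invented for illustration.
 import re

 TOKENS = re.compile(r"""
     (?P<NUMBER>\d+)
   | (?P<IDENT>[A-Za-z_]\w*)
   | (?P<OP>[+\-*/=()])
   | (?P<SKIP>\s+)
 """, re.VERBOSE)

 def tokenize(text):
     """Yield (kind, text) pairs; the parser consumes these, not characters."""
     for m in TOKENS.finditer(text):
         if m.lastgroup != "SKIP":
             yield m.lastgroup, m.group()

 print(list(tokenize("x = 40 + 2")))
 # [('IDENT', 'x'), ('OP', '='), ('NUMBER', '40'), ('OP', '+'), ('NUMBER', '2')]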

If you're looking to produce your own parsers, you're probably better off looking at something like Bison to assist with that.

Anon.
thanks, yeah basically I'm looking for the most efficient way to do this sort of thing, will look into that
Rick
+2  A: 

Regexes can be used to parse a certain class of languages (regular, or finite-state, languages), but their power is limited compared to other formalisms, and as you mention, they quickly become unwieldy and hard to maintain.

For example, it is not possible to write a regex that ensures each open parenthesis has a matching close parenthesis - something that most languages have in their expression syntax.
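
You can see the problem by trying: each additional nesting level forces a strictly larger pattern (a quick Python illustration):

 import re
 # Each fixed nesting depth demands a strictly larger pattern; no single
 # true regex covers arbitrary depth.
 depth2 = re.compile(r"\((?:[^()]|\([^()]*\))*\)")  # handles up to two levels
 print(bool(depth2.fullmatch("(a (b) c)")))    # True
 print(bool(depth2.fullmatch("(a (b (c)))")))  # False: one level too deep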

Regexes are usually used for tokenization, and the tokens are then combined to build the desired syntax.

mdma
yeah, I found myself starting to do this sort of thing with regex, where I break the input down and then re-evaluate each token to check for certain conditions. I imagine this is how a higher-level parser would work, at least one that is extracting data from something that may not necessarily be uniform (broken HTML, etc.)
Rick
+4  A: 

Typically, you'll use (at least) two types of tools in building your parser.

The first part is lexical analysis -- separating characters into tokens and filtering out comments and whitespace. That part is typically done with regular expressions. Well, it's even more typically done using a scanner generator that converts a collection of pairs of regular expressions and code into a program that executes the corresponding code when it recognizes the regular expressions. This turns out to be more efficient than testing each regular expression each time, and it also works better for various other reasons. FLEX is a common tool for this in C.
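
In miniature, the pairs-of-regex-and-code model looks like this (a Python sketch of the idea only; a real scanner generator like FLEX compiles the rules into a single DFA instead of trying each pattern in turn):

 # Sketch of the (regular expression, code) pairs model. A real scanner
 # generator compiles these into one DFA; this naive version just tries
 # each pattern in turn at the current position.
 import re

 rules = [
     (r"\d+",          lambda s: ("NUMBER", int(s))),
     (r"[A-Za-z_]\w*", lambda s: ("IDENT", s)),
     (r"\s+",          lambda s: None),   # whitespace produces no token
 ]
 scanner = [(re.compile(p), action) for p, action in rules]

 def scan(text):
     i = 0
     while i < len(text):
         for pattern, action in scanner:
             m = pattern.match(text, i)
             if m:
                 token = action(m.group())   # run the code for this regex
                 if token is not None:
                     yield token
                 i = m.end()
                 break
         else:
             raise ValueError("no rule matches at position %d" % i)

 print(list(scan("foo 42 bar")))
 # [('IDENT', 'foo'), ('NUMBER', 42), ('IDENT', 'bar')]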

The second part of your parser is the grammar. The most typical tool for this is another program-generator that accepts a context-free grammar (CFG) annotated with rules for interpreting the component "parts of speech", as it were. A CFG is able to express things like balanced parentheses, which a regular expression cannot (unless it's been extended with CF features, making it not strictly "regular" in the mathematical sense). But a CFG with rules is very nice because you can attach a semantic meaning to the phrase structure of your language. BISON is a common tool for this part in C.

But I actually told you a little lie. You see, every real programming language has parts that cannot be expressed within a context-free framework. For example, you need to connect the definition of a variable with its uses, so that you know what instructions to generate and whether an operation on it is legal. That's typically considered outside the scope of parsing, but there are such things as "attribute grammars", which are CFGs extended with features that make even these context dependencies much easier to encode and work with.

Now, there's no rule that says you HAVE to use such tools. Many simple grammars are easy enough to process with hand-written tools. For example, LISP's S-expressions can be simply scanned as:

If it starts with a digit, read a number. If it starts with a letter, read a symbol. If it's a space, skip it. If it's an open-paren, then skip it, recurse this routine for a value, and expect a close paren.
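
A rough Python rendering of that recipe (strings and error recovery omitted):

 # Rough rendering of the S-expression recipe above; strings and error
 # recovery are omitted.
 def read_sexpr(s, i=0):
     while i < len(s) and s[i].isspace():   # space: skip it
         i += 1
     if s[i].isdigit():                     # digit: read a number
         j = i
         while j < len(s) and s[j].isdigit():
             j += 1
         return int(s[i:j]), j
     if s[i] == '(':                        # open paren: skip it and recurse
         i += 1
         items = []
         while True:
             while i < len(s) and s[i].isspace():
                 i += 1
             if i >= len(s):
                 raise ValueError("expected ')'")
             if s[i] == ')':                # the expected close paren
                 return items, i + 1
             value, i = read_sexpr(s, i)
             items.append(value)
     j = i                                  # letter: read a symbol
     while j < len(s) and not s[j].isspace() and s[j] not in "()":
         j += 1
     return s[i:j], j

 print(read_sexpr("(add 1 (mul 2 3))")[0])  # ['add', 1, ['mul', 2, 3]]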

Well, there are a few more complications for strings and what-have-you, but that's the basic idea. Parsing FORTH is even simpler, because it doesn't build a recursive data structure.

Anyway, that should get you going on whatever your project is.

Ian
thanks for the in-depth answer, I have a good understanding now of how to proceed with my small projects. I think it's simple enough that I can just write some small CFG rules in PHP to manipulate my regex tokens
Rick
Glad I could help. Perhaps you'd like to look at "LIME", a parser generator for PHP? It's on SourceForge. It comes with a couple of examples you can use to get started.
Ian
+1  A: 

Regular expressions are defined over arbitrary tokens, but most programmers encounter them only in the context of strings of characters, and so it is easy to believe they are only useful for strings.

As a pure capability, regular expressions (actually, a single regular expression) cannot parse any language that requires a context-free grammar.

What makes context-free grammars different from regular expressions is that you can define a large set of named "recognizers" for the subgrammars of a language, and these can refer to one another recursively. Such rules can all be limited to just the simple form:

 LHS =  RHS1 RHS2 ... RHSn ;

(the so-called "Backus-Naur form", or BNF), where the LHS and each RHSi name primitive language elements or nonterminals in the language. (I build a very complex language-processing tool that uses just this form; you need more rules, but it is very usable.)

But most people writing grammars want a more expressive form, and so use an "extended BNF" (EBNF). If you examine these EBNFs closely, what they generally do is add ideas from regular expressions (alternation, Kleene star/plus) to the BNF formalism. Thus you can find EBNFs with "*" and "+".

So, what follows is an EBNF grammar for EBNF itself, expressed using regex ideas:

 EBNF = RULE+ ;
 RULE = IDENTIFIER '=' ALTERNATIVES ';' ;
 ALTERNATIVES = RHS ( '|' RHS )* ;
 RHS = ITEM* ;
 ITEM = IDENTIFIER | QUOTEDTOKEN | '(' ALTERNATIVES ')' | ITEM ( '*' | '+' ) ;

So, regular expression ideas can be used to express grammars. A parser generator that accepts such notation (or you, working by hand) is then needed to turn a grammar instance into a parser.
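
For instance, the Kleene star in ALTERNATIVES = RHS ( '|' RHS )* ; becomes a simple loop when you code the parser by hand. An illustrative Python fragment, with the RHS parsing stubbed out so the sketch runs:

 # How a starred EBNF rule maps onto a loop in a hand-written parser.
 # parse_rhs is stubbed to consume one token so the sketch is runnable.
 def parse_rhs(toks):
     return toks.pop(0)            # stub: a real RHS is ITEM*

 def parse_alternatives(toks):
     # ALTERNATIVES = RHS ( '|' RHS )* ;
     alts = [parse_rhs(toks)]
     while toks and toks[0] == '|':
         toks.pop(0)               # consume '|'
         alts.append(parse_rhs(toks))
     return alts

 print(parse_alternatives(["a", "|", "b", "|", "c"]))  # ['a', 'b', 'c']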

Ira Baxter