tags:

views:

1371

answers:

7

In the Stack Overflow podcast #36 (http://blog.stackoverflow.com/2009/01/podcast-36/), this opinion was expressed: Once you understand how easy it is to set up a state machine, you’ll never try to use a regular expression inappropriately ever again.

I've done a bunch of searching. I've found some academic papers and other complicated examples, but I'd like to find a simple example that would help me understand this process. I use a lot of regular expressions, and I'd like to make sure I never use one "inappropriately" ever again.

+2  A: 

I'm sure someone has better examples, but you could check this post by Phil Haack, which has an example of a regular expression and a state machine doing the same thing (there's a previous post with a few more regex examples in there as well I think..)

Check the "HenriFormatter" on that page.

Sciolist
+9  A: 

Sure, although you'll need more complicated examples to truly understand how REs work. Consider the following RE:

^[A-Za-z][A-Za-z0-9_]*$

which is a typical identifier (must start with alpha and can have any number of alphanumeric and undescore characters following, including 0). The following pseudo-code shows how this can be done with a finite state machine:

state = FIRSTCHAR
for char in all_chars_in(string):
    if state == FIRSTCHAR:
            if char is not in the set "A-Z" or "a-z":
                error "Invalid first character"
            state = SUBSEQUENTCHARS
            next char
    if state == SUBSEQUENTCHARS:
            if char is not in the set "A-Z" or "a-z" or "0-9" or "_":
                error "Invalid subsequent character"
            state = SUBSEQUENTCHARS
            next char

Now, as I said, this is a very simple example. It doesn't show how to do greedy/nongreedy matches, backtracking, matching within a line (instead of the whole line) and other more esoteric features of state machines that are easily handled by the RE syntax.

That's why REs are so powerful. The actual finite state machine code required to do what a one-liner RE can do is usually very long and complex.

The best thing you could do is grab a copy of some lex/yacc (or equivalent) code for a specific simple language and see the code it generates. It's not pretty (it doesn't have to be since it's not supposed to be read by humans, they're supposed to be looking at the lex/yacc code) but it may give you a better idea as to how they work.

paxdiablo
+1  A: 
BobbyShaftoe
Looks like the image link is broken. All I see is a purple box with "gallery.hd.org" in it.
joel.neely
SETEC ASTRONOMY ?
paxdiablo
@joel.neely, hm I don't know the image works for me. SO needs to do image hosting. @Pax, not sure what you mean. :)
BobbyShaftoe
It's a quote from the Sneakers movie.
paxdiablo
Oh I see. :) Ok, maybe it will work now.
BobbyShaftoe
+9  A: 
Alabaster Codify
That site's not that useful - put in [A-Za-z][A-Za-z0-9]{0,4} and see what claptrap is comes out with :-)
paxdiablo
Pax, ouch :/ too bad, such a nice site but can't compile that simple regex
Johannes Schaub - litb
Yep, can't handle repeat qualifiers unfortunately... But you *can* have semantically equivalent claptrap with [A-Za-z]\w{0,4}! He probably left them out for simplicity's sake. Plus repeat qualifiers can really blow out your FSM - the UI would quickly become useless.
Alabaster Codify
"This particular implementation doesn't know about anchors, assertions, non-greedy and bounded qualifiers, collation elements, or backreferences." - that's quite a laundry list of no-can-do's.
paxdiablo
Admittedly, the first RE I gave it used a repeat qualifier, but I think non-greedy and repeat qualifiers are the only things in the TODO list I use day to day, if at all! It's still a useful tool for learning how REs map to FSMs, as the OP requested.
Alabaster Codify
+2  A: 

Is the question "How do I choose the states and the transition conditions?", or "How do I implement my abstract state machine in Foo?"

How do I choose the states and the transition conditions?

I usually use FSMs for fairly simple problems and choose them intuitively. In http://stackoverflow.com/questions/328387/regex-to-replace-all-n-in-a-string-but-no-those-inside-code-code-tag/328403#328403, I just looked at the parsing problem as one of being either Inside or outside a tag pair, and wrote out the transitions from there (with a beginning and ending state to keep the implementation clean).

How do I implement my abstract state machine in Foo?

If your implementation language supports a structure like c's switch statement, then you switch on the current state and process the input to see which action and/or transition too perform next.

Without switch-like structures, or if they are deficient in some way, you if style branching. Ugh.

Written all in one place in c the example I linked would look something like this:

token_t token;
state_t state=BEGIN_STATE;
do {
   switch ( state.getValue() ) {
   case BEGIN_STATE;
      state=OUT_STATE;
      break;
   case OUT_STATE:
      switch ( token.getValue() ) {
         case CODE_TOKEN:
            state = IN_STATE;
            output(token.string());
            break;
         case NEWLINE_TOKEN;
            output("<break>");
            output(token.string());
            break;
         ...
      }
      break;
   ...
   }
} while (state != END_STATE);

which is pretty messy, so I usually rip the state cases out to separate functions.

dmckee
+4  A: 

A rather convenient way to help look at this to use python's little-known re.DEBUG flag on any pattern:

>>> re.compile(r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>', re.DEBUG)
literal 60
subpattern 1
  in
    range (65, 90)
  max_repeat 0 65535
    in
      range (65, 90)
      range (48, 57)
at at_boundary
max_repeat 0 65535
  not_literal 62
literal 62
subpattern 2
  min_repeat 0 65535
    any None
literal 60
literal 47
groupref 1
literal 62

The numbers after 'literal' and 'range' refer to the integer values of the ascii characters they're supposed to match.

ʞɔıu
+1, was waiting for this little gem to surface. Does anyone know where its output is documented? I've never found anything..
Alabaster Codify
This is a great answer, thanks for that.
sykora
+1  A: 

Browsing other [fsm] tagged questions on StackOverflow brought up http://stackoverflow.com/questions/454557/re-fsm-generator, the answer to which are of passing relevance.

dmckee