views:

55

answers:

3

Look for a non-technical explanation of the difference between DFA vs NFA engines based on their capabilities and limitations. Thanks!

A: 

DFA engines analyze a regex to understand of every type of string it could possibly match: no match, partial match, full match. The engine takes those patterns and checks the entire input, and only outputs results after it's finished processing the input. DFA engines use more CPU-cycles and memory for the initialization of a regex, but do not need to backtrack, so they will tend to do individual matches faster. A pure DFA can't provide backreferences. Implementations: awk, egrep, MySQL, flex, lex, Procmail

NFA engines analyze inputs on a first progressively iterating through the regex input. NFA engines support backtracking, which means that they allow options patterns to match. Implementations: Java, grep, .NET, PCRE library, Perl, PHP, Python, Ruby, sed, vi

SOURCES:


Comparison of the performance characteristics of DFA vs NFA.


UPDATES: Added link to DFA vs NFA performance comparison.


COMMENT: Still doesn't make sense to me, but better than no answer at all.

blunders
NFA engines do *not* necessarily support backtracking. Two rather interesting ones, namely RE2 and TRE, for instance, use NFA simulation, which is generally much faster.
Eamon Nerbonne
**@Eamon_Nerbonne:** Yeah, was guessing there was something wrong with my understanding. Thanks for pointing out RE2 and TRE.
blunders
+1  A: 

A simple, nontechnical explanation, paraphrased from Jeffrey Friedl's book Mastering Regular Expressions.

CAVEAT:

While this book is generally considered the "regex bible", there appears some controversy as to whether the distinction made here between DFA and NFA is actually correct. I'm not a computer scientist, and I don't understand most of the theory behind what really is a "regular" expression, deterministic or not. After the controversy started, I deleted this answer because of this, but since then it has been referenced in comments to other answers. I would be very interested in discussing this further - can it be that Friedl really is wrong? Or did I get Friedl wrong (but I reread that chapter yesterday evening, and it's just like I remembered...)?

Edit: It appears that Friedl and I are indeed wrong. Please check out Eamon's excellent comments below.


Original answer:

A DFA engine steps through the input string character by character and tries (and remembers) all possible ways the regex could match at this point. If it reaches the end of the string, it declares success.

Imagine the string AAB and the regex A*AB. We now step through our string letter by letter.

  1. A:

    • First branch: Can be matched by A*.
    • Second branch: Can be matched by ignoring the A* (zero repetitions are allowed) and using the second A in the regex.
  2. A:

    • First branch: Can be matched by expanding A*.
    • Second branch: Can't be matched by B. Second branch fails. But:
    • Third branch: Can be matched by not expanding A* and using the second A instead.
  3. B:

    • First branch: Can't be matched by expanding A* or by moving on in the regex to the next token A. First branch fails.
    • Third branch: Can be matched. Hooray!

A DFA engine never backtracks in the string.


An NFA engine steps through the regex token by token and tries all possible permutations on the string, backtracking if necessary. If it reaches the end of the regex, it declares success.

Imagine the same string and the same regex as before. We now step through our regex token by token:

  1. A*: Match AA. Remember the backtracking positions 0 (start of string) and 1.
  2. A: Doesn't match. But we have a backtracking position we can return to and try again. The regex engine steps back one character. Now A matches.
  3. B: Matches. End of regex reached (with one backtracking position to spare). Hooray!
Tim Pietzcker
**@Tim_Pietzcker:** Yes, that really was "a simple, nontechnical explanation" of how a DFA loads a regex and processes input - again, thanks!
blunders
**@Tim_Pietzcker:** Thanks for posting the NFA engine steps too. Deleted the other question... :-)
blunders
This answer is inaccurate - catastrophic backtracking is orthogonal to the whole NFA/DFA distinction. And what you describe as a DFA is actually an NFA (using the typical superposition of states) - DFA's are only ever in one state, hence "deterministic", and NFA's may be in multiple states, hence non-deterministic.
Eamon Nerbonne
In some sense, it's all just terminology. Having said that, DFA's are deterministic (and it's in the name) and NFA's are non-deterministic (again, as the name says). That has a pretty-straightforward reason: DFA are always in exactly one state, and when presented with a character, there's always one unique (deterministic) next state that character always corresponds to. So, you first explanation is a fine regex algorithm, but it's not a DFA - obviously, and as you describe, there can be multiple options and you never know which one is "best" until the string ends.
Eamon Nerbonne
Your second algorithm labeled **NFA engine** is indeed one possible NFA implementation. It resolves the same ambiguity are your first (also NFA) algorithm differently: namely by just picking one option and backtracking as needed. So, it is indeed an NFA, but it's not the *only possible* NFA, as your first method demonstrates: that deals with the same nondeterminism differently. I suppose you could call this a backtracking NFA engine, to distinguish the two.
Eamon Nerbonne
Finally, it's worth nothing that in *any* finite state automaton, as the name implies, the states are *finite* - more specifically, after embedding any relevant info into the state, that tuple still needs to have a finite number of options. And *that* means that strictly speaking, perl-compatible engines aren't really *any* type of FSA, neither DFA nor NFA: after all, you can include arbitrary-length backreferences, and there's an **infinite** number of arbitrary length strings.
Eamon Nerbonne
The distinction is critically important to performance because the infinite state space mean that you can't precompile the NFA nor efficiently execute it using the first algorithm. In the general case backtracking breaks when faces with regexes (and you come across these *in practice*) which cause catastrophic backtracking.
Eamon Nerbonne
So, there might be some non-NFA means of doing backreferences efficiently (I suspect that actually there is), but NFA can't be used to deal with backreferences. Strictly speaking they can't do it at all, and loosely speaking and permitting unbounded annotation they can't do it reliably.
Eamon Nerbonne
@Eamon: Thank you very much for this insightful information. It's a shame that upvoting your comments won't give you rep.
Tim Pietzcker
+2  A: 

DFAs and NFAs have exactly the same capabilities and limitations. The only difference is notational convenience.

A finite automaton is a processor that has states and reads input, each input character potentially setting it into another state. For example, a state might be "just read two Cs in a row" or "am starting a word". These are usually used for quick scans of text to find patterns, such as lexical scanning of source code to turn it into tokens.

A deterministic finite automaton is in one state at a time, which is implementable. A nondeterministic finite automaton can be in more than one state at a time: for example, in a language where identifiers can begin with a digit, there might be a state "reading a number" and another state "reading an identifier", and an NFA could be in both at the same time when reading something starting "123". Which state actually applies would depend on whether it encountered something not numeric before the end of the word.

Now, we can express "reading a number or identifier" as a state itself, and suddenly we don't need the NFA. If we express combinations of states in an NFA as states themselves, we've got a DFA with a lot more states than the NFA, but which does the same thing.

It's a matter of which is easier to read or write or deal with. DFAs are easier to understand per se, but NFAs are generally smaller.

David Thornley
**@David Thornley:** Thank you, was looking for confirmation that there really are not major difference between the too... just wasn't 100% sure. Again, thanks!
blunders
+1 this is correct.
Eamon Nerbonne
Ok, a now-deleted answer confused NFA's with DFA's. I've seen people do that before, and apparently that's down to an otherwise useful book, or so this claims anyhow: http://fanf.livejournal.com/37166.html
Eamon Nerbonne