Look for a non-technical explanation of the difference between DFA vs NFA engines based on their capabilities and limitations. Thanks!
views:
55answers:
3DFA engines analyze a regex to understand of every type of string it could possibly match: no match, partial match, full match. The engine takes those patterns and checks the entire input, and only outputs results after it's finished processing the input. DFA engines use more CPU-cycles and memory for the initialization of a regex, but do not need to backtrack, so they will tend to do individual matches faster. A pure DFA can't provide backreferences. Implementations: awk, egrep, MySQL, flex, lex, Procmail
NFA engines analyze inputs on a first progressively iterating through the regex input. NFA engines support backtracking, which means that they allow options patterns to match. Implementations: Java, grep, .NET, PCRE library, Perl, PHP, Python, Ruby, sed, vi
SOURCES:
Comparison of the performance characteristics of DFA vs NFA.
UPDATES: Added link to DFA vs NFA performance comparison.
COMMENT: Still doesn't make sense to me, but better than no answer at all.
A simple, nontechnical explanation, paraphrased from Jeffrey Friedl's book Mastering Regular Expressions.
CAVEAT:
While this book is generally considered the "regex bible", there appears some controversy as to whether the distinction made here between DFA and NFA is actually correct. I'm not a computer scientist, and I don't understand most of the theory behind what really is a "regular" expression, deterministic or not. After the controversy started, I deleted this answer because of this, but since then it has been referenced in comments to other answers. I would be very interested in discussing this further - can it be that Friedl really is wrong? Or did I get Friedl wrong (but I reread that chapter yesterday evening, and it's just like I remembered...)?
Edit: It appears that Friedl and I are indeed wrong. Please check out Eamon's excellent comments below.
Original answer:
A DFA engine steps through the input string character by character and tries (and remembers) all possible ways the regex could match at this point. If it reaches the end of the string, it declares success.
Imagine the string AAB
and the regex A*AB
. We now step through our string letter by letter.
A
:- First branch: Can be matched by
A*
. - Second branch: Can be matched by ignoring the
A*
(zero repetitions are allowed) and using the secondA
in the regex.
- First branch: Can be matched by
A
:- First branch: Can be matched by expanding
A*
. - Second branch: Can't be matched by
B
. Second branch fails. But: - Third branch: Can be matched by not expanding
A*
and using the secondA
instead.
- First branch: Can be matched by expanding
B
:- First branch: Can't be matched by expanding
A*
or by moving on in the regex to the next tokenA
. First branch fails. - Third branch: Can be matched. Hooray!
- First branch: Can't be matched by expanding
A DFA engine never backtracks in the string.
An NFA engine steps through the regex token by token and tries all possible permutations on the string, backtracking if necessary. If it reaches the end of the regex, it declares success.
Imagine the same string and the same regex as before. We now step through our regex token by token:
A*
: MatchAA
. Remember the backtracking positions 0 (start of string) and 1.A
: Doesn't match. But we have a backtracking position we can return to and try again. The regex engine steps back one character. NowA
matches.B
: Matches. End of regex reached (with one backtracking position to spare). Hooray!
DFAs and NFAs have exactly the same capabilities and limitations. The only difference is notational convenience.
A finite automaton is a processor that has states and reads input, each input character potentially setting it into another state. For example, a state might be "just read two Cs in a row" or "am starting a word". These are usually used for quick scans of text to find patterns, such as lexical scanning of source code to turn it into tokens.
A deterministic finite automaton is in one state at a time, which is implementable. A nondeterministic finite automaton can be in more than one state at a time: for example, in a language where identifiers can begin with a digit, there might be a state "reading a number" and another state "reading an identifier", and an NFA could be in both at the same time when reading something starting "123". Which state actually applies would depend on whether it encountered something not numeric before the end of the word.
Now, we can express "reading a number or identifier" as a state itself, and suddenly we don't need the NFA. If we express combinations of states in an NFA as states themselves, we've got a DFA with a lot more states than the NFA, but which does the same thing.
It's a matter of which is easier to read or write or deal with. DFAs are easier to understand per se, but NFAs are generally smaller.