views:

648

answers:

4

I was reading this article today on two different regular expression algorithms.

According to the article old Unix tools like ed, sed, grep, egrep, awk, and lex, all use what's called the Thompson NFA algorithm in their regular expresssions...

However newer tools like Java, Perl, PHP, and Python all use a different algorithm for their regular expressions that are much, much slower.

This article makes no mention at all of Javascript's regex algorthim, (and yes I know there are various JS engines out there) but I was wondering if anybody knew which of those algorithms they use, and if maybe those algorithms should be swapped out for Thompson NFA.

+7  A: 

Looking at Wikipedia I found out it's "Traditional NFA". A simple question deserves simple answer :)

Peter Perháč
Hey that's neat. thanks!
leeand00
That doesn't make it true. I don't think the javascript language is really finite state.
Charlie Martin
Digging for truth or searching for answers? What is this site all about?
Peter Perháč
@Charlie @MasterPeter Hmm...maybe that's why it isn't on the list of programs on Wikipedia...(and also why the list lacks browsers...)
leeand00
In the mean time, had a look at the syntax. It's definitely not finite state.
Charlie Martin
@Charlie if so, then people should be informed. Add it to your answer. Make it more visible to others.
Peter Perháč
@MasterPeter, done.
Charlie Martin
+3  A: 

Perl uses a memoized recursive backtracking search and, as of some improvements in 5.10, no longer blows up on perl -e '("a" x 100000) =~ /^(ab?)*$/;'. In recent tests I performed on an OS X box, Perl 5.10 outperformed awk, even in the cases where awk's algorithm was supposed to be better.

Chas. Owens
So that article is most likely outta date then...
leeand00
+7  A: 

The Javascript ECMA language description doesn't impose a requirement for the particular implementation of regular expressions, so that part of the question isn't well-formed. You're really wondering about the particular implementation in a particular browser.

The reason Perl/Python etc use a slower algorithm, though, is that the regex language defined isn't really regular expressions. A real regular expression can be expressed as a finite state machine, but the language of regex is context free. That's why the fashion is to just call it "regex" instead of talking about regular expressions.

Update

Yes, in fact javascript regex isn't content free regular. Consider the syntax using `{n,m}', that is, matches from n to m accepted regexs. Let d the difference d=|n-m|. The syntax means there exists a string uxdw that is acceptable, but a string uxk>dw that is not. It follows via the pumping lemma for regular languages that this is not a regular language.

(augh. Thinko corrected.)

Charlie Martin
So the regular expressions in Javascript aren't full regular expressions...yeah, I think remember something about that when I was trying to do a lookbehind in Javascript and stumbled across an article on it faking it instead: http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript
leeand00
Actually, they're *more* than full regular expressions. "{n,m}", for example, can't be represented in an FSA for arbitrary n,m.
Charlie Martin
The spec DOES dictate certain capabilities, like backreferences and lookaheads, which aren't possible in a DFA's or Thompson NFA's, so it's valid to say that JavaScript regexes are Traditional NFA's.
Alan Moore
@AlanM Only if a "traditional NFA" isn't actually an NFA. The proof's up there: a javascript regex defines a language that isn't regular.
Charlie Martin
Of course. If regexes were limited to matching regular languages, they would be an obscure academic topic and not a feature of scores of programming languages, tools and applications.
Alan Moore
@Charlie Well done Charlie! Wow.
leeand00
@lee eh, just another grad school flashback. @AlanM actually, that's mistaken too. sed, grep, egrep vi, etc -- all the things that are handled by the "traditional" Thompson algorithm -- *are* regular.
Charlie Martin
Your "update" about {n,m} is wrong. x{3,5} can be written as xxx|xxxx|xxxxx which is perfectly regular and handled perfectly well with a DFA engine.
Jan Goyvaerts
Many XML Schema validators transform {n,m} into NFA; there are also other efficient transformations - see http://xtech06.usefulinc.com/schedule/detail/118 . It's the back-refs which are the killer.
Pete Kirkham
@Charlie, I may sound a little nieve on this one, but what makes an expression regular?
leeand00
@Pete Whenever I tried to do that back refs (look behind) with the article that I mentioned, all they did was reverse the string, and run a look ahead, and then un-reverse the string...I guess it depends on the size of the string, but I can't see how that would take up so much time.
leeand00
@Pete, read the first graf: "XML content models are a form of extended context-free grammar, ..." Context free is not finite.
Charlie Martin
@Jan, that's incorrect. While any bounded example if finite, eg, {3,5}. there's no upper bound in the grammar. You tell me the number of states in your machine, I'll construct two languages that it can't distinguish
Charlie Martin
@lee, that's a bigger than I can do in a comment. Have a look at the wikipedia: http://en.wikipedia.org/wiki/Regular_language Basically, a regular language is one that can be recognized by a finite state machine.
Charlie Martin
The unbounded x{3,} can be rewritten as xxxx* which is regular and can be implemented with a DFA with 4 states. Try it at http://osteele.com/tools/reanimator/
Jan Goyvaerts
And if you say that x{7,1238103284} might be a problem, it's not. The state matchine will simply be a bit larger: 1238103285 states.
Jan Goyvaerts
@ Charlie and Jan: I think Jan is correct here. a{3,5} implies aaa|aaaa|aaaaa which is a valid DFA, thus qualifying to be a regular language. Correct use of the pumping lemma is in the process of being verified.
Unknown
@ Charlie: Say that you have x{3,5}. If you set the maximum pumping length to be 6 = 5+1 then, for all matching strings in the language L, they will not need to be confined to the lemmas due to the fact that they are all under the pumping length.
Unknown
Guys, look. @Jan aaaa* is a regular language, as is any string constructed with catenation and Kleene star. Therefore it can be recognized by a DFA. @unk, maybe the issue is that any fixed n,m is finite strings, but arbitrary n,m aren't.
Charlie Martin
@Charlie, but when you define n and m for your language, it must be finite. And once you use a Kleene star, then it will always be possible to split the a's such that the middle portion can be repeated due to Kleene repetition.
Unknown
A: 

Though the ECMA standard does not specify the algorithm an ECMAScript implementation should use, the fact that the standard mandates that ECMAScript regular expressions must support backreferences (\1, \2, etc.) rules out the DFA and "Thompson NFA" implementations.

Jan Goyvaerts
Backreferences rule out DFA (deterministic finite automaton), but there are other ways to solve the problem (e.g. recursive backtracking). Perl uses memoized backtracking recursion which removes a lot of the downsides to recursive backtracking (still eats a lot of memory on certain patterns though).
Chas. Owens