Which Regular Expression Algorithm does Javascript use for Regex?

The ECMAScript language specification doesn't require any particular implementation of regular expressions, so that part of the question isn't well-formed. What you're really asking about is the implementation in a particular browser.

The reason Perl, Python, etc. use a slower algorithm, though, is that the regex language they define isn't really regular expressions. A real regular expression can be expressed as a finite state machine, but the language of regex is context free. That's why the fashion is to just call it "regex" instead of talking about regular expressions.

Update

Yes, in fact JavaScript regex isn't ~~content free~~ regular. Consider the syntax using `{n,m}`, that is, a match of from n to m repetitions of an accepted regex. Let d be the difference d = |n-m|. The syntax means there exists a string u x^d w that is acceptable, but a string u x^k w with k > d that is not. It follows via the pumping lemma for regular languages that this is not a regular language.

(augh. Thinko corrected.)

Hey that's neat. thanks!

leeand00 2009-04-07 21:08:10

That doesn't make it true. I don't think the javascript language is really finite state.

Charlie Martin 2009-04-07 21:09:04

Digging for truth or searching for answers? What is this site all about?

Peter Perháč 2009-04-07 21:11:51

@Charlie @MasterPeter Hmm...maybe that's why it isn't on the list of programs on Wikipedia...(and also why the list lacks browsers...)

leeand00 2009-04-07 21:17:29

In the meantime, I had a look at the syntax. It's definitely not finite state.

Charlie Martin 2009-04-07 21:53:10

@Charlie if so, then people should be informed. Add it to your answer. Make it more visible to others.

Peter Perháč 2009-04-07 22:04:31

@MasterPeter, done.

Charlie Martin 2009-04-07 22:21:51

So the regular expressions in JavaScript aren't full regular expressions...yeah, I think I remember something about that from when I was trying to do a lookbehind in JavaScript and stumbled across an article on faking it instead: http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript

leeand00 2009-04-07 21:13:38

Actually, they're *more* than full regular expressions. "{n,m}", for example, can't be represented in an FSA for arbitrary n,m.

Charlie Martin 2009-04-07 21:51:17

The spec DOES dictate certain capabilities, like backreferences and lookaheads, which aren't possible in DFAs or Thompson NFAs, so it's valid to say that JavaScript regexes are Traditional NFAs.

Alan Moore 2009-04-07 22:24:22

@AlanM Only if a "traditional NFA" isn't actually an NFA. The proof's up there: a javascript regex defines a language that isn't regular.

Charlie Martin 2009-04-07 22:57:11

Of course. If regexes were limited to matching regular languages, they would be an obscure academic topic and not a feature of scores of programming languages, tools and applications.

Alan Moore 2009-04-07 23:28:07

@Charlie Well done Charlie! Wow.

leeand00 2009-04-08 03:39:29

@lee eh, just another grad school flashback. @AlanM actually, that's mistaken too. sed, grep, egrep, vi, etc. (all the things that are handled by the "traditional" Thompson algorithm) *are* regular.

Charlie Martin 2009-04-08 04:23:09

Your "update" about {n,m} is wrong. x{3,5} can be written as xxx|xxxx|xxxxx which is perfectly regular and handled perfectly well with a DFA engine.

Jan Goyvaerts 2009-04-08 09:41:27
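
The rewrite of x{3,5} as an alternation can be checked mechanically. The following is only a minimal sketch: `expandBounded` is a hypothetical helper introduced here for illustration, and it assumes the repeated item is a plain literal with no regex metacharacters.

```javascript
// Hypothetical helper (not from the thread): expand literal{n,m}
// into an explicit alternation of fixed-length repetitions.
// Assumes `literal` contains no regex metacharacters.
function expandBounded(literal, n, m) {
  const alternatives = [];
  for (let k = n; k <= m; k++) {
    alternatives.push(literal.repeat(k));
  }
  return alternatives.join("|");
}

const expanded = expandBounded("x", 3, 5);
const viaQuantifier = new RegExp("^x{3,5}$");
const viaAlternation = new RegExp("^(?:" + expanded + ")$");

// Compare both patterns on every run of x's from length 0 to 8.
const agreement = [];
for (let len = 0; len <= 8; len++) {
  const s = "x".repeat(len);
  agreement.push(viaQuantifier.test(s) === viaAlternation.test(s));
}

console.log(expanded);
console.log(agreement.every(Boolean));
```

Agreement on a handful of inputs is not a proof, but the expansion uses only concatenation and alternation, which is the point being made in the comment above.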

Many XML Schema validators transform {n,m} into NFA; there are also other efficient transformations - see http://xtech06.usefulinc.com/schedule/detail/118 . It's the back-refs which are the killer.

Pete Kirkham 2009-04-08 10:26:04

@Charlie, I may sound a little naive on this one, but what makes an expression regular?

leeand00 2009-04-08 12:14:04

@Pete Whenever I tried to do back refs (lookbehind) using the article that I mentioned, all it did was reverse the string, run a lookahead, and then un-reverse the string...I guess it depends on the size of the string, but I can't see how that would take up so much time.

leeand00 2009-04-08 12:16:23
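
The reverse-then-lookahead trick described in the comment above can be sketched roughly as follows. This is not the code from the linked article; `reverseAscii` and `indexesOfBarAfterFoo` are hypothetical names, and the sketch assumes ASCII input so that string indexes survive the reversal. Engines that implement ES2018 accept `(?<=foo)bar` directly, which makes the workaround unnecessary there.

```javascript
// Hypothetical helper: reverse a string (assumes ASCII input).
function reverseAscii(s) {
  return s.split("").reverse().join("");
}

// Find every "bar" immediately preceded by "foo", using only a lookahead
// on the reversed text.
function indexesOfBarAfterFoo(text) {
  const reversed = reverseAscii(text);
  // Reversed target "rab", with the reversed context "oof" as a lookahead.
  const pattern = /rab(?=oof)/g;
  const indexes = [];
  let match;
  while ((match = pattern.exec(reversed)) !== null) {
    // Map the match start in the reversed string back to the original.
    indexes.push(text.length - (match.index + match[0].length));
  }
  // Matches were found right-to-left, so put them back in reading order.
  return indexes.reverse();
}

console.log(indexesOfBarAfterFoo("foobar bar xfoobar")); // [ 3, 15 ]
```

The extra cost is two linear passes to reverse the string plus the index arithmetic, which fits the observation that the overhead depends mostly on the length of the input.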

@Pete, read the first graf: "XML content models are a form of extended context-free grammar, ..." Context free is not finite.

Charlie Martin 2009-04-08 13:41:08

@Jan, that's incorrect. While any bounded example is finite, e.g. {3,5}, there's no upper bound in the grammar. You tell me the number of states in your machine, and I'll construct two languages that it can't distinguish.

Charlie Martin 2009-04-08 13:43:31

@lee, that's bigger than I can do in a comment. Have a look at Wikipedia: http://en.wikipedia.org/wiki/Regular_language Basically, a regular language is one that can be recognized by a finite state machine.

Charlie Martin 2009-04-08 13:47:36

The unbounded x{3,} can be rewritten as xxxx* which is regular and can be implemented with a DFA with 4 states. Try it at http://osteele.com/tools/reanimator/

Jan Goyvaerts 2009-04-10 06:38:41
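
For illustration, the four-state machine for x{3,} can be written out by hand. This is a sketch, not the output of the tool linked above: `acceptsAtLeastThreeX` is a hypothetical name, and the early `return false` stands in for an extra dead state that would be needed once the alphabet contains anything other than x.

```javascript
// States 0..3 count how many x's have been seen, capped at 3.
// State 3 is the only accepting state and loops on further x's.
function acceptsAtLeastThreeX(input) {
  let state = 0;
  for (const ch of input) {
    if (ch !== "x") return false; // implicit dead state for other characters
    if (state < 3) state += 1;
  }
  return state === 3;
}

// Cross-check against the engine's own answer for x{3,}.
const pattern = /^x{3,}$/;
const sameVerdict = ["", "x", "xx", "xxx", "xxxx", "xxxxxxxx", "xxxy"].every(
  (s) => acceptsAtLeastThreeX(s) === pattern.test(s)
);
console.log(sameVerdict);
```

The machine never needs more memory than a counter that saturates at 3, which is what makes the language regular regardless of how long the input is.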

And if you say that x{7,1238103284} might be a problem, it's not. The state machine will simply be a bit larger: 1238103285 states.

Jan Goyvaerts 2009-04-10 06:43:23

@ Charlie and Jan: I think Jan is correct here. a{3,5} implies aaa|aaaa|aaaaa which is a valid DFA, thus qualifying to be a regular language. Correct use of the pumping lemma is in the process of being verified.

Unknown 2009-04-10 07:23:22

@ Charlie: Say that you have x{3,5}. If you set the pumping length to be 6 = 5+1, then the lemma places no constraint on any matching string in the language L, because they are all shorter than the pumping length.

Unknown 2009-04-10 08:05:06

Guys, look. @Jan aaaa* is a regular language, as is any string constructed with catenation and Kleene star. Therefore it can be recognized by a DFA. @unk, maybe the issue is that any fixed n,m is finite strings, but arbitrary n,m aren't.

Charlie Martin 2009-04-10 15:03:56

@Charlie, but when you define n and m for your language, it must be finite. And once you use a Kleene star, then it will always be possible to split the a's such that the middle portion can be repeated due to Kleene repetition.

Unknown 2009-04-10 19:13:42

Backreferences rule out DFA (deterministic finite automaton), but there are other ways to solve the problem (e.g. recursive backtracking). Perl uses memoized backtracking recursion which removes a lot of the downsides to recursive backtracking (still eats a lot of memory on certain patterns though).

Chas. Owens 2009-04-08 11:56:37
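
A small illustration of why backreferences are the feature that leaves regular languages behind: the pattern below is an example chosen here, not one taken from the thread. It accepts exactly the strings made of n a's, a single b, and then the same n a's again, and recognizing that requires remembering n, which a fixed number of states cannot do for unbounded n.

```javascript
// \1 must repeat exactly what the first group captured.
const sameCountOnBothSides = /^(a+)b\1$/;

const results = {
  aba: sameCountOnBothSides.test("aba"),
  aaabaaa: sameCountOnBothSides.test("aaabaaa"),
  aaabaa: sameCountOnBothSides.test("aaabaa"),
  aabaaa: sameCountOnBothSides.test("aabaaa"),
};

console.log(results);
// { aba: true, aaabaaa: true, aaabaa: false, aabaaa: false }
```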
