My question is:
Is there any regular expression engine that does Just-In-Time compilation while parsing the regex pattern, and then uses the compiled code when matching/replacing text? Or where can I learn JIT compilation for the i386 or x64 architecture?
Why I need this:
I recently tried to benchmark Python's built-in regex engine against plain C code on around 10M of data. I found that a plain replacement (for example, ab to zzz) is relatively fast, only about 2 to 3 times slower than C, but [a-z]c takes around 5 to 8 times longer than C, and with grouping (for example, ([a-z])(c) to AA\2\1BB) it takes 20 to 40 times longer than C.
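For reference, the C baseline I am comparing against looks roughly like this (a minimal sketch; the function name and the fixed-size output buffer convention are my own, not from any library):

```c
#include <stdio.h>
#include <string.h>

/* Sketch of the plain C baseline: replace every "ab" with "zzz".
 * `dst` must be large enough for the result (worst case: every two
 * input bytes become three output bytes). Returns the output length. */
size_t replace_ab(const char *src, char *dst)
{
    char *out = dst;
    while (*src) {
        if (src[0] == 'a' && src[1] == 'b') {
            memcpy(out, "zzz", 3);   /* emit the replacement */
            out += 3;
            src += 2;                /* skip the matched "ab" */
        } else {
            *out++ = *src++;         /* copy one unmatched byte */
        }
    }
    *out = '\0';
    return (size_t)(out - dst);
}
```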
It is not Just-In-Time compiling yet, but I think that if I could do JIT compilation, it could be a lot faster.
PS: I profile each regex pattern while compiling it: for example, profile 1 for a simple pattern like ab, profile 2 for a range like [a-z]c, and profile 3 for grouping like ([a-z])(c). Each profile has separate code, so no extra cost is needed when matching and replacing simple patterns.
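The profile idea can be sketched like this (all names here are mine, not from any existing engine): pattern "compilation" classifies the pattern once and returns a specialized matcher, so the match loop pays no dispatch cost per character.

```c
#include <stddef.h>
#include <string.h>

typedef int (*matcher_fn)(const char *s);

/* profile 1: literal "ab" */
static int match_literal_ab(const char *s)
{
    for (; *s; s++)
        if (s[0] == 'a' && s[1] == 'b')
            return 1;
    return 0;
}

/* profile 2: range pattern "[a-z]c" */
static int match_range_c(const char *s)
{
    for (; *s; s++)
        if (s[0] >= 'a' && s[0] <= 'z' && s[1] == 'c')
            return 1;
    return 0;
}

/* "compile": classify the pattern once, return the specialized code path */
matcher_fn compile_pattern(const char *pattern)
{
    if (strcmp(pattern, "ab") == 0)
        return match_literal_ab;
    if (strcmp(pattern, "[a-z]c") == 0)
        return match_range_c;
    return NULL; /* unsupported in this sketch */
}
```

A real engine would classify the parsed pattern structure rather than compare literal strings, but the dispatch principle is the same.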
Any ideas would be appreciated. Thanks in advance.
Update 1:
I have tried psyco, and it does not improve the speed much, probably because I am doing text replacement on big data rather than looping many times. If I am not wrong, Python's re.sub already runs natively, so psyco cannot improve it much.
Update 2:
I have tried Boost.Regex wrapped into Python, but it is even slower than Python's regex engine, so it seems the bottleneck is Python's string processing; Jan Goyvaerts also points this out in his answer.
Update 3:
I would like to convert a regex pattern like ab[a-z]c to machine code, like the following equivalent C code.
    /* s points to the 10M-long text */
    int match(const char *s)
    {
        do {
            if (s[0] == 'a' && s[1] == 'b' &&
                s[2] >= 'a' && s[2] <= 'z' && s[3] == 'c')
                return 1;
        } while (*s++);
        return 0;
    }
Any ideas?