views: 453
answers: 8

My question is:

Is there any regular expression engine that does just-in-time compiling while parsing the regex pattern and then uses the compiled code when matching/replacing text? Or, where can I learn about JIT compilation for the i386 or x64 architecture?

Why I need this:

I have recently been benchmarking Python's built-in regex engine against plain C code on around 10M characters of data.

I found that a plain replace (for example, ab to zzz) is relatively fast, only about 2 to 3 times slower than C,

but [a-z]c takes around 5 to 8 times longer than C,

and with grouping (for example, ([a-z])(c) to AA\2\1BB) it takes 20 to 40 times longer than C.

My C code is not just-in-time compiled yet, but I think that if I could do JIT compiling, it could be a lot faster still.

PS: I classify each regex pattern into a profile while compiling it, e.g. profile 1 for a simple literal like ab, profile 2 for a range like [a-z]c, profile 3 for grouping like ([a-z])(c). Each profile has separate code, so there is no extra cost when matching and replacing simple patterns.
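
For reference, the Python side of my comparison looks roughly like this (a sketch only: the sample text is a stand-in for my real ~10M-character data, the replacement string for the range case is made up, and c_replace is a hypothetical name for my C extension):

    import re
    import time

    # roughly 10M characters of sample text (stand-in for the real data)
    data = "xabcyz" * 1700000

    def bench(label, func):
        start = time.time()
        func()
        print("%-22s %.3fs" % (label, time.time() - start))

    bench("literal ab -> zzz", lambda: re.sub("ab", "zzz", data))
    bench("range [a-z]c -> zzz", lambda: re.sub("[a-z]c", "zzz", data))
    bench("group ([a-z])(c)", lambda: re.sub(r"([a-z])(c)", r"AA\2\1BB", data))

    # the C side calls the hand-coded extension instead, e.g.
    # bench("hand-coded C", lambda: c_replace(data))   # c_replace is hypothetical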

Any ideas would be appreciated. Thanks in advance.

Update 1:

I have tried Psyco, and it does not improve the speed much, probably because I am doing text replacement against big data rather than looping many times. If I am not wrong, Python's re.sub already runs natively, so Psyco cannot improve the speed much.

Update 2:

I have tried Boost.Regex wrapped into Python, but it is even slower than Python's regex, so it seems the bottleneck is in Python's string processing; Jan Goyvaerts also points this out in his answer.

Update 3:

I would like to convert the regex pattern ab[a-z]c to machine code, like the following equivalent C code.

/* s points to the 10M-character text; return 1 at the first match of ab[a-z]c */
int match(const char *s)
{
    do {
        if (s[0] == 'a' && s[1] == 'b' && s[2] >= 'a' && s[2] <= 'z' && s[3] == 'c')
            return 1;
    } while (*s++);
    return 0;
}

Any ideas?

+1  A: 

I'm no Python expert, but you could give Psyco a try:

http://www.ibm.com/developerworks/library/l-psyco.html

http://psyco.sourceforge.net/

Bart Kiers
Thanks, I will try it again.
S.Mark
You're welcome Mark. Note that it may very well be a bit trickier than simply adding a `psyco.jit()` at the top of your script: be sure to read the IBM tutorial. Best of luck!
Bart Kiers
@Bart - I see. I just tried psyco.full(); the ([a-z])(c) pattern became 25x slower (previously 30x) on my current PC with 7.8M characters. I am reading the IBM tutorial now. Thanks again.
S.Mark
+2  A: 

So if I understand you correctly, you use a programming language that by default does not do just-in-time compiling and now you are looking for a regex library that does precisely that?

I think you should compile all of your Python code to binary using e.g. Psyco (see the links and the short sketch below):

http://www.devshed.com/c/a/Python/How-Python-Runs-Programs/4/

also discussed here:

http://stackoverflow.com/questions/138521/is-it-feasible-to-compile-python-to-machine-code

and here:

http://stackoverflow.com/questions/205062/is-it-possible-to-compile-python-natively-beyond-pyc-byte-code
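
Enabling Psyco is usually just a couple of lines at the top of the script (a minimal sketch, assuming Psyco is installed for your interpreter):

    try:
        import psyco
        psyco.full()    # compile every function as it gets executed
    except ImportError:
        pass            # fall back to plain CPython if Psyco is not available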

If these solutions either don't work or are still not fast enough, and you absolutely want to write the rest of your application in Python, there is the Boost.Python C++ library:

http://www.boost.org/doc/libs/1_41_0/libs/python/doc/index.html

The Boost.Python library allows full interoperability between Python and C++. You could then use the Boost.Regex C++ regex matcher:

http://www.boost.org/doc/libs/1_41_0/libs/regex/doc/html/index.html

Sebastian
Thanks, I will test with Psyco enabled again.
S.Mark
>> you are looking for a regex library that does precisely that. Yes, kind of. I just tested with Psyco and it does not actually improve things much. My idea is to translate [a-z]c into machine code that does something similar to the hard-coded C. Thanks.
S.Mark
I've added the reference to boost.python and boost.regex.
Sebastian
Thanks, I will try those.
S.Mark
+1  A: 

The regular expression engine in Firefox compiles some (not all!) regular expressions to machine code. I believe Safari and Chrome do too.

Jason Orendorff
+1  A: 

I don't see it in your question, so I have to ask: did you test with precompiled regular expressions, e.g. re.compile(pattern)?

Compiled regexes should be faster. OK, it is not JIT, but most of the time you are fine with simply precompiling them!

See here:

re.compile
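
For example, with the grouping pattern from the question (a minimal sketch; the data string here is only a placeholder for the real input):

    import re

    data = "abc" * 1000000                     # placeholder for the real ~8M-character input

    pattern = re.compile(r"([a-z])(c)")        # compile the pattern once, up front
    result = pattern.sub(r"AA\2\1BB", data)    # reuse the compiled pattern object for every sub()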

Juergen
Oh OK, I didn't realize that using re.compile is faster; let me test that.
S.Mark
It seems re.compile didn't bring any improvement for me. By the way, I am running re.sub("([a-z])(c)", "AA\\2\\1BB", ...) against 7.8M characters, not running re.compile in a loop millions of times. Thanks anyway.
S.Mark
Oh, I see your problem now more clearly.
Juergen
But you should do `re.compile` **before** the loop, then run the compiled regex within the loop.
Daniel Roseman
>> But you should do re.compile before the loop. Ah, mine is not inside a loop; I meant I am not looping the regex, the test data is just big, around 8M characters. Thanks.
S.Mark
The re module compiles and caches string regexes, so pre-compiling the regex often doesn't change the speed of the matching.
Ned Batchelder
+2  A: 

Another idea: if you have a C library that is better optimized than the Python regex module, or that does just-in-time compilation of regexes, you could write your own regex module for Python that simply wraps that C library.

That of course is somewhat more work and only recommended when you really, really need the speed.

You could also try Cython (personally I have not used it yet, but it sounds rather good) to do the wrapping.
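
As a rough illustration of the wrapping idea, ctypes is a lightweight alternative to a hand-written extension module (a sketch only: the library name and its match function are hypothetical stand-ins for your C code):

    import ctypes

    # load the hand-coded C matcher; libmymatch.so and its match() are hypothetical
    lib = ctypes.CDLL("./libmymatch.so")
    lib.match.argtypes = [ctypes.c_char_p]
    lib.match.restype = ctypes.c_int

    def match(text):
        # ctypes expects a byte string for c_char_p
        if not isinstance(text, bytes):
            text = text.encode("ascii")
        return lib.match(text)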

As far as I understand your problem now, the Python surroundings are not your problem (so I doubt Psyco will help); the preparation of the regex run is not your problem either, but the run itself must be top speed. That of course depends on the library you use and how well it handles large strings. I would think that the standard Python regex library is not optimized for such long strings and top-notch speed.

Juergen
Thanks for the hints. Actually, I have already wrapped my C code into Python and am testing Python's regex against my hard-coded C code as a Python extension.
S.Mark
+2  A: 

I may be wrong, but I believe that Python's regex module is written in C, so any suggestion to compile Python (like using Psyco) would not make much difference: what you're actually comparing is the performance of one C regex library (Python's) with another (whatever library you are using).

Tommy McGuire
Yeah, I am thinking that too. The actual code is in _sre.c, if I am not wrong.
S.Mark
+2  A: 

The only regex engine that I know of that can compile regular expressions into executable code is the one in .NET when you pass RegexOptions.Compiled. That causes the Regex class to emit MSIL, which can then be JITted like any other .NET code.

Whether that makes the .NET regex engine faster than others is a totally different matter. When searching and replacing using relatively simple regular expressions on large data sets, string handling becomes far more important. .NET strings are immutable, so much will depend on how many times the string needs to be reallocated.

Hand-coding the operation will always be faster, because the code isn't equivalent. The regex code maintains certain information about the regex match and the capturing groups which your code does not. In most situations, the extra time you spend hand-coding the search-and-replace instead of using a regex isn't worth the effort, particularly if you factor in that switching to a different regex is trivial when your requirements change, while rewriting the search-and-replace using procedural code takes much more time.

In my experience, PCRE is one of the fastest regex engines around. It doesn't include a ready-made search-and-replace, however.

Jan Goyvaerts
Thanks for the information about the .NET regex engine and the great explanation.
S.Mark
A: 

Thompson published a paper in the Communications of the ACM in 1968 that described a working JIT compiler translating regular expressions into IBM 7094 code. I don't know what language(s) he used; Fortran or LISP would be the obvious suspects, with LISP being especially likely since it already had JIT compiling.

Christopher Creutzig