My question is:
Is there any regular expression engine that does Just-In-Time compilation while parsing the regex pattern, and then uses the compiled code when matching/replacing text? Or where can I learn JIT compilation for the i386 or x64 architecture?
Why I need this:
I recently tried to benchmark Python's built-in regex engine against plain C code on around 10M of data. I found that a plain replacement (for example, ab to zzz) is relatively fast, only about 2 to 3 times slower than C, but [a-z]c takes around 5 to 8 times longer than C, and with grouping (for example, ([a-z])(c) to AA\2\1BB) it takes 20 to 40 times longer than C.
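For reference, the C baseline I am comparing against looks roughly like this (a minimal sketch; the function name and the fixed-size output buffer convention are my own, not from any library):

```c
#include <stdio.h>
#include <string.h>

/* Sketch of the plain C baseline: replace every "ab" with "zzz".
 * `dst` must be large enough for the result (worst case: every two
 * input bytes become three output bytes). Returns the output length. */
size_t replace_ab(const char *src, char *dst)
{
    char *out = dst;
    while (*src) {
        if (src[0] == 'a' && src[1] == 'b') {
            memcpy(out, "zzz", 3);   /* emit the replacement */
            out += 3;
            src += 2;                /* skip the matched "ab" */
        } else {
            *out++ = *src++;         /* copy one unmatched byte */
        }
    }
    *out = '\0';
    return (size_t)(out - dst);
}
```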
It is not Just-In-Time compiling yet, but I think that if I could do JIT compilation, it could be a lot faster.
PS: I profile each regex pattern while compiling it: for example, profile 1 for a simple pattern like ab, profile 2 for a range like [a-z]c, and profile 3 for grouping like ([a-z])(c). Each profile has separate code, so no extra cost is needed when matching and replacing simple patterns.
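The profile idea can be sketched like this (all names here are mine, not from any existing engine): pattern "compilation" classifies the pattern once and returns a specialized matcher, so the match loop pays no dispatch cost per character.

```c
#include <stddef.h>
#include <string.h>

typedef int (*matcher_fn)(const char *s);

/* profile 1: literal "ab" */
static int match_literal_ab(const char *s)
{
    for (; *s; s++)
        if (s[0] == 'a' && s[1] == 'b')
            return 1;
    return 0;
}

/* profile 2: range pattern "[a-z]c" */
static int match_range_c(const char *s)
{
    for (; *s; s++)
        if (s[0] >= 'a' && s[0] <= 'z' && s[1] == 'c')
            return 1;
    return 0;
}

/* "compile": classify the pattern once, return the specialized code path */
matcher_fn compile_pattern(const char *pattern)
{
    if (strcmp(pattern, "ab") == 0)
        return match_literal_ab;
    if (strcmp(pattern, "[a-z]c") == 0)
        return match_range_c;
    return NULL; /* unsupported in this sketch */
}
```

A real engine would classify the parsed pattern structure rather than compare literal strings, but the dispatch principle is the same.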
Any ideas would be appreciated. Thanks in advance.
Update 1:
I have tried psyco, and it does not improve the speed much, probably because I am doing text replacement on big data rather than looping many times. If I am not wrong, Python's re.sub already runs natively, so psyco cannot improve it much.
Update 2:
I have tried Boost.Regex wrapped into Python, but it is even slower than Python's regex engine, so it seems the bottleneck is Python's string processing; Jan Goyvaerts also points this out in his answer.
Update 3:
I would like to convert a regex pattern like ab[a-z]c to machine code, like the following equivalent C code.
    /* s points to the 10M-long text */
    int match(const char *s)
    {
        do {
            if (s[0] == 'a' && s[1] == 'b' &&
                s[2] >= 'a' && s[2] <= 'z' && s[3] == 'c')
                return 1;
        } while (*s++);
        return 0;
    }
Any ideas?