tags:

views:

233

answers:

2

In Python Regular Expressions,

re.compile("x"*50000)

gives me OverflowError: regular expression code size limit exceeded

but following one does not get any error, but it hits 100% CPU, and took 1 minute in my PC

>>> re.compile(".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000)
<_sre.SRE_Pattern object at 0x03FB0020>

Is that normal?

Should I assume, ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 is shorter than "x"*50000?

Tested on Python 2.6, Win32

UPDATE 1:

It Looks like ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 could be reduce to .*?

So, how about this one?

re.compile(".*?x"*50000)

It does compile, and if that one also can reduce to ".*?x", it should match to string "abcx" or "x" alone, but it does not match.

So, Am I missing something?

UPDATE 2:

My Point is not to know max limit of regex source strings, I like to know some reasons/concepts of "x"*50000 caught by overflow handler, but not on ".*?x"*50000.

It does not make sense for me, thats why.

It is something missing on overflow checking or Its just fine or its really overflowing something?

Any Hints/Opinions will be appreciated.

+4  A: 

The difference is that ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 can be reduced to ".*?", while "x"*50000 has to generate 50000 nodes in the FSM (or a similar structure used by the regex engine).

EDIT: Ok, I was wrong. It's not that smart. The reason why "x"*50000 fails, but ".*?x"*50000 doesn't is that there is a limit on size of one "code item". "x"*50000 will generate one long item and ".*?x"*50000 will generate many small items. If you could split the string literal somehow without changing the meaning of the regex, it would work, but I can't think of a way to do that.

Lukáš Lalinský
Thanks, but how about `re.compile(".*?x"*50000)`?
S.Mark
I don't really know about internals about the Python regex engine, so I'm not sure. The regex should match 50000 x's with any characters between them. I don't know what optimalizations does it do, but it's likely that it does something special about the regex. FYI, all of the regexes work on Linux.
Lukáš Lalinský
Thanks for info about linux version, I just checked _sre.c -> _compile function, there is no specific code for windows, so it may be because of some size different like `wchar_t` and/or your python is compiled with `Unicode=UCS4`
S.Mark
`"x"*50000 will generate one long item and ".*?x"*50000 will generate many small items`, Thank you, that make sense.
S.Mark
+2  A: 

you want to match 50000 "x"s , correct??? if so, an alternative without regex

if "x"*50000 in mystring:
    print "found"

if you want to match 50000 "x"s using regex, you can use range

>>> pat=re.compile("x{50000}")
>>> pat.search(s)
<_sre.SRE_Match object at 0xb8057a30>

on my system it will take in length of 65535 max

>>> pat=re.compile("x{65536}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/re.py", line 188, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.6/re.py", line 241, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python2.6/sre_compile.py", line 529, in compile
    groupindex, indexgroup
RuntimeError: invalid SRE code
>>> pat=re.compile("x{65535}")
>>>

I don't know if there are tweaks in Python we can use to increase that limit though.

ghostdog74
+1 Thanks, but I am looking for the reason, why for regex
S.Mark
Thanks for the update for the code, but `{65535}` is the repetation limit, thats a bit different with mine. `"x"*50000` and `"x{50000}` is different in my understanding.
S.Mark
"x"*50000 produces 50000 x's.. in a regex, if you put x{50000}, you are telling the regex engine to search for 50000 x's as well... Or am i missing something ? you should clearly state what you want to do in your question with examples...
ghostdog74
@ghostdog74, you are right, I can use x{50000} if I want to match 50000 x. I am just wondering re.compile is safe or not and testing for vulnerablity. thats all.
S.Mark
well, as you can see, string length exceeding 65535 (for my system) crashes the regex engine and produce error. I would suggest checking your string according to your application specs before passing it to re.compile.
ghostdog74