I would like to let my users use regular expressions for some features. I'm curious what the implications are of passing user input to re.compile(). I assume there is no way for a user to give me a string that could let them execute arbitrary code. The dangers I have thought of are:

  1. The user could pass input that raises an exception.
  2. The user could pass input that causes the regex engine to take a long time, or to use a lot of memory.

The solution to 1. is easy: catch exceptions. I'm not sure if there is a good solution to 2. Perhaps just limiting the length of the regex would work.
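
For example, for point 1 I was planning something like this (compile_user_pattern is just a name I made up):

    import re

    def compile_user_pattern(user_pattern):
        """Compile a user-supplied pattern, or return None if it is invalid."""
        try:
            return re.compile(user_pattern)
        except re.error:
            # e.g. an unbalanced parenthesis like "(foo"
            return None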

Is there anything else I need to worry about?

A: 

I really don't think it is possible to execute code simply by passing it to re.compile(). The way I understand it, re.compile (like any regex engine in any language) converts the regex string into a finite automaton (DFA or NFA), and despite the ominous name 'compile' it has nothing to do with executing any code.

MAK
A: 

You technically don't need to use re.compile() to perform a regular expression operation on a string. In fact, the compile method can often be slower if you're only executing the operation once since there's overhead associated with the initial compiling.

If you're worried about the word "compile" then avoid it altogether and simply pass the raw expression to match, search, etc. You may wind up improving the performance of your code slightly anyway.
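
That is, something like this (the pattern and string here are just placeholders):

    import re

    # Module-level functions accept the pattern string directly,
    # so no explicit re.compile() call is needed.
    if re.search(r"\d{4}", "released in 2009"):
        print("found a year")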

Soviut
I think this is somewhat beside the point. To perform the actual search, `match` would have to do the compile step anyway, which is what the OP is worried about.
wds
+1  A: 

It's not necessary to use compile() except when you need to reuse a lot of different regular expressions: the module already caches the most recently used patterns.

Point 2 (at execution time) could be a very difficult one if you allow the user to input any regular expression. You can build a pathological regex with only a few characters, like the famous (x+x+)+y one. I think it's a problem that has yet to be solved in a general way. A workaround could be to run the match in a different thread and monitor it; if it exceeds the allowed time, kill the thread and return an error.

Khelben
+3  A: 

Compiling the regular expression should be reasonably safe. Although what it compiles to is not strictly an NFA (backreferences mean it's not quite that clean), it should still be fairly straightforward to compile.

Now as to performance characteristics, this is another problem entirely. Even a small regular expression can have exponential time characteristics because of backtracking. It might be better to define a certain subset of features and only support very limited expressions that you translate yourself.

If you really want to support general regular expressions you either have to trust your users (sometimes an option) or limit the amount of space and time used. I believe that space used is determined only by the length of the regular expression.

edit: As Dave notes, apparently the global interpreter lock is held during regex matching, which would make setting that timeout harder. If that is the case, your only option to set a timeout is to run the match in a separate process. While not exactly ideal it is doable. I completely forgot about multiprocessing. Point of interest is this section on sharing objects. If you really need the hard constraints, separate processes are the way to go here.
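
Just to sketch what I mean (the function and parameter names here are made up, and on platforms that spawn rather than fork you would need to call this from under an `if __name__ == "__main__":` guard):

    import re
    from multiprocessing import Process, Queue

    def _match_worker(pattern, text, queue):
        # Runs in a child process, so a runaway match cannot hang the main program.
        queue.put(bool(re.search(pattern, text)))

    def search_with_timeout(pattern, text, timeout=1.0):
        """Return True/False for a match, or None if the regex took too long."""
        queue = Queue()
        proc = Process(target=_match_worker, args=(pattern, text, queue))
        proc.start()
        proc.join(timeout)
        if proc.is_alive():
            proc.terminate()  # hard-kill the runaway match
            proc.join()
            return None
        return queue.get()

The hard kill is the thing you cannot get with a thread, which is the whole point of paying the process-spawning overhead.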

wds
Using a separate thread to implement a timeout does not work since python holds the GIL while doing a match - see my answer. Even if you patched re to release the GIL you need to add some way of killing a thread running a regex - not trivial!
Dave Kirby
My mistake, that is pretty awesomely annoying then. I'll edit my answer to something a bit more vague but possible.
wds
+13  A: 

I have worked on a program that allows users to enter their own regex and you are right - they can (and do) enter regexes that can take a long time to finish - sometimes longer than the lifetime of the universe. What is worse, while processing a regex Python holds the GIL, so it will not only hang the thread that is running the regex, but the entire program.

Limiting the length of the regex will not work, since the problem is backtracking. For example, matching the regex r"(\S+)+x" on a string of length N that does not contain an "x" will backtrack 2**N times. On my system this takes about a second to match against "a"*21 and the time doubles for each additional character, so a string of 100 characters would take approximately 19167393131891000 years to complete (this is an estimate, I have not timed it).
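
You can see the doubling for yourself with something like this (rough, machine-dependent timings - keep n small or you will be waiting a while):

    import re
    import time

    pattern = re.compile(r"(\S+)+x")
    for n in range(15, 22):
        start = time.time()
        pattern.match("a" * n)  # never matches; backtracks roughly 2**n times
        print(n, round(time.time() - start, 3))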

For more information read the O'Reilly book "Mastering Regular Expressions" - this has a couple of chapters on performance.

edit: To get around this we wrote a regex-analysing function that tried to catch and reject some of the more obvious degenerate cases, but it is impossible to catch all of them.

Another thing we looked at was patching the re module to raise an exception if it backtracks too many times. This is possible, but requires changing the Python C source and recompiling, so it is not portable. We also submitted a patch to release the GIL when matching against Python strings, but I don't think it was accepted into the core (Python only holds the GIL because a regex can be run against mutable buffers).

Dave Kirby
+1 for "(this is an estimate, I have not timed it)"
Chinmay Kanchi
I guess I could spawn another process then and kill it if it times out after too long?
Skeletron
Spawning and killing will work, but it adds considerable overhead to every match. Whether that is an acceptable price to pay is up to you.
Dave Kirby
+4  A: 

It's much simpler for casual users if you give them a subset language. The shell's globbing rules in fnmatch are one example; the SQL LIKE condition rules are another.

Translate the user's language into a proper regex for execution at runtime.
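
For example, with globbing you get the translation for free from the standard library (the user_glob name here is just a placeholder):

    import fnmatch
    import re

    # The user types a shell-style glob such as "report_*.txt"; we do the
    # translation to a regex ourselves, so they never hand us a raw regex.
    user_glob = "report_*.txt"
    pattern = re.compile(fnmatch.translate(user_glob))

    print(bool(pattern.match("report_2009.txt")))  # True
    print(bool(pattern.match("summary.txt")))      # False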

S.Lott
+1 the only safe way to go
bobince