We have a configuration file that lists a series of regular expressions used to exclude files for a tool we are building (it scans .class files). The developer has appended all of the individual regular expressions into a single one using the OR "|" operator like this:

rx1|rx2|rx3|rx4

My gut reaction is that there will be an expression that will screw this up and give us the wrong answer. He claims no; they are ORed together. I cannot come up with a case that breaks this, but I still feel uneasy about the implementation.

Is this safe to do?

+1  A: 

It's as safe as anything else in regular expressions!

Cade Roux
-1 What is the point of this answer?
Chris Lutz
The point is that there is nothing inherently unsafe with regular expressions - they either match or they don't. What is unsafe is a system where the regular expression encounters boundary conditions or unexpected/unanticipated input. In this case, the regular expression is very simple, but the author gives no indication of any other controls or specification for the input. Because regular expressions are deterministic, I stand by my answer.
Cade Roux
A: 

As far as regexes go, Google Code Search accepts regexes for searches, so it is possible to have safe regexes.

xxxxxxx
+2  A: 

As long as they are valid regexes, it should be safe. Unclosed parentheses, brackets, braces, etc would be a problem. You could try to parse each piece before adding it to the main regex to verify they are complete.
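
A minimal sketch of that pre-check in Java, assuming the pieces arrive as a list of strings read from the configuration file. The class and method names (ExcludeValidator, invalidPieces) and the sample patterns are made up for illustration. Each piece is compiled on its own, so an unbalanced piece is rejected before it can change the meaning of the joined expression:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class ExcludeValidator {
    // Returns the pieces that do not compile as standalone regexes.
    static List<String> invalidPieces(List<String> pieces) {
        List<String> bad = new ArrayList<>();
        for (String piece : pieces) {
            try {
                Pattern.compile(piece);
            } catch (PatternSyntaxException e) {
                // Unclosed or unmatched parentheses, brackets, etc. end up here.
                bad.add(piece);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Hypothetical exclude list; the second entry has an unclosed group.
        List<String> pieces = Arrays.asList(
                "com/example/.*", "org/(foo|bar", ".*Test\\.class");
        System.out.println("Rejected: " + invalidPieces(pieces));
    }
}
```

Only pieces that pass this check would then be joined with "|".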

Also, some engines have escapes that can toggle regex flags within the expression (like case sensitivity). I don't have enough experience to say if this carries over into the second part of the OR or not. Being a state machine, I'd think it wouldn't.
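
For what it's worth, in java.util.regex an inline flag such as (?i) stays in effect until the end of the enclosing group, so at the top level it does carry over into the alternatives that follow it. The sketch below shows this, along with one possible guard: wrapping each piece in a non-capturing group before joining. The class name, helper names, and sample pieces are assumptions for illustration, not anything from the original tool:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class FlagLeakDemo {
    // Naive join: pieces separated by "|" with no grouping.
    static String joinNaive(List<String> pieces) {
        return String.join("|", pieces);
    }

    // Defensive join: each piece sits in its own non-capturing group,
    // which limits an inline flag to the piece that declared it.
    static String joinGrouped(List<String> pieces) {
        return pieces.stream()
                .map(p -> "(?:" + p + ")")
                .collect(Collectors.joining("|"));
    }

    public static void main(String[] args) {
        // Only the first piece asks for case-insensitive matching.
        List<String> pieces = Arrays.asList("(?i)foo", "bar");
        // Naive join: the flag also applies to "bar".
        System.out.println(Pattern.matches(joinNaive(pieces), "BAR"));
        // Grouped join: the flag stays with "foo".
        System.out.println(Pattern.matches(joinGrouped(pieces), "BAR"));
    }
}
```

Other engines may scope inline flags differently, so this is worth checking against whichever engine the tool actually uses.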

YotaXP
A: 

I don't see any possible problem either.

I guess that by 'safe' you mean that it will match what you need (I've never heard of a regex security hole). Whether it is safe or not, we can't tell from this. You need to give us more detail, such as what the full regex is. Do you wrap it in a group and allow multiple? Do you wrap it in start and end anchors?

If you want to match a few class file names, make sure you use start and end anchors so that the match covers the whole name from start to end, like this: "^(file1|file2)\.class$". Without the anchors, you may end up matching 'my_file1.class' too.
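
A small Java sketch of the difference, using the file names from the example above (the class name AnchorDemo is made up). Note that the alternatives are grouped before the anchors are applied; since "|" binds more loosely than anything else, "^file1|file2$" would anchor each side separately:

```java
import java.util.regex.Pattern;

class AnchorDemo {
    static final Pattern UNANCHORED =
            Pattern.compile("file1\\.class|file2\\.class");
    static final Pattern ANCHORED =
            Pattern.compile("^(?:file1|file2)\\.class$");

    public static void main(String[] args) {
        String name = "my_file1.class";
        // find() looks for a match anywhere in the input.
        System.out.println(UNANCHORED.matcher(name).find());
        // With anchors, only a whole-name match counts.
        System.out.println(ANCHORED.matcher(name).find());
        // matches() requires the entire input to match, anchors or not.
        System.out.println(UNANCHORED.matcher(name).matches());
    }
}
```

So whether the anchors are needed depends on whether the tool calls find() or matches(), which the question does not say.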

NawaMan
Don’t forget to escape the dot.
Gumbo
Thanks, I will edit that :D
NawaMan
+2  A: 

Not only is it safe, it's likely to yield better performance than separate regex matching.

Take the individual regex patterns and test them. If they work as expected then OR them together and each one will still get matched. Thus, you've increased the coverage using one regex rather than multiple regex patterns that have to be matched individually.

Ahmad Mageed
A: 

Yes, this is safe, and the reason is that '|' has the lowest precedence of any operator in regular expressions.

That is:

regexpa|regexpb|regexpc

is equivalent to

(regexpa)|(regexpb)|(regexpc)

with the obvious exception that the second form creates numbered capture groups whereas the first does not. The two match exactly the same input, though. Or, to put it another way in Java parlance, for some String named input:

input.matches("regexpa|regexpb|regexpc");

is equivalent to

input.matches("regexpa") || input.matches("regexpb") || input.matches("regexpc");
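
The equivalence above can be spot-checked with a short program along these lines. The class name, the three sample patterns, and the sample inputs are invented for the sketch, and it assumes the pieces are self-contained, with no inline flags or backreferences:

```java
import java.util.Arrays;
import java.util.List;

class OrEquivalenceDemo {
    // One match against the joined expression.
    static boolean matchesCombined(String input, List<String> pieces) {
        return input.matches(String.join("|", pieces));
    }

    // One match per piece, ORed together in Java.
    static boolean matchesAny(String input, List<String> pieces) {
        for (String piece : pieces) {
            if (input.matches(piece)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> pieces = Arrays.asList(
                ".*Test", "com\\.example\\..*", ".*\\$\\d+");
        for (String input : Arrays.asList(
                "FooTest", "com.example.Foo", "Foo$1", "org.other.Bar")) {
            System.out.println(input + ": combined="
                    + matchesCombined(input, pieces)
                    + ", any=" + matchesAny(input, pieces));
        }
    }
}
```

A check like this only covers the inputs you feed it, so it is a sanity test rather than a proof.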
Paul Wagland