tags:

views:

478

answers:

4

Hi to all,

I just using regular expression to remove comment tag in html file (Pattern is: "<!--(.|\s)*?--!?>")

but some site not using standard html tag, sample:

<script language="javascript">
    <!-- 
     js code ...
    </script>

in this case, my regular expression hang, and also try-catch does not catch error. Are you have any idea ?

A: 

You could evaluate the regex in a thread and abort the thread after a given timeout

Thomas Levesque
That does not help with a regex that takes forever. The thread will always abort after the timeout.
Jan Goyvaerts
No, I can't using this way, thanks for your reply
Hamid
+2  A: 

You can rewrite the regex so that it fails as quickly as possible when no match is possible, like so:

<!--(?>(?:[^-]+|-(?!->))*)-->

If the unclosed comment in your example is followed by a complete comment further along, this regex will match from the the first <!-- to the first -->, like this:

<!-- blah <!-- blah -->

That's how your browser is supposed to treat SGML comments. In fact, if there's no matching -->, everything after the <!-- is commented out. So the regex should really be:

<!--(?>(?:[^-]+|-(?!->))*)(?:-->|\z)

But I suspect this isn't really what you want. For a better answer, we'll need to know what you want to do about malformed HTML like the snippet you posted.

Alan Moore
+1  A: 

As Ben mentioned, this failure is due to catastrophic backtracking. Enlightening reading if you are unfamiliar with regex.

patjbs
Dear patjbs, regEx is simple for me, but text that send to it is unkown (every site use standard html or not that is bad for me)
Hamid
+2  A: 

The problem with the performance of your regex is trivial. Don't do this:

(.|\s)*

Whether the quantifier is lazy or greedy is entirely besides the point. The problem is that . and \s are not mutually exclusive. Spaces can be matched by both . and \s. Thus if your regex encounters a space, it will first match the space with ., and if the remainder of the regex fails, it will match it again with \s. If you have two spaces, it will first match both with ., then the first with . and the second with \s, then the frist with \s and the second with ., and then both with \s. As you can see, your regex has a complexity of O(2^N) when it encounters a run of spaces followed by something the remainder of the regex cannot match. If you have 10 spaces, there are 1024 permutations. If you have 32 spaces, there are 4 billion permutations.

The reason you only see the problem when your regex fails is that when the regex succeeds, the . simply matches all the spaces, and \s never gets any of the action.

I know what you're trying to do: you want to match a run of "any" character, including line breaks, which are normally not matched by the dot. The proper solution is to set RegexOptions.SingleLine and use this regex instead:

.*

If you can't set RegexOptions.SingleLine, use this mode modifier to do the same:

(?s).*

If you can't use that mode modifier, e.g. because JavaScript doesn't support it, use a character class with two complementary shorthands:

[\S\s]*

Once you get that horrid (.|\s) alternation out of your regex it will work perfectly. There's no need to use any of the complicated regexes the others suggested. A single lazy quantifier always expands linearly. Alternation that is not mutually exclusive always kills your regex. I indeed call this catastrophic backtracking.

And if you want a regex that allows a tag to terminate a comment, try this:

(?s)<!--.*?(-->|</script>)
Jan Goyvaerts