This was already asked here, but the asker was satisfied with an answer that only handles a two-character string. I'll repeat his basic question:

Generally, is there any way to say "does not contain this string" in the same way that I can say "does not contain this character" with [^a]?

I want to create a regexp that matches two delimiting strings and everything between them, but only if no other occurrence of a given string is found inside. That said, I would be happiest with a general answer to the quoted question.

Example:

The strings are "<script>" and "</script>".

It should match

"<script> something something </script>"

but not

"<script> something <script> something something </script>"
+2  A: 

Please take a look at this question.

S.Mark
Yeah, I didn't find that. It started with matching a line, and I must have skipped reading the rest of it ;)
naugtur
+3  A: 

Did you read my answer to that question? It gives a more general solution. In your case it would look like this:

(?s)<script>(?:(?!</?script>).)*</script>

In other words: match the opening sequence; then match one character at a time, after ensuring that it's not the beginning of the closing sequence; then match the closing sequence.
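
As a rough illustration of the pattern above, here is a minimal JavaScript sketch run against the two strings from the question. It assumes an ES2018-or-later engine and uses the `s` (dotAll) flag in place of the inline `(?s)` prefix, which JavaScript regex literals do not accept; the variable names are made up for the example.

```javascript
// Same pattern as above; the `s` flag lets `.` match line breaks too.
const innermostScript = /<script>(?:(?!<\/?script>).)*<\/script>/s;

const plain = "<script> something something </script>";
const nested = "<script> something <script> something something </script>";

const plainMatch = plain.match(innermostScript);
const nestedMatch = nested.match(innermostScript);

console.log(plainMatch[0]);     // the whole string
console.log(nestedMatch[0]);    // only the inner "<script> something something </script>"
console.log(nestedMatch.index); // 19, where the second <script> starts
```

Because the pattern is not anchored, the second string still produces a match, but only for the innermost pair of tags. Anchoring with ^ and $ would reject that string outright.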

Alan Moore
I still don't understand what is going on in the parentheses and why they don't match, but I'll figure it out. Thanks.
naugtur
This regex has unbalanced parentheses. When I fix the expression, it doesn't match either of the strings.
Otto Allmendinger
@naugtur, I fixed the missing parenthesis. It might still not work, in which case your start and end tags are probably on separate lines. Try prepending `(?s)` to the proposed regex, which will let the DOT meta char also match line breaks: `(?s)<script>(?:(?!</script>).)*</script>`
Bart Kiers
Mea culpa! I should have tested it, even if I *have* posted it a dozen times before. Thanks, Bart.
Alan Moore
No problem Alan, it's comforting to see guys like you also make these (little) mistakes! ;)
Bart Kiers
The negative lookahead should be for `<script>` not `</script>`
Otto Allmendinger
@Otto: actually, it should be for both: `(?!</?script>)`; that matches the innermost set of possibly nested tags. Of course, `<script>` tags shouldn't *be* nested, but apparently the OP isn't really matching those. I should have read the question more closely. Fixing it now.
Alan Moore
In practice it should be for </script>, if you assume the tags make any sense. It's my example that is rather silly ;) The first thing I changed when using it was to look for !</script instead of !<script. If somebody nested a script, it's better to remove all the leading tags.
naugtur
+1  A: 

The correct expression for your problem is

"^<script>((?!<script>).)*</script>$"

This shouldn't be used for HTML manipulation. It doesn't handle cases like

<script> foo <script type="javascript"> bar </script>

and many others. A parser is the correct solution here.

The more general expression for matching strings beginning with START, ending with END without the specific character sequence foobar in-between is:

"^START((?!foobar).)*END$"
Otto Allmendinger
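
The general START/END expression from the answer above could be wrapped in a small JavaScript helper along these lines. This is only a sketch: `escapeRegExp` and `between` are hypothetical names, not part of any library. The delimiters are escaped so that they are treated as literal text, and the `s` flag is an added assumption so that the match can span line breaks.

```javascript
// Escape regex metacharacters so the arguments are treated as literal text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: builds ^START(?:(?!FORBIDDEN).)*END$ from literal strings.
// A non-capturing group is used here; the answer's capturing group behaves the same for matching.
function between(start, end, forbidden) {
  const [s, e, f] = [start, end, forbidden].map(escapeRegExp);
  return new RegExp(`^${s}(?:(?!${f}).)*${e}$`, "s");
}

const re = between("START", "END", "foobar");
console.log(re.test("START some text END"));        // true
console.log(re.test("START some foobar text END")); // false
```

Calling `between("<script>", "</script>", "<script>")` would yield the anchored expression suggested in the answer.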
I tuned it up and the input is a bit different, so there is no need to worry about html content.
naugtur
+1  A: 

Use negative lookahead. Lookarounds are zero-width assertions, meaning that they don't consume any characters in the source string.

var s1 = "some long string with the CENSORED word";
var s2 = "some long string without that word";
// ^(?!.*CENSORED) rejects the string if CENSORED occurs anywhere after the start
console.log(s1.match(/^(?!.*CENSORED).*$/)); // null: no match
console.log(s2.match(/^(?!.*CENSORED).*$/)); // matches the whole string

The syntax for negative lookahead is (?!REGEX). It succeeds only if REGEX does not match at the current position. Positive lookahead, (?=REGEX), succeeds only if REGEX does match there.
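
To make the zero-width behaviour concrete, here is a small sketch contrasting the two kinds of lookahead; the sample strings are invented for illustration.

```javascript
// Positive lookahead: "foo" matches only when "bar" follows, but "bar" is not consumed.
const pos = "foobar".match(/foo(?=bar)/);
console.log(pos[0]); // "foo"

// Negative lookahead: "foo" matches only when "bar" does NOT follow.
const negFollowedByBar = /foo(?!bar)/.test("foobar");
const negFollowedByBaz = /foo(?!bar)/.test("foobaz");
console.log(negFollowedByBar); // false
console.log(negFollowedByBaz); // true
```

In both cases the matched text is just "foo", since the lookahead only inspects what comes next without adding it to the match.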

Amarghosh