answers:

1
+3  A: 

I don't think regex is the best tool for the job, but if you just want to tweak and optimize what you have right now, you can use the word boundary \b, throw away the unnecessary capturing group and optional repetition specifier, and use possessive repetition:

\bworld\b(?![^<>]*+>)

The \bworld\b ensures that "world" is surrounded by zero-width word boundary anchors. This prevents it from matching the "world" in "underworld" and "worldwide". Do note that the word boundary definition may not be exactly what you want: the underscore counts as a word character, so \bworld\b will not match the "world" in "a_world_domination".
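As a quick check of the word boundary behaviour described above, here is a minimal sketch. The class name WordBoundaryDemo and the helper countMatches are made up for illustration; only the pattern \bworld\b comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Counts how many times \bworld\b matches in the input.
    // (Hypothetical helper, not part of the original question.)
    static int countMatches(String input) {
        Matcher m = Pattern.compile("\\bworld\\b").matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("hello world"));        // 1
        System.out.println(countMatches("underworld"));         // 0: no boundary between 'r' and 'w'
        System.out.println(countMatches("worldwide"));          // 0: no boundary between 'd' and 'w'
        System.out.println(countMatches("a_world_domination")); // 0: '_' is a word character
        System.out.println(countMatches("world-class"));        // 1: '-' is not a word character
    }
}
```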

The original pattern also contains a subpattern that looks like (x+)?. This is probably better formulated as simply x*. That is, instead of "zero-or-one" ? of "one-or-more" +, simply "zero-or-more" *.
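To illustrate that the two formulations accept the same strings, here is a small sketch using a literal x as the repeated element. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class OptionalPlusDemo {
    // True if "(x+)?" and "x*" agree on whether the whole input matches.
    // (Hypothetical helper for illustration.)
    static boolean sameResult(String input) {
        return Pattern.matches("(x+)?", input) == Pattern.matches("x*", input);
    }

    public static void main(String[] args) {
        String[] samples = {"", "x", "xxxx", "xy", "y"};
        for (String s : samples) {
            // Each line ends in "true": both patterns accept or reject the same inputs.
            System.out.println("\"" + s + "\" -> " + sameResult(s));
        }
    }
}
```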

The capturing group (…) is functionally not needed, and it doesn't seem like you need the capture for any substitution in the replacement, so getting rid of it can improve performance. (When you need the grouping aspect but not the capturing aspect, you can use a non-capturing group (?:…) instead.)
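The difference between capturing and non-capturing groups can be seen from the number of groups the engine has to track. A minimal sketch, with a hypothetical helper named groups:

```java
import java.util.regex.Pattern;

class GroupCountDemo {
    // Returns the number of capturing groups declared in the regex.
    // (Hypothetical helper for illustration.)
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groups("(x+)?"));   // 1: the parentheses capture
        System.out.println(groups("(?:x+)?")); // 0: grouping only, nothing captured
        System.out.println(groups("x*"));      // 0: no group at all
    }
}
```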

Note also that instead of [^<], we now forbid both brackets with [^<>]. Because [^<>] can never consume the closing > itself, the repetition never has to give characters back, so it can be specified as possessive (*+) with no backtracking required.

(The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.)
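The following sketch shows why the character class has to exclude > before the repetition can be made possessive. The class name, the helper found, and the sample input are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // True if the regex matches anywhere in the input.
    // (Hypothetical helper for illustration.)
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Roughly what follows "world" inside an attribute value: a quote, then the closing bracket.
        String tail = "\">text";

        // Greedy: [^<]* first swallows the '>' too, then backtracks to let the final '>' match.
        System.out.println(found("[^<]*>", tail));    // true

        // Possessive with [^<]: the '>' is swallowed and never given back, so the match fails.
        System.out.println(found("[^<]*+>", tail));   // false

        // Possessive with [^<>]: the class stops before '>', so no backtracking is needed.
        System.out.println(found("[^<>]*+>", tail));  // true
    }
}
```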

Of course (?!…) is negative lookahead; it asserts that a given pattern can NOT be matched. So the overall pattern reads like this:

\bworld\b(?![^<>]*+>)
\_______/\__________/ NOT the case that
 "world"                      the first bracket to its right is a closing one
 surrounded by
 word boundary anchors
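Putting the pieces together, here is a minimal sketch that applies the pattern with String.replaceAll. The sample string is the one from the follow-up comment; the class and method names are hypothetical. The backslashes are doubled because the pattern is written as a Java string literal.

```java
class TagAwareReplaceDemo {
    // Replaces standalone "world" unless the first bracket to its right is '>'.
    // (Hypothetical helper for illustration.)
    static String replaceOutsideTags(String input, String replacement) {
        return input.replaceAll("\\bworld\\b(?![^<>]*+>)", replacement);
    }

    public static void main(String[] args) {
        String html = "world worldwide <a href=\"world\">world</world>underworld world";
        System.out.println(replaceOutsideTags(html, "repl"));
        // prints: repl worldwide <a href="world">repl</world>underworld repl
    }
}
```

The "world" inside href="world" and inside </world> is left alone because the first bracket to its right is >, while "worldwide" and "underworld" are skipped by the word boundaries.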

Note that to get a backslash in a Java string literal, you need to double it, so the whole pattern as a Java string literal is "\\bworld\\b(?![^<>]*+>)".
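To see what goes wrong with a single backslash, here is a small sketch. In a Java string literal "\b" is the escape for the backspace character (U+0008), so the regex engine never sees a word boundary. The class and method names are hypothetical.

```java
class BackslashEscapeDemo {
    // Single backslash: the compiler turns "\b" into a backspace character,
    // so the pattern looks for backspace + "world" + backspace.
    static String withSingleBackslash(String input) {
        return input.replaceAll("\bworld\b", "repl");
    }

    // Doubled backslash: the regex engine receives \b, the word boundary.
    static String withDoubledBackslash(String input) {
        return input.replaceAll("\\bworld\\b", "repl");
    }

    public static void main(String[] args) {
        System.out.println("\\b".length());                       // 2: backslash and 'b'
        System.out.println("\b".length());                        // 1: a single backspace character
        System.out.println(withSingleBackslash("hello world"));   // hello world (unchanged)
        System.out.println(withDoubledBackslash("hello world"));  // hello repl
    }
}
```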

polygenelubricants
I tried: String str = "world worldwide <a href=\"world\">world</world>underworld world"; str = str.replaceAll("\bworld\b(?!([^<]+)?>)", "repl"); System.out.println(str); but it does not work...
celsowm
thanks ! and sorry for my silliness...
celsowm
@polygenelubricants: ok ! and thanks for the excellent explanation
celsowm
@polygenelubricants I ran a test with composite terms and unfortunately a new problem appeared...
celsowm
@celsowm: Please clarify the edit again, perhaps in a new question. Be very clear about what it is that you want to do (the specification), what it is that you've done (the attempt), and how you're not getting the result you want (the problem). If your specification is clear enough, you may also get non-regex solutions that may suit you better.
polygenelubricants