ansaurus

Question

Answer 1

+1 A:

[^ ... ] will match one character that is not any of ...

So your pattern "[^(abc)]" is saying "match one character that is not a, b, c or the left or right parenthesis"; and indeed that is what happens in your test.
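To see the character class at work, here is a minimal sketch. The class and method names (CharClassDemo, plusOutByCharClass) and the sample input are made up for illustration; only the pattern comes from the question.

```java
class CharClassDemo {
    // Replaces every character matched by the class [^(abc)] with '+'.
    // Inside a character class the parentheses are literal characters,
    // so '(' and ')' are excluded from the match just like a, b and c.
    static String plusOutByCharClass(String input) {
        return input.replaceAll("[^(abc)]", "+");
    }

    public static void main(String[] args) {
        // Only X and Y are replaced; a, b, c and both parentheses survive.
        System.out.println(plusOutByCharClass("abXY(abc)")); // ab++(abc)
    }
}
```

Note that the stray "ab" at the start is left alone too, because the class looks at single characters and knows nothing about the word "abc".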

It is hard to say "replace all characters that are not part of the string 'abc'" in a single simple regular expression. What you might do instead is loop over the input by hand, something like:

while the input string still contains "abc"
   find the next occurrence of "abc"
   append to the output a string containing as many "+"s as there are characters before the "abc"
   append "abc" to the output string
   skip, in the input string, to a position just after the "abc" found
append to the output a string containing as many "+"s as there are characters left in the input
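The loop above could be sketched in Java roughly as follows. This is only one way to write it; the names IndexOfPlusOut, plusOut, input and word are assumptions for the example, and the empty-word check is an addition that the pseudocode does not mention.

```java
class IndexOfPlusOut {
    // Keeps every occurrence of word and turns all other characters into '+'.
    static String plusOut(String input, String word) {
        if (word.isEmpty()) {
            // Assumption: an empty word is rejected, since it would never advance.
            throw new IllegalArgumentException("word must not be empty");
        }
        StringBuilder out = new StringBuilder();
        int pos = 0; // start of the part of the input not yet handled
        int hit;
        while ((hit = input.indexOf(word, pos)) >= 0) {
            // One '+' per character before the occurrence.
            for (int i = pos; i < hit; i++) {
                out.append('+');
            }
            out.append(word);
            pos = hit + word.length(); // skip to just after the occurrence
        }
        // One '+' per character left after the last occurrence.
        for (int i = pos; i < input.length(); i++) {
            out.append('+');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abXYabcXYZ", "abc")); // ++++abc+++
    }
}
```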

Or, if the input alphabet is restricted, you could use regular expressions to do something like:

replace all occurrences of "abc" with a single character that does not occur anywhere in the existing string
replace all other characters with "+"
replace all occurrences of the target character with "abc"

This will be more readable, but may not perform as well.
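A rough sketch of the placeholder idea, assuming the caller can name a character that never occurs in the input. The names PlaceholderPlusOut, plusOut and placeholder are invented for the example, and the second step is written as a plain loop rather than a regular expression to keep it short.

```java
class PlaceholderPlusOut {
    // placeholder is an assumption: a character known not to occur in input.
    static String plusOut(String input, String word, char placeholder) {
        if (placeholder == '+' || input.indexOf(placeholder) >= 0) {
            throw new IllegalArgumentException("placeholder is not usable for this input");
        }
        String marker = String.valueOf(placeholder);

        // Step 1: collapse every occurrence of word into the placeholder (literal replace).
        String marked = input.replace(word, marker);

        // Step 2: every character that is not the placeholder becomes '+'.
        StringBuilder plussed = new StringBuilder();
        for (char c : marked.toCharArray()) {
            plussed.append(c == placeholder ? c : '+');
        }

        // Step 3: expand the placeholder back into word.
        return plussed.toString().replace(marker, word);
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abXYabcXYZ", "abc", '#')); // ++++abc+++
    }
}
```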

moonshadow 2009-10-23 07:36:55

Yeah, that's what I would do. But trying to do it with a regex is a nice puzzle.

Thilo 2009-10-23 07:47:25

Answer 2

A:

Negating regexps is usually troublesome. I think you might want to use negative lookahead and lookbehind. Something like this might work:

String pattern = "(?<!ab).(?!abc)";

I didn't test it, so it may not really work for degenerate cases. And the performance might be horrible too. It is probably better to use a multistep algorithm.

Edit: No I think this won't work for every case. You will probably spend more time debugging a regexp like this than doing it algorithmically with some extra code.
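For anyone who wants to check this, here is a small sketch that runs the pattern above against a sample input. The class name LookaroundAttempt, the method name and the input string are assumptions made for the demonstration.

```java
class LookaroundAttempt {
    // Applies the pattern from this answer and replaces each match with '+'.
    static String apply(String input) {
        return input.replaceAll("(?<!ab).(?!abc)", "+");
    }

    public static void main(String[] args) {
        // The wanted result would be ++++abc+++, but the lookarounds test the
        // neighbours of each character, not the character's own membership in "abc".
        System.out.println(apply("abXYabcXYZ")); // ++XY++c+++
    }
}
```

The X survives because it is preceded by "ab", the Y survives because it is followed by "abc", and the "ab" of the real occurrence gets replaced anyway.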

Mario 2009-10-23 07:38:59

Answer 3

+9 A:

What you are trying to achieve is pretty tough with regular expressions, since there is no way to express “replace strings not matching a pattern”. You will have to use a “positive” pattern, telling what to match instead of what not to match.

Furthermore, you want to replace every character with a replacement character, so you have to make sure that your pattern matches exactly one character. Otherwise, you will replace whole strings with a single character, returning a shorter string.

For your toy example, you can use negative lookaheads and lookbehinds to achieve the task, but this may be more difficult for real-world examples with longer or more complex strings, since you will have to consider each character of your string separately, along with its context.

Here is the pattern for “not ‘abc’”:

[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c

It consists of five sub-patterns, connected with “or” (|), each matching exactly one character:

[^abc] matches every character except a, b or c
a(?!bc) matches a if it is not followed by bc
(?<!a)b matches b if it is not preceded by a
b(?!c) matches b if it is not followed by c
(?<!ab)c matches c if it is not preceded by ab

The idea is to match every character that is not in your target word abc, plus every word character that, according to the context, is not part of your word. The context can be examined using negative lookaheads (?!...) and lookbehinds (?<!...).
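A quick way to try the pattern is a one-line replaceAll. The wrapper below is only a sketch; the names NotAbcPattern, NOT_ABC and plusOut and the test inputs are assumptions, while the pattern itself is the one given above.

```java
class NotAbcPattern {
    // The five alternatives from above; each one matches exactly one character.
    static final String NOT_ABC = "[^abc]|a(?!bc)|(?<!a)b|b(?!c)|(?<!ab)c";

    // Replaces every character that is not part of an "abc" occurrence with '+'.
    static String plusOut(String input) {
        return input.replaceAll(NOT_ABC, "+");
    }

    public static void main(String[] args) {
        System.out.println(plusOut("abXYabcXYZ")); // ++++abc+++
        System.out.println(plusOut("cabcab"));     // +abc++
    }
}
```

Because every alternative consumes a single character, the output always has the same length as the input.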

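You can imagine that this technique gets out of hand once you have a target word containing one character more than once, like example. The e may be either the first or the last letter of the word, so it is pretty hard to express “match e if it is neither followed by xample nor preceded by exampl”, and the other letters need similar care.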

Especially for dynamic patterns, it is by far easier to do a positive search and then replace every character that did not match in a second pass, as others have suggested here.
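One possible shape for the two-pass idea, sketched under the assumption that the target word is a plain literal. The names TwoPassPlusOut, plusOut and keep are hypothetical, and Pattern.quote is used so that the word is not interpreted as a regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassPlusOut {
    static String plusOut(String input, String word) {
        // First pass: positive search, remembering which positions belong to a match.
        boolean[] keep = new boolean[input.length()];
        Matcher m = Pattern.compile(Pattern.quote(word)).matcher(input);
        while (m.find()) {
            for (int i = m.start(); i < m.end(); i++) {
                keep[i] = true;
            }
        }
        // Second pass: every position that did not match becomes '+'.
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            out.append(keep[i] ? input.charAt(i) : '+');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Works for words with repeated letters without any lookaround.
        System.out.println(plusOut("xexamplex", "example")); // +example+
    }
}
```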

Ferdinand Beyer 2009-10-23 07:50:21

Nice explanation. +1

jensgram 2009-10-23 08:37:38

Answer 4

A:

Try to solve it without regular expressions:

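// Assumes text and pattern are Strings and pattern is non-empty;
// an empty pattern would never advance i.
StringBuilder out = new StringBuilder();
int i = 0;
while (i < text.length()) {
    if (text.startsWith(pattern, i)) {
        // Keep the whole occurrence and jump past it.
        out.append(pattern);
        i += pattern.length();
    } else {
        // Any other character becomes a plus sign.
        out.append('+');
        i++;
    }
}
String result = out.toString();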

kgiannakakis 2009-10-23 07:54:50

Answer 5

A:

Rather than a single replaceAll, you could always try something like:

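    @Test
    public void testString() {
        final String in = "abXYabcXYabcHIH";
        final String expected = "xxxxabcxxabcxxx";
        String result = replaceUnwanted(in);
        assertEquals(expected, result);
    }

    private String replaceUnwanted(final String in) {
        final Pattern p = Pattern.compile("(.*?)(abc)([^a]*)");
        final Matcher m = p.matcher(in);
        final StringBuilder out = new StringBuilder();
        // End of the last match. For single-line input the matches are
        // contiguous and start at index 0.
        int end = 0;
        while (m.find()) {
            out.append(m.group(1).replaceAll(".", "x"));
            out.append(m.group(2));
            out.append(m.group(3).replaceAll(".", "x"));
            end = m.end();
        }
        // Mask whatever follows the last match (the whole input if there was none);
        // without this, a trailing "a" or an input without "abc" would be dropped.
        out.append(in.substring(end).replaceAll(".", "x"));
        return out.toString();
    }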

toolkit 2009-10-23 08:00:51

Answer 6

A:

Instead of using replaceAll(...), I'd go for a Pattern/Matcher approach:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static String plusOut(String str, String pattern) {
        StringBuilder builder = new StringBuilder();
        String regex = String.format("((?:(?!%s).)++)|%s", pattern, pattern);
        Matcher m = Pattern.compile(regex).matcher(str.toLowerCase());
        while(m.find()) {
            builder.append(m.group(1) == null ? pattern : m.group().replaceAll(".", "+"));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        String text = "abXYabcXYZ";
        String pattern = "abc";
        System.out.println(plusOut(text, pattern));
    }

}

Note that you'll need to use Pattern.quote(...) if your String pattern contains regex meta-characters.
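As a sketch of what that could look like, here is a variant that quotes the word before building the regex. The names QuotedPlusOut, plusOut and word are assumptions; the variant also leaves out the toLowerCase() call and adds the DOTALL flag so that case and line breaks are handled literally, which are choices made for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedPlusOut {
    static String plusOut(String str, String word) {
        // Quote the word so that characters like '.' or '+' are taken literally.
        String quoted = Pattern.quote(word);
        String regex = "((?:(?!" + quoted + ").)++)|" + quoted;
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(str);
        StringBuilder builder = new StringBuilder();
        while (m.find()) {
            // Group 1 is null when the second alternative (the word itself) matched.
            builder.append(m.group(1) == null ? word : m.group().replaceAll("(?s).", "+"));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // Unquoted, "a.c" would also match "abc"; quoted, only the literal "a.c" is kept.
        System.out.println(plusOut("a.c abc a.c", "a.c")); // a.c+++++a.c
    }
}
```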

Edit: I didn't see a Pattern/Matcher approach was already suggested by toolkit (although slightly different)...

Bart Kiers 2009-10-23 08:50:39

Regular Expression problem in Java
