ansaurus

Question

Can one use .NET RegEx backreference in a character class to exclude previously matched characters?

Answer 1

+1 A:

The answer is: no. You cannot use backreferences in character classes of .NET regular expressions. Sorry. See below for a workaround for your situation.

"it shows that \p{name} is valid syntax in a character class"

yes, it is. But the .NET documentation does not say that name will be interpreted from a backreference. It must be a unicode literal class string.

"In other words, it doesn't take into account that each letter is distinct; no repeats."

I understand that this means to match all of e f a x and only f and x in e f e x. In other words: match unique characters, do not match characters that are repeated.

Solution

I understand your question as follows: match all unique words (subexpressions, characters) in a string that have no repetition before or after itself. The basic regular expression you should use is this:

(subexpr)(?!.*\1)(?<!\1.+)

which will find the word subexpr only when it appears once in the matching string. for instance, if we change it to match the e in e f a x and not in e f e x, it will look like this:

(e)(?!.*\1)(?<!\1.+)

You can generalize this to match every unique letter in the string:

(.)(?!.*\1)(?<!\1.+)

if will match e, f, a and x in e f a x and only f and x in e f e x. This could be the generalized replacement for your expression above and you don't need to repeat the 1,2,3 etc captures anymore.

How it works

(update) Perhaps nice to know how the above regex works:

(subexpr)   # grab subexpression (can be any valid grouped regex)
(?!.*\1)    # negative look forward with a backrefence: if followed somewhere by itself, fail
(?<!\1.+)   # negative look backward with backref: if preceded somewhere by itself, fail

Applied solution

A word has a pattern. SUCCUBUS is 1 2 3 3 2 4 2 1. PAST is 1 2 3 4. Based on that pattern, a regex should match words with the same pattern: same length of the word, repetition of the same letters in the same place: PAST and RANT have the same pattern. LOOK and HEEL have the same pattern, but not HERE.

Taking the previous solution, we adjust that generally to your problem domain by sticking to the following rules:

A unique letter is represented by (.)(?!.*\X)(?<!\X.+)
A repeated letter is represented by (.)
The location where repetition occurs is represented by \X (no parentheses!)
\X means a backreference with the number of your pattern

Examples:

# SUCCUBUS is 1 2 3 3 2 4 2 1 (only 4 is unique)
(.)                      # nr 1 in pattern
(.)                      # nr 2 in pattern
(.)                      # nr 3 in pattern
\3                       # repeat 3
\2                       # repeat 2
(.)(?!.*\4)(?<!\4.+)     # nr 4 UNIQUE!
\2                       # repeat 2
\1                       # repeat 1

# PAST (all unique: 1 2 3 4)
(.)(?!.*\1)(?<!\1.+)    # nr 1 in pattern
(.)(?!.*\2)(?<!\2.+)    # nr 2 in pattern
(.)(?!.*\3)(?<!\3.+)    # nr 3 in pattern
(.)(?!.*\4)(?<!\4.+)    # nr 4 in pattern

This pattern should be easily automated into your current system.

An excellent way to test this and other regexes (just copy and paste mine) is at Regex Hero, free online SilverLight .NET regex tester. For other online testers, see my overview chart of them.

update: removed earlier irrelevant update note

Update 1: In a comment in the other solution, you say that you want to be able to match a substring that fits the pattern. Naturally, this poses a challenge with the negative look-forward/behind: as they are now, they look through the whole string. Replace the .* and .+ with the relative length the expression would be at the place, pos 3 for PAST becomes then (.)(?!.{1}\3)(?<!\3.{2}) and pos 4 would become (.)(?!.{2}\3)(?<!\3.{3})

Update 2: in the same way, it is possible to optimize slightly by removing the look-back in the first expression and removing the look-forward in the last, if they need to be unique: pos 1 becomes (.)(?!.{3}\3) and pos 4 becomes (.)(?<!\3.{3})

Abel 2009-10-14 17:38:04

if the input is of the form "1 2 3 4", that means we're looking for a 4 letter word where each character is distinct. So yes, in that case, efax would be a match, but efex wouldn't. How do you state that logic in a regex (if possible)?

Roly 2009-10-14 18:12:57

my original solution solved that to some extend. See also my comment on your question. More importantly, check the updated answer, section Applied Solution, which brings my generalized solution into your specific problem domain.

Abel 2009-10-14 18:51:19

fixed some errors. The examples are now copy/paste correct with your use-case.

Abel 2009-10-14 19:18:13

Answer 2

A:

In order to do such a thing, you could use negative look ahead before matching a new group.

I will use a more general PCRE notation:

(.)((?!\1).)((?!\1|\2).)\3\2((?!\1\2\3).)\2\1

The regex above will match the string 12332421 but will not match 12112421 or 11111111.

A short explanation:

(.)           // match any character (except line breaks) and store it in group 1
(             // open group 2
  (?!\1)      //   if looking ahead group 1 cannot be seen,
  .           //   match any character (except line breaks)
)             // close group 2
(             // open group 3
  (?!\1|\2)   //   if looking ahead group 1 or 2 cannot be seen,
  .           //   match any character (except line breaks)
)             // close group 3
\3            // back referencing group 3
\2            // back referencing group 2
(             // open group 4
  (?!\1\2\3)  //   if looking ahead group 1, 2 or 3 cannot be seen,
  .           //   match any character (except line breaks)
)             // close group 4
\2            // back referencing group 2
\1            // back referencing group 1

Of course, you don't really need to group #4 since you're not back referencing it.

You might agree with me that regex is not be the best tool to do this kind of matching...

Edit:

Well, I don't see how you're going to build these regexes, but can't imagine it being easier than this little method which simply accepts a pattern and a target string and tests if they match:

public class Test {

    public static boolean matchesPattern(String text, String pattern) {
        if(text.length() != pattern.length()) return false;
        Map<Character, Character> mappings = new HashMap<Character, Character>();
        for(int index = 0; index < pattern.length(); index++) {
            Character patternChar = pattern.charAt(index);
            Character textChar = text.charAt(index);
            if(mappings.containsKey(patternChar)) {
                if(mappings.get(patternChar) != textChar) return false;
            } 
            else {
                if(mappings.values().contains(textChar)) return false;
                mappings.put(patternChar, textChar);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String pattern = "abccbdba";
        String[] tests = {"12332421", "12112421", "11111111"};
        for(String t : tests) {
            System.out.println(t+" -> "+matchesPattern(t, pattern));
        }
    }
}

which produces the following output:

12332421 -> true
12112421 -> false
11111111 -> false

Bart Kiers 2009-10-14 18:11:23

You mean unrolling? Which is the same problem as with matching parentheses? .NET supports this explicitly with balancing groups. But also consider the look forward / look backward solution I provided, it may work if I got the problem domain correct.

Abel 2009-10-14 18:16:17

I figured regex would be best because in addition to matching a whole word, it could find the pattern as a substring which would require a very convoluted amount of logic to do by matching charAt idx 0 with charAt idx 3, but not equal to charAt idx 1 and 2, etc. Nice solution, btw.

Roly 2009-10-14 18:29:22

Thanks Roly. Also see my edit.

Bart Kiers 2009-10-14 19:09:17

@Abel, not sure what you mean by "unrolling", sorry. I know very little C#, so I'm not familiar with with this "balanced groupings" you mentioned.

Bart Kiers 2009-10-14 19:11:48

no problem. Balanced groups are used to be sure that "[[helo] world]" matches but "[[helo] world" not (i.e., balanced parentheses: balanced groups, hence the name). The traditional approach is "unrolling" (J. Friedl's term) which is cumbersome when any depth of parentheses is involved. I updated my solution to fix it "regex-only" like. I didn't use anything .NET specific in the end.

Abel 2009-10-14 19:21:00

Abel, thanks for the clarification. I seems I haven't read my copy of Friedl's book often enough: the term didn't ring a bell.

Bart Kiers 2009-10-14 19:36:40

By the way, of course "balancing" did ring a bell, but I thought "balancing groups" might have meant something else.

Bart Kiers 2009-10-14 19:37:37

ansaurus

tags:

views:

answers:

Can one use .NET RegEx backreference in a character class to exclude previously matched characters?

Solution

How it works

Applied solution

related questions