tags:

views:

108

answers:

2

The task I was trying to accomplish was that given an input pattern, e.g. 1 2 3 3 2 4 2 1, to go through a dictionary and find words that fit the given pattern. In my code, I tried taking the given string and converting it to a regular expression like so:

(?<1>.)(?<2>.)(?<3>.)(\k<3>)(\k<2>)(?<4>.)(\k<2>)(\k<1>)

(Before anyone starts bashing use of the dot here, since my input is a dictionary file with only real words, I left the dots to have a cleaner looking expression rather than specifying ranges of characters.)

This expression manages to find the word correctly, but there's a flaw in it. The problem becomes very apparent with a pattern such as 1 2 3 4 5 6. My algorithm generates the following regular expression:

(?<1>.)(?<2>.)(?<3>.)(?<4>.)(?<5>.)(?<6>.)

This is wrong because it will match any 6 character string without taking into account that each group should NOT match any characters that have already been matched by previous groups. In other words, it doesn't take into account that each letter is distinct; no repeats.

So I tried looking all over the internet for syntax to exclude a named group within a character class, i.e.

[^\1] (doesn't work), [^(\k<1>)] (doesn't work), [^${1}] (doesn't work)...etc.

In the .NET documentation, it shows that \p{name} is valid syntax in a character class, but I tried [^\p{1}] and didn't work either.

So, question remains...is it possible to exclude a named group from further matching? Or, how else would I solve this?

UPDATE

Posting my final solution based on the responses I got here. This method takes a string specifying the pattern one is looking for and converts it into a regular expression that I then apply to a dictionary and find all words that fit the pattern.

    string pattern = "12332421";

    private void CreateRegEx()
    {
        string regex = "^";

        for( int i = 0; i < pattern.Length; i++ )
        {
            char c = pattern[i];
            if (char.IsDigit(c))
            {
                if (isUnique(c))
                {
                    regex += "(.)(?!.*\\" + c + ")(?<!\\" + c + ".+)";
                }
                else
                {
                    if (isFirstOccurrence(c, i))
                        regex += "(.)";                        
                    else
                        regex += "\\" + c;
                }
            }
            else if (char.IsLetter(c))
                regex += c + "";
            else if (c == '?')
                regex += ".";
        }

        regex += "$";

        reg = new Regex(regex, RegexOptions.IgnoreCase);
    }

    private bool isUnique(char c)
    {
        return pattern.IndexOf(c) == pattern.LastIndexOf(c);
    }

    private bool isFirstOccurrence(char c, int i)
    {
        return pattern.IndexOf(c) == i;
    }

    public List<string> GetMatches()
    {
        return dictionary.FindAll(x => reg.IsMatch(x));
    }

Thanks again for the awesome responses.

+1  A: 

The answer is: no. You cannot use backreferences in character classes of .NET regular expressions. Sorry. See below for a workaround for your situation.

"it shows that \p{name} is valid syntax in a character class"

yes, it is. But the .NET documentation does not say that name will be interpreted from a backreference. It must be a unicode literal class string.

"In other words, it doesn't take into account that each letter is distinct; no repeats."

I understand that this means to match all of e f a x and only f and x in e f e x. In other words: match unique characters, do not match characters that are repeated.

Solution

I understand your question as follows: match all unique words (subexpressions, characters) in a string that have no repetition before or after itself. The basic regular expression you should use is this:

(subexpr)(?!.*\1)(?<!\1.+)

which will find the word subexpr only when it appears once in the matching string. for instance, if we change it to match the e in e f a x and not in e f e x, it will look like this:

(e)(?!.*\1)(?<!\1.+)

You can generalize this to match every unique letter in the string:

(.)(?!.*\1)(?<!\1.+)

if will match e, f, a and x in e f a x and only f and x in e f e x. This could be the generalized replacement for your expression above and you don't need to repeat the 1,2,3 etc captures anymore.

How it works

(update) Perhaps nice to know how the above regex works:

(subexpr)   # grab subexpression (can be any valid grouped regex)
(?!.*\1)    # negative look forward with a backrefence: if followed somewhere by itself, fail
(?<!\1.+)   # negative look backward with backref: if preceded somewhere by itself, fail

Applied solution

A word has a pattern. SUCCUBUS is 1 2 3 3 2 4 2 1. PAST is 1 2 3 4. Based on that pattern, a regex should match words with the same pattern: same length of the word, repetition of the same letters in the same place: PAST and RANT have the same pattern. LOOK and HEEL have the same pattern, but not HERE.

Taking the previous solution, we adjust that generally to your problem domain by sticking to the following rules:

  1. A unique letter is represented by (.)(?!.*\X)(?<!\X.+)
  2. A repeated letter is represented by (.)
  3. The location where repetition occurs is represented by \X (no parentheses!)
  4. \X means a backreference with the number of your pattern

Examples:

# SUCCUBUS is 1 2 3 3 2 4 2 1 (only 4 is unique)
(.)                      # nr 1 in pattern
(.)                      # nr 2 in pattern
(.)                      # nr 3 in pattern
\3                       # repeat 3
\2                       # repeat 2
(.)(?!.*\4)(?<!\4.+)     # nr 4 UNIQUE!
\2                       # repeat 2
\1                       # repeat 1

# PAST (all unique: 1 2 3 4)
(.)(?!.*\1)(?<!\1.+)    # nr 1 in pattern
(.)(?!.*\2)(?<!\2.+)    # nr 2 in pattern
(.)(?!.*\3)(?<!\3.+)    # nr 3 in pattern
(.)(?!.*\4)(?<!\4.+)    # nr 4 in pattern

This pattern should be easily automated into your current system.

An excellent way to test this and other regexes (just copy and paste mine) is at Regex Hero, free online SilverLight .NET regex tester. For other online testers, see my overview chart of them.

update: removed earlier irrelevant update note

Update 1: In a comment in the other solution, you say that you want to be able to match a substring that fits the pattern. Naturally, this poses a challenge with the negative look-forward/behind: as they are now, they look through the whole string. Replace the .* and .+ with the relative length the expression would be at the place, pos 3 for PAST becomes then (.)(?!.{1}\3)(?<!\3.{2}) and pos 4 would become (.)(?!.{2}\3)(?<!\3.{3})

Update 2: in the same way, it is possible to optimize slightly by removing the look-back in the first expression and removing the look-forward in the last, if they need to be unique: pos 1 becomes (.)(?!.{3}\3) and pos 4 becomes (.)(?<!\3.{3})

Abel
if the input is of the form "1 2 3 4", that means we're looking for a 4 letter word where each character is distinct. So yes, in that case, efax would be a match, but efex wouldn't. How do you state that logic in a regex (if possible)?
Roly
my original solution solved that to some extend. See also my comment on your question. More importantly, check the updated answer, section Applied Solution, which brings my generalized solution into your specific problem domain.
Abel
fixed some errors. The examples are now copy/paste correct with your use-case.
Abel
A: 

In order to do such a thing, you could use negative look ahead before matching a new group.

I will use a more general PCRE notation:

(.)((?!\1).)((?!\1|\2).)\3\2((?!\1\2\3).)\2\1

The regex above will match the string 12332421 but will not match 12112421 or 11111111.

A short explanation:

(.)           // match any character (except line breaks) and store it in group 1
(             // open group 2
  (?!\1)      //   if looking ahead group 1 cannot be seen,
  .           //   match any character (except line breaks)
)             // close group 2
(             // open group 3
  (?!\1|\2)   //   if looking ahead group 1 or 2 cannot be seen,
  .           //   match any character (except line breaks)
)             // close group 3
\3            // back referencing group 3
\2            // back referencing group 2
(             // open group 4
  (?!\1\2\3)  //   if looking ahead group 1, 2 or 3 cannot be seen,
  .           //   match any character (except line breaks)
)             // close group 4
\2            // back referencing group 2
\1            // back referencing group 1

Of course, you don't really need to group #4 since you're not back referencing it.

You might agree with me that regex is not be the best tool to do this kind of matching...

Edit:

Well, I don't see how you're going to build these regexes, but can't imagine it being easier than this little method which simply accepts a pattern and a target string and tests if they match:

public class Test {

    public static boolean matchesPattern(String text, String pattern) {
        if(text.length() != pattern.length()) return false;
        Map<Character, Character> mappings = new HashMap<Character, Character>();
        for(int index = 0; index < pattern.length(); index++) {
            Character patternChar = pattern.charAt(index);
            Character textChar = text.charAt(index);
            if(mappings.containsKey(patternChar)) {
                if(mappings.get(patternChar) != textChar) return false;
            } 
            else {
                if(mappings.values().contains(textChar)) return false;
                mappings.put(patternChar, textChar);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String pattern = "abccbdba";
        String[] tests = {"12332421", "12112421", "11111111"};
        for(String t : tests) {
            System.out.println(t+" -> "+matchesPattern(t, pattern));
        }
    }
}

which produces the following output:

12332421 -> true
12112421 -> false
11111111 -> false
Bart Kiers
You mean unrolling? Which is the same problem as with matching parentheses? .NET supports this explicitly with balancing groups. But also consider the look forward / look backward solution I provided, it may work if I got the problem domain correct.
Abel
I figured regex would be best because in addition to matching a whole word, it could find the pattern as a substring which would require a very convoluted amount of logic to do by matching charAt idx 0 with charAt idx 3, but not equal to charAt idx 1 and 2, etc. Nice solution, btw.
Roly
Thanks Roly. Also see my edit.
Bart Kiers
@Abel, not sure what you mean by "unrolling", sorry. I know very little C#, so I'm not familiar with with this "balanced groupings" you mentioned.
Bart Kiers
no problem. Balanced groups are used to be sure that "[[helo] world]" matches but "[[helo] world" not (i.e., balanced parentheses: balanced groups, hence the name). The traditional approach is "unrolling" (J. Friedl's term) which is cumbersome when any depth of parentheses is involved. I updated my solution to fix it "regex-only" like. I didn't use anything .NET specific in the end.
Abel
Abel, thanks for the clarification. I seems I haven't read my copy of Friedl's book often enough: the term didn't ring a bell.
Bart Kiers
By the way, of course "balancing" did ring a bell, but I thought "balancing groups" might have meant something else.
Bart Kiers