ansaurus

Question

Split string of varying length using regex

Answer 1

+5 A:

Regex purpose is to recognize patterns. Here you doesn't search for patterns but for a length split. So regex are not appropriate.

It is propably possible, but not with a single regex : to find the first n characters using a regex, you use: "^(.{n}).*"

So, you can search with that regex for the 1st character. Then, you make a substring, and you search for the 2 next characters. Etc.

Like @splash said, it will make the code more complicated, and unefficient, since you use regex for something outside of their purpose.

Benoit Courtine 2010-09-10 06:40:07

@Benoit: the word "patterns" can be interpreted very broadly. My solution is a single regex that does the job, so by that definition, there is a well-defined splitting delimiter "pattern". That said, yes, it's definitely an overkill to use regex to solve this task, but boy, can you learn so much from it.

polygenelubricants 2010-09-11 14:11:40

Answer 2

A:

String a = "hellohowareyou??";
int i = 1;

    while(true) {

        if(i >= a.length()) {
            System.out.println(a);
            break;
        }

        else {
            String b = a.substring(i++);
            String[] out = a.split(Pattern.quote(b) + "$");
            System.out.println(out[0]);
            a = b;
            if(b.isEmpty())
                break;
        }

    }

xnikolai 2010-09-10 08:47:02

Answer 3

+7 A:

Solution

The following snippet generates the pattern that does the job (see it run on ideone.com):

// splits at indices that are triangular numbers
class TriangularSplitter {

  // asserts that the prefix of the string matches pattern
  static String assertPrefix(String pattern) {
    return "(?<=(?=^pattern).*)".replace("pattern", pattern);
  }
  // asserts that the entirety of the string matches pattern
  static String assertEntirety(String pattern) {
    return "(?<=(?=^pattern$).*)".replace("pattern", pattern);
  }
  // repeats an assertion as many times as there are dots behind current position
  static String forEachDotBehind(String assertion) {
    return "(?<=^(?:.assertion)*?)".replace("assertion", assertion);
  }

  public static void main(String[] args) {
    final String TRIANGULAR_SPLITTER =
      "(?x) (?<=^.) | measure (?=(.*)) check"
        .replace("measure", assertPrefix("(?: notGyet . +NBefore +1After)*"))
        .replace("notGyet", assertPrefix("(?! \\1 \\G)"))
        .replace("+NBefore", forEachDotBehind(assertPrefix("(\\1? .)")))
        .replace("+1After", assertPrefix(".* \\G (\\2?+ .)"))
        .replace("check", assertEntirety("\\1 \\G \\2 . \\3"))
        ;
    String text = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    System.out.println(
        java.util.Arrays.toString(text.split(TRIANGULAR_SPLITTER))
    );
    // [a, bc, def, ghij, klmno, pqrstu, vwxyzAB, CDEFGHIJ, KLMNOPQRS, TUVWXYZ]
  }
}

Note that this solution uses techniques already covered in my regex article series. The only new thing here is \G and forward references.

References

This is a brief description of the basic regex constructs used:

(?x) is the embedded flag modifier to enable the free-spacing mode, where unescaped whitespaces are ignored (and # can be used for comments).
^ and $ are the beginning and end-of-the-line anchors. \G is the end-of-previous match anchor.
| denotes alternation (i.e. "or").
? as a repetition specifier denotes optional (i.e. zero-or-one of). As a repetition quantifier in e.g. .*? it denotes that the * (i.e. zero-or-more of) repetition is reluctant/non-greedy.
(…) are used for grouping. (?:…) is a non-capturing group. A capturing group saves the string it matches; it allows, among other things, matching on back/forward/nested references (e.g. \1).
(?=…) is a positive lookahead; it looks to the right to assert that there's a match of the given pattern.(?<=…) is a positive lookbehind; it looks to the left.
(?!…) is a negative lookahead; it looks to the right to assert that there isn't a match of a pattern.

Explanation

The pattern matches on zero-width assertions. A rather complex algorithm is used to assert that the current position is a triangular number. There are 2 main alternatives:

(?<=^.), i.e. we can lookbehind and see the beginning of the string one dot away
- This matches at index 1, and is a crucial starting point to the rest of the process
Otherwise, we measure to reconstruct how the last match was made (using \G as reference point), storing the result of the measurement in "before" \G and "after" \G capturing groups. We then check if the current position is the one prescribed by the measurement to find where the next match should be made.

Thus the first alternative is the trivial "base case", and the second alternative sets up how to make all subsequent matches after that. Java doesn't have custom-named groups, but here are the semantics for the 3 capturing groups:

\1 captures the string "before" \G
\2 captures some string "after" \G
If the length of \1 is e.g. 1+2+3+...+k, then the length of \2 needs to be k.
- Hence \2 . has length k+1 and should be the next part in our split!
\3 captures the string to the right of our current position
- Hence when we can assertEntirety on \1 \G \2 . \3, we match and set the new \G

You can use mathematical induction to rigorously prove the correctness of this algorithm.

To help illustrate how this works, let's work through an example. Let's take abcdefghijklm as input, and say that we've already partially splitted off [a, bc, def].

          \G     we now need to match here!
           ↓       ↓
a b c d e f g h i j k l m n
\____1____/ \_2_/ . \__3__/   <--- \1 G \2 . \3
  L=1+2+3    L=3

Remember that \G marks the end of the last match, and it occurs at triangular number indices. If \G occured at 1+2+3+...+k, then the next match needs to be k+1 positions after \G to be a triangular number index.

Thus in our example, given where \G is where we just splitted off def, we measured that k=3, and the next match will split off ghij as expected.

To have \1 and \2 be built according to the above specification, we basically do a while "loop": for as long as it's notGyet, we count up to k as follows:

+NBefore, i.e. we extend \1 by one forEachDotBehind
+1After, i.e. we extend \2 by just one

Note that notGyet contains a forward reference to group 1 which is defined later in the pattern. Essentially we do the loop until \1 "hits" \G.

Conclusion

Needless to say, this particular solution has a terrible performance. The regex engine only remembers WHERE the last match was made (with \G), and forgets HOW (i.e. all capturing groups are reset when the next attempt to match is made). Our pattern must then reconstruct the HOW (an unnecessary step in traditional solutions, where variables aren't so "forgetful"), by painstakingly building strings by appending one character at a time (which is O(N^2)). Each simple measurement is linear instead of constant time (since it's done as a string matching where length is a factor), and on top of that we make many measurements which are redundant (i.e. to extend by one, we need to first re-match what we already have).

There are probably many "better" regex solutions than this one. Nonetheless, the complexity and inefficiency of this particular solution should rightfully suggest that regex is not the designed for this kind of pattern matching.

That said, for learning purposes, this is an absolutely wonderful problem, for there is a wealth of knowledge in researching and formulating its solutions. Hopefully this particular solution and its explanation has been instructive.

polygenelubricants 2010-09-10 14:04:53

ansaurus

tags:

views:

answers:

Split string of varying length using regex

Solution

References

Related questions

Explanation

Conclusion

related questions