Solution
The following snippet generates the pattern that does the job (see it run on ideone.com):
// splits at indices that are triangular numbers
class TriangularSplitter {
// asserts that the prefix of the string matches pattern
static String assertPrefix(String pattern) {
return "(?<=(?=^pattern).*)".replace("pattern", pattern);
}
// asserts that the entirety of the string matches pattern
static String assertEntirety(String pattern) {
return "(?<=(?=^pattern$).*)".replace("pattern", pattern);
}
// repeats an assertion as many times as there are dots behind current position
static String forEachDotBehind(String assertion) {
return "(?<=^(?:.assertion)*?)".replace("assertion", assertion);
}
public static void main(String[] args) {
final String TRIANGULAR_SPLITTER =
"(?x) (?<=^.) | measure (?=(.*)) check"
.replace("measure", assertPrefix("(?: notGyet . +NBefore +1After)*"))
.replace("notGyet", assertPrefix("(?! \\1 \\G)"))
.replace("+NBefore", forEachDotBehind(assertPrefix("(\\1? .)")))
.replace("+1After", assertPrefix(".* \\G (\\2?+ .)"))
.replace("check", assertEntirety("\\1 \\G \\2 . \\3"))
;
String text = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
System.out.println(
java.util.Arrays.toString(text.split(TRIANGULAR_SPLITTER))
);
// [a, bc, def, ghij, klmno, pqrstu, vwxyzAB, CDEFGHIJ, KLMNOPQRS, TUVWXYZ]
}
}
Note that this solution uses techniques already covered in my regex article series. The only new thing here is \G
and forward references.
References
This is a brief description of the basic regex constructs used:
(?x)
is the embedded flag modifier to enable the free-spacing mode, where unescaped whitespaces are ignored (and #
can be used for comments).
^
and $
are the beginning and end-of-the-line anchors. \G
is the end-of-previous match anchor.
|
denotes alternation (i.e. "or").
?
as a repetition specifier denotes optional (i.e. zero-or-one of). As a repetition quantifier in e.g. .*?
it denotes that the *
(i.e. zero-or-more of) repetition is reluctant/non-greedy.
(…)
are used for grouping. (?:…)
is a non-capturing group. A capturing group saves the string it matches; it allows, among other things, matching on back/forward/nested references (e.g. \1
).
(?=…)
is a positive lookahead; it looks to the right to assert that there's a match of the given pattern.(?<=…)
is a positive lookbehind; it looks to the left.
(?!…)
is a negative lookahead; it looks to the right to assert that there isn't a match of a pattern.
Related questions
Explanation
The pattern matches on zero-width assertions. A rather complex algorithm is used to assert that the current position is a triangular number. There are 2 main alternatives:
(?<=^.)
, i.e. we can lookbehind and see the beginning of the string one dot away
- This matches at index 1, and is a crucial starting point to the rest of the process
- Otherwise, we
measure
to reconstruct how the last match was made (using \G
as reference point), storing the result of the measurement in "before" \G
and "after" \G
capturing groups. We then check
if the current position is the one prescribed by the measurement to find where the next match should be made.
Thus the first alternative is the trivial "base case", and the second alternative sets up how to make all subsequent matches after that. Java doesn't have custom-named groups, but here are the semantics for the 3 capturing groups:
\1
captures the string "before" \G
\2
captures some string "after" \G
- If the length of
\1
is e.g. 1+2+3+...+k, then the length of \2
needs to be k.
- Hence
\2 .
has length k+1 and should be the next part in our split
!
\3
captures the string to the right of our current position
- Hence when we can
assertEntirety
on \1 \G \2 . \3
, we match and set the new \G
You can use mathematical induction to rigorously prove the correctness of this algorithm.
To help illustrate how this works, let's work through an example. Let's take abcdefghijklm
as input, and say that we've already partially splitted off [a, bc, def]
.
\G we now need to match here!
↓ ↓
a b c d e f g h i j k l m n
\____1____/ \_2_/ . \__3__/ <--- \1 G \2 . \3
L=1+2+3 L=3
Remember that \G
marks the end of the last match, and it occurs at triangular number indices. If \G
occured at 1+2+3+...+k, then the next match needs to be k+1 positions after \G
to be a triangular number index.
Thus in our example, given where \G
is where we just splitted off def
, we measured that k=3, and the next match will split off ghij
as expected.
To have \1
and \2
be built according to the above specification, we basically do a while
"loop": for as long as it's notGyet
, we count up to k as follows:
+NBefore
, i.e. we extend \1
by one forEachDotBehind
+1After
, i.e. we extend \2
by just one
Note that notGyet
contains a forward reference to group 1 which is defined later in the pattern. Essentially we do the loop until \1
"hits" \G
.
Conclusion
Needless to say, this particular solution has a terrible performance. The regex engine only remembers WHERE the last match was made (with \G
), and forgets HOW (i.e. all capturing groups are reset when the next attempt to match is made). Our pattern must then reconstruct the HOW (an unnecessary step in traditional solutions, where variables aren't so "forgetful"), by painstakingly building strings by appending one character at a time (which is O(N^2)
). Each simple measurement is linear instead of constant time (since it's done as a string matching where length is a factor), and on top of that we make many measurements which are redundant (i.e. to extend by one, we need to first re-match what we already have).
There are probably many "better" regex solutions than this one. Nonetheless, the complexity and inefficiency of this particular solution should rightfully suggest that regex is not the designed for this kind of pattern matching.
That said, for learning purposes, this is an absolutely wonderful problem, for there is a wealth of knowledge in researching and formulating its solutions. Hopefully this particular solution and its explanation has been instructive.