ansaurus

Question

Answer 1

+1 A:

Split chops a string into multiple pieces, but that doesn't allow for overlap. You'd need to use a loop to get overlapping pieces.
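For what it's worth, here is a minimal sketch of such a loop in Java, with no regex at all. The class name `OverlappingPairs` and the helper `windows` are made up for illustration; they don't come from the question.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlappingPairs {
    // Hypothetical helper: collects every run of `size` adjacent characters,
    // advancing one character at a time so consecutive pieces overlap.
    static List<String> windows(String s, int size) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + size <= s.length(); i++) {
            result.add(s.substring(i, i + size));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(windows("12345", 2)); // prints [12, 23, 34, 45]
    }
}
```

A string shorter than the window size simply yields an empty list.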

Ry4an 2010-03-13 03:26:55

Answer 2

+1 A:

I don't think you can do this with split() because it throws away the part that matches the regular expression.

In Perl this works:

my $string = '12345';
my @array = ();
while ( $string =~ s/(\d(\d))/$2/ ) {
    push(@array, $1);
}
print join(" ", @array);
# prints: 12 23 34 45

The find-and-replace expression says: match the first two adjacent digits and replace them in the string with just the second of the two digits.

Ian C. 2010-03-13 03:29:42

Answer 3

+4 A:

This somewhat heavy implementation uses Matcher.find instead of split. It works, although once you have to write a for loop for such a trivial task you might as well drop regular expressions altogether and use substrings, which takes similar coding effort and fewer CPU cycles:

import java.util.*;
import java.util.regex.*;

public class StringSplit {
    public static void main(String[] args) {
        List<String> result = new ArrayList<String>();
        Matcher m = Pattern.compile("..").matcher("12345");
        // find(int) resets the matcher and searches from the given index, so
        // restarting one character past the previous match start allows overlap.
        for (int from = 0; m.find(from); from = m.start() + 1) {
            result.add(m.group());
        }
        System.out.println(result); // prints "[12, 23, 34, 45]"
    }
}

EDIT1

match(): the reason nobody so far has been able to concoct a regular expression like your BONUS_REGEX lies in Matcher, which resumes looking for the next match where the previous match ended (i.e. no overlap), as opposed to just after where the previous match started, unless you explicitly respecify the start position of the search (as above). A good candidate for BONUS_REGEX would have been "(.\\G.|^..)" but, unfortunately, the \G-anchor-in-the-middle trick doesn't work with Java's Matcher (though it works just fine in Perl):

 perl -e 'while ("12345"=~/(^..|.\G.)/g) { print "$1\n" }'
 12
 23
 34
 45

split(): as for INSERT_REGEX_HERE, a good candidate would have been (?<=..)(?=..) (the split point is the zero-width position with two characters to its right and two to its left), but again, because split has no notion of overlap, you end up with [12, 3, 45] (close, but no cigar).
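To see that near miss concretely, here is a small sketch. The class name `LookaroundSplit` and the method `splitBetweenPairs` are illustrative only.

```java
import java.util.Arrays;

public class LookaroundSplit {
    // Splits at every zero-width position that has at least two characters
    // on its left and at least two on its right.
    static String[] splitBetweenPairs(String s) {
        return s.split("(?<=..)(?=..)");
    }

    public static void main(String[] args) {
        // In "12345" only the positions after "12" and after "123" qualify,
        // and each character ends up in exactly one piece.
        System.out.println(Arrays.toString(splitBetweenPairs("12345"))); // prints [12, 3, 45]
    }
}
```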

EDIT2

For fun, you can trick split() into doing what you want by first doubling non-boundary characters (here you need a reserved character value to split around):

Pattern.compile("((?<=.).(?=.))").matcher("12345").replaceAll("$1#$1").split("#")

We can be smart and eliminate the need for a reserved character by taking advantage of the fact that zero-width look-ahead assertions (unlike look-behind) can have an unbounded length; we can therefore split around all points which are an even number of characters away from the end of the doubled string (and at least two characters away from its beginning), producing the same result as above:

Pattern.compile("((?<=.).(?=.))").matcher("12345").replaceAll("$1$1").split("(?<=..)(?=(..)*$)")
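If it helps to see the intermediate step, the sketch below breaks that one-liner into two parts and prints the doubled string before splitting it. The class and method names (`DoubledSplit`, `doubleInner`, `overlappingPairs`) are invented for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class DoubledSplit {
    // Doubles every character that has a neighbour on both sides,
    // leaving the first and last characters single.
    static String doubleInner(String s) {
        return Pattern.compile("((?<=.).(?=.))").matcher(s).replaceAll("$1$1");
    }

    // Splits the doubled string at positions that are at least two characters
    // from the start and an even number of characters from the end.
    static String[] overlappingPairs(String s) {
        return doubleInner(s).split("(?<=..)(?=(..)*$)");
    }

    public static void main(String[] args) {
        System.out.println(doubleInner("12345"));                       // prints 12233445
        System.out.println(Arrays.toString(overlappingPairs("12345"))); // prints [12, 23, 34, 45]
    }
}
```

The split pattern also matches at the very end of the doubled string, but split() discards the trailing empty string that this produces.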

Alternatively, you can trick match() in a similar way (again without needing a reserved character value):

Matcher m = Pattern.compile("..").matcher(
  Pattern.compile("((?<=.).(?=.))").matcher("12345").replaceAll("$1$1")
);
while (m.find()) { 
    System.out.println(m.group()); 
} // prints "12", "23", "34", "45"

Cheers, V.

vladr 2010-03-13 03:47:43

+1. This kind of exploration is the spirit of my question.

polygenelubricants 2010-03-13 04:52:53

+1 for alerting me to the \G-in-the-middle trick, even if it *does* work only in Perl.

Alan Moore 2010-03-13 11:43:48

+1 for the \G-trick, does that work in PCRE also?

Qtax 2010-03-16 03:12:41

Answer 4

+4 A:

I don't think this is possible with split(), but with find() it's pretty simple. Just use a lookahead with a capturing group inside:

Matcher m = Pattern.compile("(?=(\\d\\d)).").matcher("12345");
while (m.find())
{
  System.out.println(m.group(1));
}

Many people don't realize that text captured inside a lookahead or lookbehind can be referenced after the match just like any other capture. It's especially counter-intuitive in this case, where the capture is a superset of the "whole" match.

As a matter of fact, it works even if the regex as a whole matches only the empty string. Remove the dot from the regex above (leaving "(?=(\\d\\d))") and you'll get the same result. This is because, whenever a successful match consumes no characters, the regex engine automatically bumps ahead one position before trying to match again, to prevent infinite loops.
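A quick way to convince yourself is to print the match boundaries next to the capture. This is only a sketch; the class name `LookaheadCapture` and the method `describeMatches` are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadCapture {
    // Records, for each match of the lookahead-only regex, where the overall
    // match starts and ends, plus what group(0) and group(1) contain.
    static List<String> describeMatches(String s) {
        List<String> lines = new ArrayList<>();
        Matcher m = Pattern.compile("(?=(\\d\\d))").matcher(s);
        while (m.find()) {
            lines.add(m.start() + "-" + m.end()
                    + " group(0)='" + m.group() + "'"
                    + " group(1)='" + m.group(1) + "'");
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeMatches("12345")) {
            System.out.println(line);
        }
        // prints:
        // 0-0 group(0)='' group(1)='12'
        // 1-1 group(0)='' group(1)='23'
        // 2-2 group(0)='' group(1)='34'
        // 3-3 group(0)='' group(1)='45'
    }
}
```

Every overall match is empty (start equals end), yet group(1) still holds two digits, and the start index moves forward by one each time.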

There's no split() equivalent for this technique, though, at least not in Java. Although you can split on lookarounds and other zero-width assertions, there's no way to get the same character to appear in more than one of the resulting substrings.

Alan Moore 2010-03-13 09:10:55

+1! BRAVO!! "Many people don't realize..." <- I certainly didn't! Great job so far, guys! Keep it going!

polygenelubricants 2010-03-13 09:53:42

My eyes are certainly opened. I've always thought that group(0) is necessarily a superstring of all other groups.

polygenelubricants 2010-03-13 10:00:58

+1 for pointing out that captures *inside* zero-width assertions are not zero-width (despite captures outside being zero-width.) :)

vladr 2010-03-14 22:23:31

Answer 5

A:

An alternative using plain matching in Perl. It should work anywhere lookaheads do, and it needs no loop.

 $_ = '12345';
 @list = /(?=(..))./g;
 print "@list";

 # Output:
 # 12 23 34 45

But this one, as posted earlier, is nicer if the \G trick works:

 $_ = '12345';
 @list = /^..|.\G./g;
 print "@list";

 # Output:
 # 12 23 34 45

Edit: Sorry, didn't see that all of this was posted already.

Qtax 2010-03-15 18:05:45

Regex split into overlapping strings
