views:

5562

answers:

10

I'm trying to split a string with all non-alphanumeric characters as delimiters yet Java's String.split() method discards the delimiter characters from the resulting array. Is there a way to split a string like the "\W" regex pattern does, yet keep the delimiters?

A: 

I don't know Java too well, but if you can't find a Split method that does that, I suggest you just make your own.

string[] mySplit(string s,string delimiter)
{
    string[] result = s.Split(delimiter);
    for(int i=0;i<result.Length-1;i++)
    {
     result[i] += delimiter; //this one would add the delimiter to each items end except the last item, 
        //you can modify it however you want
    }
}
string[] res = mySplit(myString,myDelimiter);

Its not too elegant, but it'll do.

but what if you have multiple delimiters in a row?
Kip
A: 

Fast answer: use non physical bounds like \b to split. I will try and experiment to see if it works (used that in PHP and JS).

It is possible, and kind of work, but might split too much. Actually, it depends on the string you want to split and the result you need. Give more details, we will help you better.

Another way is to do your own split, capturing the delimiter (supposing it is variable) and adding it afterward to the result.

My quick test:

String str = "'ab','cd','eg'";
String[] stra = str.split("\\b");
for (String s : stra) System.out.print(s + "|");
System.out.println();

Result:

'|ab|','|cd|','|eg|'|

A bit too much... :-)

PhiLho
A: 

I don't know of an existing function in the Java API that does this (which is not to say it doesn't exist), but here's my own implementation (one or more delimiters will be returned as a single token; if you want each delimiter to be returned as a separate token, it will need a bit of adaptation):

static String[] splitWithDelimiters(String s) {
    if (s == null || s.length() == 0) {
        return new String[0];
    }
    LinkedList<String> result = new LinkedList<String>();
    StringBuilder sb = null;
    boolean wasLetterOrDigit = !Character.isLetterOrDigit(s.charAt(0));
    for (char c : s.toCharArray()) {
        if (Character.isLetterOrDigit(c) ^ wasLetterOrDigit) {
            if (sb != null) {
                result.add(sb.toString());
            }
            sb = new StringBuilder();
            wasLetterOrDigit = !wasLetterOrDigit;
        }
        sb.append(c);
    }
    result.add(sb.toString());
    return result.toArray(new String[0]);
}
bdumitriu
A: 

Code Challenge! (Community wiki post)

Implement a class which is able to split a String, while getting the delimiter before the next split elements. Uses a real regexp (and not a character sequence).

The end result must look like:

StringTokenizerEx usages:

  • limit cases:
    'null' to be splitted with regexp 'null' gives []
    '' to be splitted with regexp 'null' gives []
    'null' to be splitted with regexp '' gives []
    '' to be splitted with regexp '' gives []
  • border cases:
    'abcd' to be splitted with regexp 'ab' gives [ab], 'cd', []
    'abcd' to be splitted with regexp 'cd' gives [], 'ab', [cd]
    'abcd' to be splitted with regexp 'abcd' gives [abcd]
    'abcd' to be splitted with regexp 'bc' gives [], 'a', [bc], 'd', []
  • real cases:
'abcd    efg  hi   j' to be splitted with regexp '[ \t\n\r\f]+' gives [], 'abcd', [   ], 'efg', [  ], 'hi', [   ], 'j', []
''ab','cd','eg'' to be splitted with regexp '\W+' gives ['], 'ab', [','], 'cd', [','], 'eg', [']
  • split-like cases:
'boo:and:foo' to be splitted with regexp ':' gives [], 'boo', [:], 'and', [:], 'foo', []
'boo:and:foo' to be splitted with regexp 'o' gives [], 'b', [o], '', [o], ':and:f', [o], '', [o]
'boo:and:foo' to be splitted with regexp 'o+' gives [], 'b', [oo], ':and:f', [oo]

Delimiters are within square brackets [...]
Spli strings are with simple quotes ''

VonC
+3  A: 

I like the idea of StringTokenizer because it is Enumerable.
But it is also obsolete, and replace by String.split which return a boring String[] (and does not includes the delimiters).

So I implemented a StringTokenizerEx which is an Iterable, and which takes a true regexp to split a string.

A true regexp means it is not a 'Character sequence' repeated to form the delimiter:
'o' will only match 'o', and split 'ooo' into three delimiter, with two empty string inside:

[o], '', [o], '', [o]

But the regexp o+ will return the expected result when splitting "aooob"

[], 'a', [ooo], 'b', []

To use this StringTokenizerEx:

final StringTokenizerEx aStringTokenizerEx = new StringTokenizerEx("boo:and:foo", "o+");
final String firstDelimiter = aStringTokenizerEx.getDelimiter();
for(String aString: aStringTokenizerEx )
{
    // uses the split String detected and memorized in 'aString'
    final nextDelimiter = aStringTokenizerEx.getDelimiter();
}

The code of this class is available at DZone Snippets.

As usual for a code-challenge response (one self-contained class with test cases included), copy-paste it (in a 'src/test' directory) and run it. Its main() method illustrates the different usages.


Note: (late 2009 edit)

The article Final Thoughts: Java Puzzler: Splitting Hairs does a good work explaning the bizarre behavior in String.split().
Josh Bloch even commented in response to that article:

Yes, this is a pain. FWIW, it was done for a very good reason: compatibility with Perl.
The guy who did it is Mike "madbot" McCloskey, who now works with us at Google. Mike made sure that Java's regular expressions passed virtually every one of the 30K Perl regular expression tests (and ran faster).

The Google common-library Guava contains also a Splitter which is:

  • simpler to use
  • maintained by Google (and not by you)

So it may worth being checked out. From their initial rough documentation (pdf):

JDK has this:

String[] pieces = "foo.bar".split("\\.");

It's fine to use this if you want exactly what it does: - regular expression - result as an array - its way of handling empty pieces

Mini-puzzler: ",a,,b,".split(",") returns...

(a) "", "a", "", "b", ""
(b) null, "a", null, "b", null
(c) "a", null, "b"
(d) "a", "b"
(e) None of the above

Answer: (e) None of the above.

",a,,b,".split(",")
returns
"", "a", "", "b"

Only trailing empties are skipped! (Who knows the workaround to prevent the skipping? It's a fun one...)

In any case, our Splitter is simply more flexible: The default behavior is simplistic:

Splitter.on(',').split(" foo, ,bar, quux,")
--> [" foo", " ", "bar", " quux", ""]

If you want extra features, ask for them!

Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split(" foo, ,bar, quux,")
--> ["foo", "bar", "quux"]

Order of config methods doesn't matter -- during splitting, trimming happens before checking for empties.

VonC
+5  A: 
import java.util.regex.*;
import java.util.LinkedList;

public class Splitter {
    private static final Pattern DEFAULT_PATTERN = Pattern.compile("\\s+");

    private Pattern pattern;
    private boolean keep_delimiters;

    public Splitter(Pattern pattern, boolean keep_delimiters) {
        this.pattern = pattern;
        this.keep_delimiters = keep_delimiters;
    }
    public Splitter(String pattern, boolean keep_delimiters) {
        this(Pattern.compile(pattern==null?"":pattern), keep_delimiters);
    }
    public Splitter(Pattern pattern) { this(pattern, true); }
    public Splitter(String pattern) { this(pattern, true); }
    public Splitter(boolean keep_delimiters) { this(DEFAULT_PATTERN, keep_delimiters); }
    public Splitter() { this(DEFAULT_PATTERN); }

    public String[] split(String text) {
        if (text == null) {
            text = "";
        }

        int last_match = 0;
        LinkedList<String> splitted = new LinkedList<String>();

        Matcher m = this.pattern.matcher(text);

        while (m.find()) {

            splitted.add(text.substring(last_match,m.start()));

            if (this.keep_delimiters) {
                splitted.add(m.group());
            }

            last_match = m.end();
        }

        splitted.add(text.substring(last_match));

        return splitted.toArray(new String[splitted.size()]);
    }

    public static void main(String[] argv) {
        if (argv.length != 2) {
            System.err.println("Syntax: java Splitter <pattern> <text>");
            return;
        }

        Pattern pattern = null;
        try {
            pattern = Pattern.compile(argv[0]);
        }
        catch (PatternSyntaxException e) {
            System.err.println(e);
            return;
        }

        Splitter splitter = new Splitter(pattern);

        String text = argv[1];
        int counter = 1;
        for (String part : splitter.split(text)) {
            System.out.printf("Part %d: \"%s\"\n", counter++, part);
        }
    }
}

/*
    Example:
    > java Splitter "\W+" "Hello World!"
    Part 1: "Hello"
    Part 2: " "
    Part 3: "World"
    Part 4: "!"
    Part 5: ""
*/

I don't really like the other way, where you get an empty element in front and back. A delimiter is usually not at the beginning or at the end of the string, thus you most often end up wasting two good array slots.

Edit: Fixed limit cases. Commented source with test cases can be found here: http://snippets.dzone.com/posts/show/6453

MizardX
Wahoo... Thank you for participating! Interesting approach. I am not sure it can be help consistently (with that, sometimes there is a delimiter, sometimes there is not), but +1 for the effort. However, you still need to properly address the limit cases (empty or null values)
VonC
I invite you to properly reinforce this class, thoroughly document it, make a pass with findbugs and checkstyle, and then publish it on a snippets website (to avoid cluttering this page with tons of code)
VonC
You won the challenge ! Errr... congratulation! As you know, from the code-challenge thread, there would be no special points or badges for that... (sigh): http://stackoverflow.com/questions/172184. But thank you for this contribution.
VonC
+4  A: 

I got here late, but returning to the original question, why not just use lookarounds?

Pattern p = Pattern.compile("(?<=\\w)(?=\\W)|(?<=\\W)(?=\\w)");
System.out.println(Arrays.toString(p.split("'ab','cd','eg'")));
System.out.println(Arrays.toString(p.split("boo:and:foo")));

output:

[', ab, ',', cd, ',', eg, ']
[boo, :, and, :, foo]

EDIT: What you see above is what appears on the command line when I run that code, but I now see that it's a bit confusing. It's difficult to keep track of which commas are part of the result and which were added by Arrays.toString(). SO's syntax highlighting isn't helping either. In hopes of getting the highlighting to work with me instead of against me, here's how those arrays would look it I were declaring them in source code:

{ "'", "ab", "','", "cd", "','", "eg", "'" }
{ "boo", ":", "and", ":", "foo" }

I hope that's easier to read. Thanks for the heads-up, @finnw.

Alan Moore
-1 First output does not match input
finnw
I know it looks wrong--it looked wrong to me when I came back to it just now, a year after the fact. The sample input was poorly chosen; I'll edit the post and try to clarify things.
Alan Moore
+4  A: 

I had a look at the above answers and honestly none of them I find satisfactory. What you want to do is essentially mimic the Perl split functionality. Why Java doesn't allow this and have a join() method somewhere is beyond me but I digress. You don't even need a class for this really. Its just a function. Run this sample program:

Some of the earlier answers have excessive null-checking, which I recently wrote a response to a question here:

http://stackoverflow.com/users/18393/cletus

Anyway, the code:

public class Split {
    public static List<String> split(String s, String pattern) {
        assert s != null;
        assert pattern != null;
        return split(s, Pattern.compile(pattern));
    }

    public static List<String> split(String s, Pattern pattern) {
        assert s != null;
        assert pattern != null;
        Matcher m = pattern.matcher(s);
        List<String> ret = new ArrayList<String>();
        int start = 0;
        while (m.find()) {
            ret.add(s.substring(start, m.start()));
            ret.add(m.group());
            start = m.end();
        }
        ret.add(start >= s.length() ? "" : s.substring(start));
        return ret;
    }

    private static void testSplit(String s, String pattern) {
        System.out.printf("Splitting '%s' with pattern '%s'%n", s, pattern);
        List<String> tokens = split(s, pattern);
        System.out.printf("Found %d matches%n", tokens.size());
        int i = 0;
        for (String token : tokens) {
            System.out.printf("  %d/%d: '%s'%n", ++i, tokens.size(), token);
        }
        System.out.println();
    }

    public static void main(String args[]) {
        testSplit("abcdefghij", "z"); // "abcdefghij"
        testSplit("abcdefghij", "f"); // "abcde", "f", "ghi"
        testSplit("abcdefghij", "j"); // "abcdefghi", "j", ""
        testSplit("abcdefghij", "a"); // "", "a", "bcdefghij"
        testSplit("abcdefghij", "[bdfh]"); // "a", "b", "c", "d", "e", "f", "g", "h", "ij"
    }
}
cletus
I'm confused: Java does have a split() method, which is modelled on Perl's, but much less powerful. The problem here is that Java's split() provides no way to return the delimiters, which you can achieve in Perl by enclosing the regex in capturing parentheses.
Alan Moore
A: 

If you can afford, use Java's replace(CharSequence target, CharSequence replacement) method and fill in another delimiter to split with. Example: I want to split the string "boo:and:foo" and keep ':' at its righthand String.

String str = "boo:and:foo";
str = str.replace(":","newdelimiter:");
String[] tokens = str.split("newdelimiter");

Important note: This only works if you have no further "newdelimiter" in your String! Thus, it is not a general solution. But if you know a CharSequence of which you can be sure that it will never appear in the String, this is a very simple solution.

Stephan
+4  A: 

You want to use lookarounds, and split on zero-width matches. Here are some examples:

public class SplitNDump {
    static void dump(String[] arr) {
        for (String s : arr) {
            System.out.format("[%s]", s);
        }
        System.out.println();
    }
    public static void main(String[] args) {
        dump("1,234,567,890".split(","));
        // "[1][234][567][890]"
        dump("1,234,567,890".split("(?=,)"));   
        // "[1][,234][,567][,890]"
        dump("1,234,567,890".split("(?<=,)"));  
        // "[1,][234,][567,][890]"
        dump("1,234,567,890".split("(?<=,)|(?=,)"));
        // "[1][,][234][,][567][,][890]"

        dump(":a:bb::c:".split("(?=:)|(?<=:)"));
        // "[][:][a][:][bb][:][:][c][:]"
        dump(":a:bb::c:".split("(?=(?!^):)|(?<=:)"));
        // "[:][a][:][bb][:][:][c][:]"
        dump(":::a::::b  b::c:".split("(?=(?!^):)(?<!:)|(?!:)(?<=:)"));
        // "[:::][a][::::][b  b][::][c][:]"
        dump("a,bb:::c  d..e".split("(?!^)\\b"));
        // "[a][,][bb][:::][c][  ][d][..][e]"

        dump("ArrayIndexOutOfBoundsException".split("(?<=[a-z])(?=[A-Z])"));
        // "[Array][Index][Out][Of][Bounds][Exception]"
        dump("1234567890".split("(?<=\\G.{4})"));   
        // "[1234][5678][90]"

        // Split at the end of each run of letter
        dump("Boooyaaaah! Yippieeee!!".split("(?<=(?=(.)\\1(?!\\1))..)"));
        // "[Booo][yaaaa][h! Yipp][ieeee][!!]"
    }
}

And yes, that is triply-nested assertion there in the last pattern.

Related questions

See also

polygenelubricants
Promise me when die you'll leave your brain to science.
Alan Moore