Hi,

Maybe it's because it's end of day on a Friday, and I have already found a work-around, but this is killing me.

I am using Java but am a .NET developer.

I have a string and I need to split it on a comma. Let's say it's a row in a CSV file that has 210 columns. line.split(",").length will sometimes be 199, while the count of ',' will be 208 or 209. I even count in two different ways to be sure (using a regex, then, after losing my sanity, manually looping through and checking each character).

What's the super-obvious-hit-face-on-desk thing I'm missing here? Why isn't foo.split(delim).length == CountOfOccurences(foo,delim) all the time, only sometimes?

thanks much

+6  A: 

First, there's an obvious difference of one. If there are 200 columns, all with text, there are 199 commas. Second, Java drops trailing empty strings by default. You can change this by passing a negative number as the second argument.

"foo,,bar,baz,,".split(",")

is:

{foo,,bar,baz}

an array of 4 elements. But

"foo,,bar,baz,,".split(",", -1)

is:

{foo,,bar,baz,,}

with all 6.

Note that only trailing empty strings are dropped by default.

Finally, don't forget that the String is compiled into a regex. This is not applicable here, since , is not a special character, but you should keep it in mind.
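
To illustrate the regex point, here is a minimal sketch that assumes a pipe-delimited row instead of a comma-delimited one. The class name SplitDelimiterDemo and the helper splitLiteral are made up for this example; Pattern.quote and the two-argument split are standard JDK calls.

```java
import java.util.regex.Pattern;

public class SplitDelimiterDemo {
    // Hypothetical helper: treats the delimiter as literal text rather than
    // as a regex, and keeps trailing empty fields by passing a negative limit.
    static String[] splitLiteral(String line, String delimiter) {
        return line.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        String row = "a|b||";

        // An unquoted "|" is regex alternation between two empty patterns,
        // so it matches between characters instead of on the pipe itself.
        System.out.println(row.split("|", -1).length);

        // Quoted, the pipe is an ordinary delimiter: a, b, "", "" gives 4 fields.
        System.out.println(splitLiteral(row, "|").length);
    }
}
```

For a comma the quoting changes nothing, but it is probably a safe habit whenever the delimiter comes from configuration or user input.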

Matthew Flaschen
wow. I spent half my day because for some reason Sun decided to make .split() completely unintuitive. Or maybe I'm just a bitter .NET developer. Holy shit, thanks, I can leave in just 4 more hours, now that I have to make up for lost time. (I'm writing a parser for csv and of course there's a bunch of records which have empty columns at the END, only sometimes. This shit drove me fucking crazy!)
dferraro
at least SUN documented it: "Trailing empty strings are therefore not included in the resulting array."
Carlos Heuberger
+8  A: 

There are a couple things happening. First, if you have three items like a,b,c and split on comma, you'll have three entries, one more than the number of commas.

But what you're dealing with probably comes from consecutive delimiters, for example: a,,,,b,c,,,,,

The ones at the end get dropped. Check the Java documentation for the split function: http://download.java.net/jdk7/docs/api/java/lang/String.html
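
As a rough sketch of the mismatch, the row above can be run through both forms of split and compared with a plain comma count. The class name TrailingFieldsDemo and the helper countChar are invented for illustration only.

```java
public class TrailingFieldsDemo {
    // Hypothetical helper: counts how often a character occurs in a string.
    static int countChar(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String row = "a,,,,b,c,,,,,";

        System.out.println(countChar(row, ','));       // 10 commas
        System.out.println(row.split(",").length);     // 6: the five trailing empty fields are dropped
        System.out.println(row.split(",", -1).length); // 11: commas + 1, nothing dropped
    }
}
```

With a negative limit the field count is always the comma count plus one, which is the invariant the question was expecting.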

JoshD
And to shortcut the docs, `split(str, -1)` will preserve trailing empty Strings.
Michael Brewer-Davis
A: 

Short example: foo = "1,2" and

foo.split(",").length = 2
 count(foo, ",") = 1

Probably you have a mistake in your code. Here is an example in Java code:

       import java.util.regex.Matcher;
       import java.util.regex.Pattern;

       String row = "1,2,3,4,,5"; // second example: 1,2,3,5,,
       // prints 6 for the first example, 4 for the second (trailing empty strings are dropped)
       System.out.println(row.split(",").length);

       // count how many commas the row contains
       Pattern pattern = Pattern.compile(",");
       Matcher m = pattern.matcher(row);

       int nr = 0;
       while (m.find()) {
           nr++;
       }
       System.out.println(nr); // prints 5 for both examples
Skarab
A: 

Is it omitting blanks?

Do you have something like "a,b,c,,d,e" or trailing delimiters like "a,b,c,,,,"?

Are there extra delimiters in the cell data?

Matthew Cole
+1  A: 

As others have pointed out, String.split has some very non-intuitive behaviour.

If you're using Google's Guava open-source Java library, there's a Splitter class which gives a much nicer (in my opinion) API for this, with more flexibility:

String input = "foo, bar,";

Splitter.on(',').split(input);
// returns "foo", " bar", ""

Splitter.on(',').omitEmptyStrings().split(input);
// returns "foo", " bar"

Splitter.on(',').omitEmptyStrings().trimResults().split(input);
// returns "foo", "bar"
Cowan