I've implemented several different "scanners" in Java, from the Scanner class to simply using

String.split("\\s+")

but when there are several whitespace characters in a row, as in "the_quick____brown___fox" (imagine the underscores are spaces), they all return some of the whitespace as tokens. Any suggestions?

A: 

Use java.util.Scanner.

EJP
I'm getting the same number of whitespace tokens - roughly 14k for my test input - with Scanner as with String.split.
You shouldn't be getting any 'whitespace tokens'. Whitespace isn't a token; it is the stuff in between tokens. java.util.Scanner lets you define what your tokens are and what your delimiters are, i.e. what counts as whitespace. Don't waste its time and yours by making it return whitespace to you.
EJP
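For illustration, here is a minimal sketch of the Scanner approach described above. It relies on Scanner's default delimiter, which treats any run of whitespace as a single separator, so no whitespace or empty tokens come back. The class and method names (ScannerTokens, tokenize) are made up for this example and are not from the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerTokens {
    // Hypothetical helper: collects the whitespace-delimited tokens of the input.
    // With the default delimiter, consecutive whitespace acts as one separator.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Leading, trailing, and repeated spaces are all skipped.
        System.out.println(tokenize("  the quick    brown   fox  "));
        // prints [the, quick, brown, fox]
    }
}
```

If a different separator is needed, Scanner.useDelimiter accepts a regular expression that defines what sits between tokens.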
A: 

I'm not sure what you are talking about. For example,

String[] parts = "the quick    brown   fox".split("\\s+");

correctly tokenizes the string with no leading or trailing whitespace on any token, and no empty tokens. If the input string may have leading or trailing whitespace, then calling String.trim() on it before splitting will remove the possibility of empty tokens.

EDIT: I surmise from your other comment that you are reading the input a line at a time and then tokenizing each line. You probably need to trim each line before tokenizing.
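A minimal sketch of that trim-then-split step, assuming each line arrives as a separate String. The names LineTokenizer and tokenizeLine are hypothetical. The blank-line check is an addition of this sketch: splitting an empty string returns a single empty token, so an all-whitespace line would otherwise still produce one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineTokenizer {
    // Hypothetical helper: trims the line, then splits on runs of whitespace.
    static List<String> tokenizeLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            // "".split("\\s+") would return one empty string, so return no tokens instead.
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenizeLine("  the quick    brown   fox  "));
        // prints [the, quick, brown, fox]
        System.out.println(tokenizeLine("     "));
        // prints []
    }
}
```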

Stephen C