Hi, I'm trying to find a good way to get a Scanner to use a given delimiter as a token. For example, I'd like to split up a piece of text into digit and non-digit chunks, so ideally I'd just set the delimiter to \D and set some flag like useDelimiterAsToken, but after briefly looking through the API I'm not coming up with anything. Right now I've had to resort to using combined lookaheads/lookbehinds for the delimiter, which is somewhat painful:

scanner.useDelimiter("((?<=\\d)(?=\\D)|(?<=\\D)(?=\\d))");

This looks for any transition from a digit to a non-digit or vice-versa. Is there a more sane way to do this?

A: 

EDIT: The edited question is so different, my original answer doesn't apply at all. For the record, what you're doing is the ideal way to solve your problem, in my opinion. Your delimiter is the zero-width boundary between a digit and a non-digit, and there's no more succinct way to express that than what you posted.
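For anyone who wants to try the zero-width delimiter from the question, here is a minimal sketch. The class name `DigitChunks`, the helper `chunks`, and the sample input are made up for illustration; only `Scanner.useDelimiter` and the regex itself come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DigitChunks {
    // Zero-width boundary between a digit and a non-digit, in either direction
    // (the regex from the question, minus the redundant outer parentheses).
    static final String BOUNDARY = "(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)";

    // Hypothetical helper: collects every token the Scanner produces.
    static List<String> chunks(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(BOUNDARY);
            while (scanner.hasNext()) {
                out.add(scanner.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample input is an assumption, not from the question.
        System.out.println(chunks("abc123def45")); // prints [abc, 123, def, 45]
    }
}
```

Because the delimiter consumes no characters, nothing is thrown away: the digit runs and the non-digit runs all come back as tokens.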

EDIT2: (In response to the question asked in the comment.) You originally asked for an alternative to this regex:

"((?<=\\w)(?=[^\\w])|(?<=[^\\w])(?=\\w))"

That's almost exactly how \b, the word-boundary construct, works:

"(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)"

That is, a position that's either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. The difference is that \b can match at the beginning and end of the input. You obviously didn't want that, so I added lookarounds to exclude those conditions:

"(?!^)\\b(?!$)"

It's just a more concise way to do what your regex did. But then you changed the requirement to matching digit/non-digit boundaries, and there's no shorthand for that like \b for word/non-word boundaries.
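To make the comparison concrete, the sketch below lists the positions at which each pattern matches. The class name `BoundaryPositions`, the helper `positions`, and the ASCII sample string are assumptions for illustration; the four regexes are the ones quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryPositions {
    // Hypothetical helper: start index of every (zero-width) match of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "foo, bar"; // sample input, ASCII only

        // Plain \b also matches at the start and end of the input.
        System.out.println(positions("\\b", input));                            // [0, 3, 5, 8]
        // The lookaround spelling of \b behaves the same way.
        System.out.println(positions("(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)", input)); // [0, 3, 5, 8]
        // Excluding the ends leaves only the interior boundaries...
        System.out.println(positions("(?!^)\\b(?!$)", input));                  // [3, 5]
        // ...which is what the original word/non-word regex matches.
        System.out.println(positions("(?<=\\w)(?=[^\\w])|(?<=[^\\w])(?=\\w)", input)); // [3, 5]
    }
}
```

On this input the shortened pattern and the original regex agree; the sketch only checks one ASCII string, so treat it as an illustration rather than a proof of equivalence.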

Alan Moore
They are already prevented in the regex I'm using.
I know, I was just suggesting a shorter regex to accomplish the same thing. But you changed the requirements, so that's irrelevant now.
Alan Moore
I'm still a little confused as to how your solution would have helped in the first case. It's the same problem now, except with digits instead of words. I just didn't want word boundaries to be an option, since I'm actually doing something a bit more complicated.
See my second edit for my response to that comment.
Alan Moore
Ah makes sense, thanks.