Suppose you have the following string:

white sand, tall waves, warm sun

It's easy to write a regular expression that will match the delimiters, which the Java String.split() method can use to give you an array containing the tokens "white sand", "tall waves" and "warm sun":

\s*,\s*

Now say you have this string:

white sand and tall waves and warm sun

Again, the regex to split the tokens is easy (ensuring you don't get the "and" inside the word "sand"):

\s+and\s+

Now, consider this string:

white sand, tall waves and warm sun

Can a regex be written that will match the delimiters correctly, allowing you to split the string into the same tokens as in the previous two cases? Alternatively, can a regex be written that will match the tokens themselves and omit the delimiters? (Any amount of white space on either side of a comma or the word "and" should be considered part of the delimiter.)

Edit: As has been pointed out in the comments, the correct answer should robustly handle delimiters at the beginning or end of the input string. The ideal answer should be able to take a string like ",white sand, tall waves and warm sun and " and provide these exact three tokens:

[ "white sand", "tall waves", "warm sun" ]

...without extra empty tokens or extra white space at the start or end of any token.

Edit: It's been pointed out that extra empty tokens are unavoidable with String.split(), so that's been removed as a criterion for the "perfect" regex.
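To illustrate the empty-token behavior mentioned in the edit above, here is a minimal sketch using the comma regex from the start of the question. The class and method names (SplitEmptyTokens, splitOnCommas) and the sample input are made up for this illustration.

```java
import java.util.Arrays;

public class SplitEmptyTokens {
    // Hypothetical helper: splits on commas with optional surrounding
    // white space, using the first regex from the question.
    static String[] splitOnCommas(String input) {
        return input.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        // The input starts and ends with a delimiter.
        String[] tokens = splitOnCommas(",white sand, tall waves,");
        // Print each token in quotes so an empty one is visible.
        Arrays.stream(tokens).forEach(t -> System.out.println("\"" + t + "\""));
    }
}
```

With this input, the single-argument String.split() keeps an empty string before the leading comma but drops the empty string after the trailing one, so the array has three elements and the first is empty.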


Thanks everyone for your responses! I've tried to make sure I upvoted everyone who contributed a workable regex that wasn't essentially a duplicate. Dan's answer was the most robust (it even handles ",white sand, tall waves,and warm sun and " reasonably, with that odd comma placement after the word "waves"), so I've marked his as the accepted answer. The regex provided by nsayer was a close second.

+2  A: 

This should catch both 'and' and ','

(?:\sand|,)\s
Unkwntech
It also matches the and inside "sand", which you don't want.
nsayer
Using the regex tester at http://regexpal.com/, this appears to work, except that it doesn't grab enough white space. I'll upvote, though, as it is workable.
Robert J. Walker
@nsayer it will not catch the and in sand, that is why I have the \s
Unkwntech
+1  A: 

Yes, that's what regexes are for:

\s*(?:and|,)\s*

The | defines alternatives, the () groups the alternatives, and the ?: ensures the regex engine won't try to retain the value between the ().

EDIT: to avoid the "sand" pitfall (thanks for pointing it out):

\s*(?:[^s]and|,)\s*
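
As a rough check of the two regexes above, here is a small Java sketch. The class and method names (SandPitfall, splitNaive, splitExcludingS) and the second sample string are invented for illustration and are not from the thread.

```java
import java.util.Arrays;

public class SandPitfall {
    // First regex from this answer: "and" may match anywhere, even inside a word.
    static String[] splitNaive(String input) {
        return input.split("\\s*(?:and|,)\\s*");
    }

    // Edited regex: requires a non-'s' character before "and",
    // and that character is consumed as part of the delimiter.
    static String[] splitExcludingS(String input) {
        return input.split("\\s*(?:[^s]and|,)\\s*");
    }

    public static void main(String[] args) {
        // The "and" inside "sand" is treated as a delimiter here.
        System.out.println(Arrays.toString(
                splitNaive("white sand, tall waves and warm sun")));
        // "sand" now survives, but other words ending in "and" do not.
        System.out.println(Arrays.toString(splitExcludingS("brass band")));
    }
}
```

The first call splits "white sand" into "white s" followed by an empty token. The edited regex handles the question's input, but since [^s] matches any character other than "s", a word such as "band" is swallowed whole as a delimiter.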
e-satis
Nope. Matches the and inside "sand".
nsayer
@nsayer no it does not.
Unkwntech
+2  A: 

The problem with

\s*(,|(and))\s*

is that it would split up "sand" inappropriately.

The problem with

\s+(,|(and))\s+

is that it requires spaces around commas.

The right answer probably has to be

(\s*,\s*)|(\s+and\s+)

I'll cheat a little on the concept of returning the strings surrounded by delimiters by suggesting that lots of languages have a "split" operator that does exactly what you want when the regex specifies the form of the delimiter itself. See the Java String.split() function.
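
For what it's worth, here is a minimal sketch of that split approach in Java, using the last regex above. The class and method names (AlternationSplit, split) are placeholders chosen for this example.

```java
import java.util.Arrays;

public class AlternationSplit {
    // Delimiter is either a comma with optional white space around it,
    // or the word "and" with mandatory white space on both sides.
    static String[] split(String input) {
        return input.split("(\\s*,\\s*)|(\\s+and\\s+)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                split("white sand, tall waves and warm sun")));
    }
}
```

Because "and" only counts as a delimiter when it has white space on both sides, the "and" inside "sand" is left alone. The same requirement means an "and" directly after a comma (as in "tall waves,and warm sun") stays attached to the following token.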

nsayer
This works, but Dan brings up a good point regarding delimiters at the start and end. See the comment on his answer, since your regexes produce equivalent results, as far as I can tell. I've upvoted yours because its results are workable, although not perfect.
Robert J. Walker
This is the right answer
Marcio Aguiar
+2  A: 

Would this work?

\s*(,|\s+and)\s+
Shinhan
Dan makes a good point: it should handle delimiters at the front and back appropriately. I've upvoted your answer, since it does work, except for that case.
Robert J. Walker
A: 
(?:(?<!s)and\s+|\,\s+)

Might work

I don't have a way to test it, but I took out the space-only matcher.

Quintin Robinson
Sorry, it doesn't. It breaks up the individual words as separate tokens ("white" "sand") instead of keeping them together ("white sand").
Robert J. Walker
Oh, my bad, I misread.
Quintin Robinson
+5  A: 

This should be pretty resilient, and handle stuff like delimiters at the end of the string ("foo and bar and ", for example)

\s*(?:\band\b|,)\s*
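
A possible way to use this from Java, sketched under the assumption that empty tokens should simply be filtered out after splitting. The class and method names (WordBoundarySplit, tokens) are hypothetical, and the filtering step is an addition for illustration, not part of the regex itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WordBoundarySplit {
    // \b on both sides keeps "and" from matching inside words like "sand".
    private static final Pattern DELIMITER =
            Pattern.compile("\\s*(?:\\band\\b|,)\\s*");

    // Splits on the delimiter, then drops any empty tokens left behind by
    // leading or adjacent delimiters.
    static List<String> tokens(String input) {
        return Arrays.stream(DELIMITER.split(input))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokens(",white sand, tall waves and warm sun and "));
        System.out.println(tokens(",white sand, tall waves,and warm sun and "));
    }
}
```

Both inputs should come out as the three tokens "white sand", "tall waves" and "warm sun". Without the filter, the split leaves an empty string at the front, and the second input also has one between the "," and the "and".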
Dan
Pretty close. Using the string ",white sand, tall waves and warm sun and " causes an empty String match at the beginning, but that's workable. I've upvoted your answer as a result. It basically produces the same results as nsayer's answer.
Robert J. Walker
Actually, your answer is even more robust. It even handles ",white sand, tall waves,and warm sun and " reasonably. Good show!
Robert J. Walker
A: 

Maybe:

((\s*,\s*)|(\s+and\s+))

I'm not a Java programmer, so I'm not sure if Java regex allows '?'

Lucas Oman
Except for an extra set of parentheses, your answer is identical to nsayer's. Not that it's a bad answer. :)
Robert J. Walker