I'm trying to create a regex to tokenize a string. An example string would be:

John Mary, "Name=blah;Name=blahAgain" "Hand=1,2"

I'm trying to get back:

  • John
  • Mary
  • Name=blah;Name=blahAgain
  • Hand=1,2
A: 

For that specific example, I would do:

([^\s]*)\s+([^,\s]*)\s*,\s*"([^"]*)"\s+"([^"]*)"

update: modified to split Mary and John
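
Since the question turned out to be about Java, here is a minimal sketch of how the pattern above could be applied with `java.util.regex`. The class name `FixedPatternExample` and the method `extract` are made up for illustration; the only real work is escaping each `\` and `"` for a Java string literal. It assumes every input line has exactly the same four-field shape as the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedPatternExample {
    // Same regex as above, with backslashes and quotes escaped for Java.
    private static final Pattern LINE = Pattern.compile(
            "([^\\s]*)\\s+([^,\\s]*)\\s*,\\s*\"([^\"]*)\"\\s+\"([^\"]*)\"");

    // Returns the four captured fields, or an empty list if the input
    // does not have the expected shape.
    static List<String> extract(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = LINE.matcher(input);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                tokens.add(m.group(i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(
                extract("John Mary, \"Name=blah;Name=blahAgain\" \"Hand=1,2\""));
    }
}
```

Note that `matches()` requires the whole string to fit the pattern, so a line with a different number of fields yields no tokens at all.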

slebetman
Thanks for the advice, but I'm using the Scanner class in java and it doesn't seem to like it.
binarymelon
@slebetman: Mary and John aren't split with your regexp
Antony Hatchkins
That's strange. I'm quite sure that the above is fairly plain old-school regexp, without any weird PCRE or egrep stuff. Are you sure you've escaped the `"` with `\"` in java?
slebetman
@Antony: In the original question, Mary and John were specified as a single token; they were not meant to be split. Obviously the spec has changed. Sigh... never expected feature creep on SO.
slebetman
I'm fairly sure I've entered it correctly. I'm using the following: `Pattern regex = Pattern.compile("([^,]*)\\s*,\\s*\"([^\"]*)\"\\s+\"([^\"]*)\"");`
binarymelon
Sorry, don't know java. Perhaps you should add java to your question tag to catch the eyes of java coders.
slebetman
The problem is that it doesn't catch John. You need another group in the beginning, like so: ([^ ]+) ([^,\s]*)\s*,\s*"([^"]*)"\s+"([^"]*)"
ferdystschenko
@ferdystschenko: Aha, good catch. Fixed.
slebetman
A: 

This was easy:

([^ ])+
Monis Iqbal
This would get the comma after Mary as well as the quotes. Also it wouldn't catch all desired fields at once.
ferdystschenko
Actually, the group captures only a single character each time, unless you put the '+' inside the parentheses.
ferdystschenko
A: 

Since you're using Java, why not use StringTokenizer? E.g.:

StringTokenizer st = new StringTokenizer("String to tokenize", " ");
while (st.hasMoreTokens())
{
   // get next token
   String someVariable = st.nextToken();
}
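
For the string in the question, a single space delimiter would leave the comma on `Mary,` and the quotes on the last two tokens. A rough sketch of one way around that, assuming quoted values never contain spaces: use both space and `"` as delimiter characters, then strip a trailing comma. The class `TokenizerExample` and method `tokenize` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerExample {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        // Each character in the second argument is a delimiter: space and quote.
        StringTokenizer st = new StringTokenizer(input, " \"");
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            // Drop a separator comma left at the end of a token, e.g. "Mary,".
            if (token.endsWith(",")) {
                token = token.substring(0, token.length() - 1);
            }
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(
                tokenize("John Mary, \"Name=blah;Name=blahAgain\" \"Hand=1,2\""));
    }
}
```

The comma is not used as a delimiter here because that would also split `Hand=1,2`.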
Chris
A: 

This works for your example:

(\w+) (\w+), "([^"]+)" "([^"]+)"

Do all your strings have exactly the same pattern?
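
If the strings do not all have the same shape, one alternative is to match one token at a time instead of the whole line. This is only a sketch under the assumption that a token is either a double-quoted run (without embedded quotes) or a run of characters that are not whitespace, commas, or quotes; `QuotedTokenizer` and `tokenize` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedTokenizer {
    // Group 1: contents of a quoted token. Group 2: a bare token.
    private static final Pattern TOKEN =
            Pattern.compile("\"([^\"]*)\"|([^\\s,\"]+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        // find() skips over separators (spaces, commas) between tokens.
        while (m.find()) {
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(
                tokenize("John Mary, \"Name=blah;Name=blahAgain\" \"Hand=1,2\""));
    }
}
```

Because the quoted alternative is tried first, commas inside quotes (as in `Hand=1,2`) are kept, while commas outside quotes act as separators.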

ferdystschenko
A: 

One possible way: split at a comma followed by whitespace, or at any single whitespace character or quotation mark:

"John Mary, \"Name=blah;Name=blahAgain\" \"Hand=1,2\"".split(",\\s|[\\s\"]")
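
One caveat: wherever two delimiters are adjacent (for example the space followed by an opening quote), `split` returns an empty string between them. A small sketch of filtering those out, with `SplitExample` and `tokenize` as made-up names:

```java
import java.util.ArrayList;
import java.util.List;

class SplitExample {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        // Same split expression as above.
        for (String part : input.split(",\\s|[\\s\"]")) {
            // Skip the empty strings produced between adjacent delimiters.
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(
                tokenize("John Mary, \"Name=blah;Name=blahAgain\" \"Hand=1,2\""));
    }
}
```

The comma in `Hand=1,2` survives because it is not followed by whitespace.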
Fabian Steeg