I wish to have the following String

!cmd 45 90 "An argument" Another AndAnother "Another one in quotes" to become an array of the following

{ "!cmd", "45", "90", "An argument", "Another", "AndAnother", "Another one in quotes" }

I tried

new StringTokenizer(cmd, "\"")

but this would return "Another" and "AndAnother" as "Another AndAnother", which is not the desired effect.

Thanks.

EDIT: I have changed the example yet again, this time I believe it explains the situation best although it is no different than the second example.

+1  A: 

The example you have here would just have to be split by the double quote character.

Nikolaos
For his example that would work, but it wouldn't solve this scenario: one two three "four five six" seven eight nine "ten"
Andrew Garrison
A: 

Try this:

String str = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
String[] strArr = str.split("\"|\\s");

It's kind of tricky because you need to escape the double quotes. This regular expression should tokenize the string using either a whitespace (\s) or a double quote.

You should use String's split method because it accepts regular expressions, whereas the constructor argument for delimiter in StringTokenizer doesn't. At the end of what I provided above, you can just add the following:

String s = ""; // must be initialized before it can be appended to
for (String k : strArr) {
    s += k;
}
StringTokenizer strTok = new StringTokenizer(s);
danyim
try your approach on his new example, it won't work anymore.
Andrew Garrison
But str.split("[\"\\s]") will also split by spaces inside quotes...
Eyal Schneider
Yes, this answer is incorrect. I am working on a new solution. This is an interesting problem.
danyim
A: 

try this:

String str = "One two \"three four\" five \"six seven eight\" nine \"ten\"";
String[] strings = str.split("[ ]?\"[ ]?");
smp7d
It returns "One two" instead of "One" and "two".
Ploo
A: 

I don't know the context of what you're trying to do, but it looks like you're trying to parse command line arguments. In general, this is pretty tricky with all the escaping issues; if this is your goal, I'd personally look at something like JCommander.

carnold
+3  A: 

Do it the old fashioned way. Make a function that looks at each character in a for loop. If the character is a space, take everything up to that (excluding the space) and add it as an entry to the array. Note the position, and do the same again, adding that next part to the array after a space. When a double quote is encountered, mark a boolean named 'inQuote' as true, and ignore spaces when inQuote is true. When you hit quotes when inQuote is true, flag it as false and go back to breaking things up when a space is encountered. You can then extend this as necessary to support escape chars, etc.

Could this be done with a regex? I don't know; I guess so. But the whole function would take less to write than this reply did.
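
A minimal sketch of the loop described above, extended with backslash escapes. The class name QuoteAwareSplitter, the hasToken flag, and the rule that a backslash makes the next character literal are assumptions made for illustration, not part of the original description.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplitter {
    // Splits on spaces outside double quotes; a backslash makes the next character literal.
    public static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false;
        boolean escaped = false;   // previous character was an unescaped backslash
        boolean hasToken = false;  // a token has started (lets "" produce an empty token)
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (escaped) {
                current.append(c);          // take the escaped character literally
                escaped = false;
            } else if (c == '\\') {
                escaped = true;
                hasToken = true;
            } else if (c == '"') {
                inQuote = !inQuote;         // quotes toggle the state and are not kept
                hasToken = true;
            } else if (c == ' ' && !inQuote) {
                if (hasToken) {             // a space outside quotes ends the token
                    tokens.add(current.toString());
                    current.setLength(0);
                    hasToken = false;
                }
            } else {
                current.append(c);
                hasToken = true;
            }
        }
        if (hasToken) {
            tokens.add(current.toString()); // flush the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        String cmd = "!cmd 45 90 \"An argument\" Another AndAnother \"Another one in quotes\"";
        System.out.println(split(cmd));
        // prints [!cmd, 45, 90, An argument, Another, AndAnother, Another one in quotes]
    }
}
```

Unlike a version that drops empty tokens, this sketch keeps an empty quoted pair ("") as an empty string; remove the hasToken flag if that is not wanted.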

GrandmasterB
+1: Smart can solve hard problems, but wise man avoids them.
Cloudanger
Ditto. I've written simple parsers like this a bazillion times. Sure, you can find some open source library to do it, or come up with a clever regex, but then you've added more complexity. Why not solve simple problems with simple tools? When I need to put in a screw, I use a screwdriver; I don't search for a solar-powered fully automated screw-putter-inner robot.
Jay
A: 

In an old fashioned way:

import java.util.ArrayList;
import java.util.List;

public static String[] split(String str) {
    str += " "; // Trailing space so the last unquoted token is flushed too
    List<String> strings = new ArrayList<String>();
    boolean inQuote = false;
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        // A quote always acts as a delimiter; a space only does so outside quotes
        if (c == '"' || (c == ' ' && !inQuote)) {
            if (c == '"') {
                inQuote = !inQuote;
            }
            // Flush the token once we are outside quotes; empty tokens are skipped
            if (!inQuote && sb.length() > 0) {
                strings.add(sb.toString());
                sb.setLength(0);
            }
        } else {
            sb.append(c);
        }
    }
    return strings.toArray(new String[strings.size()]);
}

I assume that nested quotes are illegal, and also that empty tokens can be omitted.

Eyal Schneider
+7  A: 

It's much easier to use a java.util.regex.Matcher and do a find() rather than any kind of split in this kind of scenario.

That is, instead of defining the pattern for the delimiter between the tokens, you define the pattern for the tokens themselves.

Here's an example:

    String text = "1 2 \"333 4\" 55 6    \"77\" 8 999";
    // 1 2 "333 4" 55 6    "77" 8 999

    String regex = "\"([^\"]*)\"|(\\S+)";

    Matcher m = Pattern.compile(regex).matcher(text);
    while (m.find()) {
        if (m.group(1) != null) {
            System.out.println("Quoted [" + m.group(1) + "]");
        } else {
            System.out.println("Plain [" + m.group(2) + "]");
        }
    }

The above prints (as seen on ideone.com):

Plain [1]
Plain [2]
Quoted [333 4]
Plain [55]
Plain [6]
Quoted [77]
Plain [8]
Plain [999]

The pattern is essentially:

"([^"]*)"|(\S+)
 \_____/  \___/
    1       2

There are 2 alternates:

  • The first alternate matches the opening double quote, a sequence of anything but double quote (captured in group 1), then the closing double quote
  • The second alternate matches any sequence of non-whitespace characters, captured in group 2
  • The order of the alternates matters in this pattern: if (\S+) came first, it would consume an opening double quote as part of a plain token
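
To get the array the question asks for, the same two-alternate pattern can be wrapped in a small helper. This is only a sketch; the class name MatcherTokenizer and the method name tokenize are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherTokenizer {
    // Group 1: contents of a double-quoted segment; group 2: a run of non-whitespace.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    public static String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Exactly one of the two groups is non-null for each match
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String cmd = "!cmd 45 90 \"An argument\" Another AndAnother \"Another one in quotes\"";
        for (String token : tokenize(cmd)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Run on the command string from the question, this yields the seven tokens listed there, with the quotes stripped from the quoted ones.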

Note that this does not handle escaped double quotes within quoted segments. If you need to do this, then the pattern becomes more complicated, but the Matcher solution still works.
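
As a rough sketch of what the more complicated pattern might look like, the quoted alternate can be changed to accept either a non-quote, non-backslash character or a backslash followed by any character. The escaping convention (backslash makes the next character literal, handled only inside quoted segments) and the class name EscapedQuoteTokenizer are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedQuoteTokenizer {
    // Regex: "((?:[^"\\]|\\.)*)"|(\S+)
    // Group 1: quoted contents, where \x stands for a literal x; group 2: plain token.
    private static final Pattern TOKEN =
            Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"|(\\S+)");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                // Drop the backslash from each escape pair inside the quoted segment
                tokens.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                tokens.add(m.group(2)); // plain tokens are kept as-is
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Raw input: one "two \"three\" four" five
        String text = "one \"two \\\"three\\\" four\" five";
        System.out.println(tokenize(text));
        // prints [one, two "three" four, five]
    }
}
```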

Appendix

Note that StringTokenizer is a legacy class. It's recommended to use java.util.Scanner or String.split, or of course java.util.regex.Matcher for most flexibility.
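
For plain whitespace-separated input (no quotes), the two simpler replacements might look like the sketch below. The class and method names are hypothetical, and neither variant understands quoted segments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class WhitespaceTokens {
    // String.split with a regex; trim() avoids a leading empty token
    public static String[] withSplit(String text) {
        return text.trim().split("\\s+");
    }

    // Scanner uses whitespace as its default delimiter
    public static List<String> withScanner(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "  1 2   55 6 ";
        System.out.println(String.join(",", withSplit(text))); // prints 1,2,55,6
        System.out.println(withScanner(text));                 // prints [1, 2, 55, 6]
    }
}
```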

polygenelubricants
We have a winner! :) Thanks so much, works perfectly. Thanks for everyone else's input too; I just find this the most suitable. :)
Ploo
If I had asked the question, this is the answer I would accept. Thanks for this, I knew there must be something better than the old-fashioned way!
Rekin
@Ploo: An example of another pattern that may be of interest: `"([^"]*)"|'([^']*)'|([^"' ]+)` http://www.rubular.com/r/cjzuqus7oa : i.e. double quoted (group 1) or single quoted (group 2) or just plain (group 3). No quote escaping.
polygenelubricants
TBH I was just browsing recent questions, but I had to +1 for such a well written and comprehensive answer!
Adam