I have a string vaguely like this:

foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"

that I want to split by commas -- but I need to ignore commas in quotes. How can I do this? It seems a regexp approach fails; I suppose I can manually scan and enter a different mode when I see a quote, but it would be nice to use preexisting libraries. (edit: I guess I meant libraries that are already part of the JDK or part of a commonly used library like Apache Commons.)

the above string should split into:

foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"

note: this is NOT a CSV file, it's a single string contained in a file with a larger overall structure

+13  A: 

Try:

public class Main { 
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        String[] tokens = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}

Output:

> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"

In other words: split on the comma only if that comma has zero or an even number of quotes ahead of it.

Needless to say, it won't work if your Strings can contain escaped quotes. In that case, a proper CSV parser should be used.

Or, a bit friendlier for the eyes:

public class Main { 
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";

        String otherThanQuote = " [^\"] ";
        String quotedString = String.format(" \" %s* \" ", otherThanQuote);
        String regex = String.format("(?x) "+ // enable comments, ignore white spaces
                ",                         "+ // match a comma
                "(?=                       "+ // start positive look ahead
                "  (                       "+ //   start group 1
                "    %s*                   "+ //     match 'otherThanQuote' zero or more times
                "    %s                    "+ //     match 'quotedString'
                "  )*                      "+ //   end group 1 and repeat it zero or more times
                "  %s*                     "+ //   match 'otherThanQuote'
                "  $                       "+ // match the end of the string
                ")                         ", // stop positive look ahead
                otherThanQuote, quotedString, otherThanQuote);

        String[] tokens = line.split(regex);
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}

which produces the same as the first example.

Bart Kiers
amazing!!!!!!!!!!!!
Jason S
neat, thanks for the detailed explanation.
Jason S
According to RFC 4180: Sec 2.6: "Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes." Sec 2.7: "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote." So, if `String line = "equals: =,\"quote: \"\"\",\"comma: ,\""`, all you need to do is strip off the extraneous double quote characters.
Paul Hanbury
@Bart: my point being that your solution still works, even with embedded quotes
Paul Hanbury
solution still works for RFC 4180-format CSVs, that is.
Jason S
Paul, I didn't know of an RFC for CSV parsing, thanks for the info. Yes, then my solution will still work. Although in practice, other escape chars, like `\\`, are used in which case regex is not an appropriate tool for this job.
Bart Kiers
And you're welcome Jason!
Bart Kiers
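
The point in the comments about RFC 4180-style doubled quotes can be checked with a small sketch. It reuses the lookahead regex from the answer and then strips the enclosing quotes and collapses each "" to a single ". The class and method names here (Rfc4180SplitSketch, splitAndUnquote) are made up for illustration, and this assumes fields are either fully quoted or not quoted at all; it is not a full CSV parser.

```java
import java.util.ArrayList;
import java.util.List;

class Rfc4180SplitSketch {
    // Same lookahead as in the answer: split on a comma only if an even
    // number of quotes follows it.
    private static final String SPLIT_REGEX = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // Hypothetical helper: split on unquoted commas, then strip the enclosing
    // quotes of a fully quoted field and collapse "" to ".
    static List<String> splitAndUnquote(String line) {
        List<String> fields = new ArrayList<>();
        for (String token : line.split(SPLIT_REGEX, -1)) {
            if (token.length() >= 2 && token.startsWith("\"") && token.endsWith("\"")) {
                token = token.substring(1, token.length() - 1).replace("\"\"", "\"");
            }
            fields.add(token);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "equals: =,\"quote: \"\"\",\"comma: ,\"";
        for (String field : splitAndUnquote(line)) {
            System.out.println("> " + field);
        }
    }
}
```

Doubled quotes always come in pairs, so they never change whether the number of quotes after a comma is odd or even, which is why the regex keeps working for this escaping style.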
A: 

I would do something like this:

boolean foundQuote = false;
if (currentString.charAt(currentStringIndex) == '"') {
    foundQuote = true;
}
if (foundQuote) {
    // do nothing
} else {
    String[] split = currentString.split(",");
}

Woot4Moo
+10  A: 

http://sourceforge.net/projects/javacsv/

http://opencsv.sourceforge.net/

http://stackoverflow.com/questions/101100/csv-api-for-java

http://stackoverflow.com/questions/200609/can-you-recommend-a-java-library-for-reading-and-possibly-writing-csv-files

http://stackoverflow.com/questions/123/csv-file-to-xml

Jonathan Feinberg
Good call recognizing that the OP was parsing a CSV file. An external library is extremely appropriate for this task.
Stefan Kendall
well, it's not a CSV file, but thanks, great answer!
Jason S
(just a single string)
Jason S
But the string is a CSV string; you should be able to use a CSV api on that string directly.
Michael Brewer-Davis
yes, but this task is simple enough, and a much smaller part of a larger application, that I don't feel like pulling in another external library.
Jason S
If the task was simple enough, then you wouldn't be asking the question here...
Thorbjørn Ravn Andersen
not necessarily... my skills are often adequate, but they benefit from being honed.
Jason S
A: 

Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, for every quote grouping, replace that grouping with __IDENTIFIER_1 or some other indicator, and record the grouping in a Map<String, String> from indicator to original text.

After you split on comma, replace all mapped identifiers with the original string values.
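
A rough sketch of these steps, assuming quotes are never escaped and that the placeholder text never occurs in real input. The class name, the marker format (__IDENTIFIER_n__, with a closing __ so that marker 1 is not a prefix of marker 10), and the use of indexOf to find quote pairs are choices made for this illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PlaceholderSplitSketch {
    // Hypothetical marker prefix; assumed never to appear in the input itself.
    private static final String MARKER = "__IDENTIFIER_";

    static List<String> split(String input) {
        // Step 1: replace each "..." grouping with a numbered placeholder.
        Map<String, String> groups = new LinkedHashMap<>();
        StringBuilder masked = new StringBuilder();
        int pos = 0;
        while (true) {
            int open = input.indexOf('"', pos);
            int close = (open < 0) ? -1 : input.indexOf('"', open + 1);
            if (close < 0) {
                // No further complete quote pair: copy the rest unchanged.
                masked.append(input.substring(pos));
                break;
            }
            String key = MARKER + groups.size() + "__";
            groups.put(key, input.substring(open, close + 1));
            masked.append(input, pos, open).append(key);
            pos = close + 1;
        }

        // Step 2: split on commas, then restore the original quoted text.
        List<String> result = new ArrayList<>();
        for (String token : masked.toString().split(",")) {
            for (Map.Entry<String, String> entry : groups.entrySet()) {
                token = token.replace(entry.getKey(), entry.getValue());
            }
            result.add(token);
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        for (String t : split(input)) {
            System.out.println("> " + t);
        }
    }
}
```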

Stefan Kendall
and how to find quote groupings without crazy regexS?
Kai
For each character, if character is quote, find next quote and replace with grouping. If no next quote, done.
Stefan Kendall
A: 

Try a lookaround like (?<!\"),(?!\"). This should match commas that are not directly preceded or followed by a ".

Matthew Sowders
+1  A: 

You're in that annoying boundary area where regexps almost won't do (as Bart pointed out, escaped quotes would make life hard), and yet a full-blown parser seems like overkill.

If you are likely to need greater complexity any time soon I would go looking for a parser library. For example this one

djna
The jack library linked is now called https://javacc.dev.java.net/
Nathan Voxland
+2  A: 

I was impatient and chose not to wait for answers... for reference it doesn't look that hard to do something like this (which works for my application, I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):

final static private Pattern splitSearchPattern = Pattern.compile("[\",]"); 
private List<String> splitByCommasNotInQuotes(String s) {
 if (s == null)
  return Collections.emptyList();

 List<String> list = new ArrayList<String>();
 Matcher m = splitSearchPattern.matcher(s);
 int pos = 0;
 boolean quoteMode = false;
 while (m.find())
 {
  String sep = m.group();
  if ("\"".equals(sep))
  {
   quoteMode = !quoteMode;
  }
  else if (!quoteMode && ",".equals(sep))
  {
   int toPos = m.start(); 
   list.add(s.substring(pos, toPos));
   pos = m.end();
  }
 }
 if (pos < s.length())
  list.add(s.substring(pos));
 return list;
}

(exercise for the reader: extend to handling escaped quotes by looking for backslashes also.)
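
One possible way to approach that exercise, as a sketch rather than a tested extension: let the pattern also match a backslash together with the character after it, so that an escaped quote is consumed as a pair and never toggles quoteMode. The class name is hypothetical, and the escapes are left in the returned tokens rather than being unescaped.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteSplitSketch {
    // Matches a backslash escape pair (any character after a backslash),
    // or a single quote, or a single comma. The escape pair is tried first.
    private static final Pattern TOKEN = Pattern.compile("\\\\.|[\",]", Pattern.DOTALL);

    static List<String> splitByCommasNotInQuotes(String s) {
        if (s == null) {
            return Collections.emptyList();
        }
        List<String> list = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        int pos = 0;
        boolean quoteMode = false;
        while (m.find()) {
            String sep = m.group();
            if ("\"".equals(sep)) {
                quoteMode = !quoteMode;
            } else if (!quoteMode && ",".equals(sep)) {
                list.add(s.substring(pos, m.start()));
                pos = m.end();
            }
            // Escape pairs such as \" or \\ match neither branch and are skipped.
        }
        if (pos < s.length()) {
            list.add(s.substring(pos));
        }
        return list;
    }
}
```

For example, the input a,b="x\",y",c (written in Java source as "a,b=\"x\\\",y\",c") is split into a, b="x\",y" and c.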

Jason S
A: 

The answer given by Bart K. works for parsing CSV files which have commas embedded in fields enclosed in double quote marks - specifically:

String[] tokens = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");

Carl R.
+1  A: 

While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regard to maintainability, e.g.:

String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
    if (input.charAt(current) == '\"') inQuotes = !inQuotes; // toggle state
    boolean atLastChar = (current == input.length() - 1);
    if(atLastChar) result.add(input.substring(start));
    else if (input.charAt(current) == ',' && !inQuotes) {
        result.add(input.substring(start, current));
        start = current + 1;
    }
}

If you don't care about preserving the commas inside the quotes you could simplify this approach (no handling of start index, no last character special case) by replacing your commas in quotes by something else and then split at commas:

String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
    char currentChar = builder.charAt(currentIndex);
    if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
    if (currentChar == ',' && inQuotes) {
        builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
    }
}
List<String> result = Arrays.asList(builder.toString().split(","));
Fabian Steeg
A: 

String[] tokens = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");

Here I don't understand how the $ is being used. Can someone explain this?

Dinu John
Please don't post an answer if you have a comment or question.
Jason S