Given a string like this:

a,"string, with",various,"values, and some",quoted

What is a good algorithm to split this based on commas while ignoring the commas inside the quoted sections?

The output should be an array:

[ "a", "string, with", "various", "values, and some", "quoted" ]

+1  A: 

What if an odd number of quotes appear in the original string?

dulaneyb
That would imply an improperly quoted input string, so throwing an exception would be an option.
J c
A: 

Of course using a CSV parser is better but just for the fun of it you could:

Loop on the string letter by letter.
    If current_letter == quote : 
        toggle inside_quote variable.
    Else if (current_letter ==comma and not inside_quote) : 
        push current_word into array and clear current_word.
    Else 
        append the current_letter to current_word
When the loop is done push the current_word into array
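
A minimal Python sketch of the loop above, for illustration only. The function name and the choice to raise ValueError on an odd number of quotes are assumptions, not part of the pseudocode; escaped quotes are not handled.

```python
def split_outside_quotes(text, delimiter=",", quote='"'):
    """Split text on delimiter, ignoring delimiters between quote characters."""
    fields = []
    current = []
    inside_quote = False
    for ch in text:
        if ch == quote:
            # Toggle the state; the quote character itself is dropped.
            inside_quote = not inside_quote
        elif ch == delimiter and not inside_quote:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    if inside_quote:
        # An odd number of quotes means the input is improperly quoted.
        raise ValueError("unbalanced quote in input")
    fields.append("".join(current))
    return fields


print(split_outside_quotes('a,"string, with",various,"values, and some",quoted'))
# ['a', 'string, with', 'various', 'values, and some', 'quoted']
```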
Pat
+1  A: 

The author here dropped in a blob of C# code that handles the scenario you're having a problem with:

http://stackoverflow.com/questions/1898/csv-file-imports-in-net#6111

Shouldn't be too difficult to translate.

Kev
+1  A: 

If my language of choice didn't offer a way to do this without thinking, then I would initially consider two options as the easy way out:

  1. Pre-parse and replace the commas within the string with another control character then split them, followed by a post-parse on the array to replace the control character used previously with the commas.

  2. Alternatively, split them on the commas, then post-parse the resulting array into another array, checking for leading quotes on each array entry and concatenating the entries until I reached a terminating quote.
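
The second option could look roughly like this in Python. This is a sketch under stated assumptions: the name split_then_merge is made up for illustration, quotes are assumed to sit directly against the commas (no surrounding whitespace), and escaped quotes are not handled.

```python
def split_then_merge(text):
    """Naive split on commas, then re-join pieces that belong to one quoted field."""
    fields = []
    pending = None  # pieces of a quoted field collected so far, or None
    for piece in text.split(","):
        if pending is not None:
            # We are inside a quoted field: keep collecting pieces.
            pending.append(piece)
            if piece.endswith('"'):
                # Terminating quote reached: restore the commas, strip outer quotes.
                fields.append(",".join(pending)[1:-1])
                pending = None
        elif piece.startswith('"') and not (len(piece) > 1 and piece.endswith('"')):
            # Leading quote without a closing one: a quoted field begins here.
            pending = [piece]
        elif piece.startswith('"'):
            # Quoted field with no embedded comma.
            fields.append(piece[1:-1])
        else:
            fields.append(piece)
    if pending is not None:
        raise ValueError("unterminated quoted field")
    return fields


print(split_then_merge('a,"string, with",various,"values, and some",quoted'))
# ['a', 'string, with', 'various', 'values, and some', 'quoted']
```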

These are hacks however, and if this is a pure 'mental' exercise then I suspect they are unhelpful. If this is a real world problem then it would help to know the language so that we could offer some specific advice.

Martin
A: 

I use this to parse strings, not sure if it helps here; but with some minor modifications perhaps?

function getstringbetween($string, $start, $end){
$string = " ".$string;
$ini = strpos($string,$start);
if ($ini == 0) return "";
$ini += strlen($start);
$len = strpos($string,$end,$ini) - $ini;
return substr($string,$ini,$len);
}

$fullstring = "this is my [tag]dog[/tag]";
$parsed = getstringbetween($fullstring, "[tag]", "[/tag]");

echo $parsed; // (result = dog)

/mp

mauriciopastrana
A: 

Here's a simple algorithm:

  1. Determine if the string begins with a '"' character
  2. Split the string into an array delimited by the '"' character.
  3. Mark the quoted commas with a placeholder #COMMA#
    • If the input starts with a '"', mark those items in the array where the index % 2 == 0
    • Otherwise mark those items in the array where the index % 2 == 1
  4. Concatenate the items in the array to form a modified input string.
  5. Split the string into an array delimited by the ',' character.
  6. Replace all instances in the array of #COMMA# placeholders with the ',' character.
  7. The array is your output.

Here's the Python implementation:
(fixed to handle '"a,b",c,"d,e,f,h","i,j,k"')

def parse_input(input):
    # Quoted pieces sit at even indices if the input starts with a quote,
    # at odd indices otherwise.
    quote_mod = int(not input.startswith('"'))

    # Split on quotes and drop empty pieces.
    input = [item for item in input.split('"') if item != '']
    for i in range(len(input)):
        if i % 2 == quote_mod:
            input[i] = input[i].replace(",", "#COMMA#")

    # Split on the remaining (unquoted) commas and drop empty fields.
    input = [item for item in "".join(input).split(",") if item != '']
    for i in range(len(input)):
        input[i] = input[i].replace("#COMMA#", ",")
    return input

# parse_input('a,"string, with",various,"values, and some",quoted')
# -> ['a', 'string, with', 'various', 'values, and some', 'quoted']
# parse_input('"a,b",c,"d,e,f,h","i,j,k"')
# -> ['a,b', 'c', 'd,e,f,h', 'i,j,k']
Justin Standard
A: 

This is a standard CSV-style parse. A lot of people try to do this with regular expressions. You can get to about 90% with regexes, but you really need a real CSV parser to do it properly. I found a fast, excellent C# CSV parser on CodeProject a few months ago that I highly recommend!

Simon Gillbee
There's also one in the .NET framework of course. Even though it's in Microsoft.VisualBasic you could still use it from C#. http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx
MarkJ
Thanks! As a C# guy, I always forget that there are a bunch of useful VB libraries out there that I can use. Honestly, I think they are poorly named, since they are not really VB. They are just .NET.
Simon Gillbee
A: 

Justin, that does not handle a string like this:

input = '"a,b",c,"d,e,f,h","i,j,k"'

A: 
# Here's one in pseudocode (a.k.a. Python) in one pass :-P

def parsecsv(instr):
    i = 0
    j = 0

    outstrs = []

    # i is fixed until a match occurs, then it advances
    # up to j. j inches forward each time through:

    while i < len(instr):

        if j < len(instr) and instr[j] == '"':
            # skip the opening quote...
            j += 1
            # then iterate until we find a closing quote.
            while j < len(instr) and instr[j] != '"':
                j += 1
            if j == len(instr):
                raise Exception("Unmatched double quote at end of input.")

        if j == len(instr) or instr[j] == ',':
            s = instr[i:j]  # get the substring we've found
            s = s.strip()   # remove extra whitespace

            # remove surrounding quotes if they're there
            if len(s) >= 2 and s[0] == '"' and s[-1] == '"':
                s = s[1:-1]

            # add it to the result
            outstrs.append(s)

            # skip over the comma, move i up (to where
            # j will be at the end of the iteration)
            i = j + 1

        j = j + 1

    return outstrs

def testcase(instr, expected):
    outstr = parsecsv(instr)
    print(outstr)
    assert expected == outstr

# Doesn't handle things like '1, 2, "a, b, c" d, 2' or
# escaped quotes, but those can be added pretty easily.

testcase('a, b, "1, 2, 3", c', ['a', 'b', '1, 2, 3', 'c'])
testcase('a,b,"1, 2, 3" , c', ['a', 'b', '1, 2, 3', 'c'])

# odd number of quotes gives an "unmatched quote" exception
#testcase('a,b,"1, 2, 3" , "c', ['a', 'b', '1, 2, 3', 'c'])
Brian Jorgensen
+17  A: 

Looks like you've got some good answers here.

For those of you looking to handle your own CSV file parsing, heed the advice from the experts and Don't roll your own CSV parser.

Your first thought is, "I need to handle commas inside of quotes."

Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."

It's a road to madness. Don't write your own. Find a library with extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free FileHelpers library.
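
As one illustration of letting a library deal with the hard parts, here is a small sketch using Python's standard csv module (Python is chosen here only as an example language; the sample strings are made up). With the default dialect, a doubled quote inside a quoted field is read as one literal quote.

```python
import csv

# csv.reader accepts any iterable of lines, so a one-element list works.
line = 'a,"say ""hi"", please",b'
row = next(csv.reader([line]))
print(row)  # ['a', 'say "hi", please', 'b']

# The string from the question parses the same way.
question = 'a,"string, with",various,"values, and some",quoted'
fields = next(csv.reader([question]))
print(fields)  # ['a', 'string, with', 'various', 'values, and some', 'quoted']
```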

Judah Himango
+10 if it would let me :)
MarkJ
a great link on secretgeek - very amusing. but it only answers the question for those using .NET sadly.
Magnus Smith
True; though the advice applies to devs everywhere: don't roll your own CSV parser. Ruby has one built-in, and there are libraries out there for Python, C++, most any widely used language.
Judah Himango
And although SecretGeek doesn't seem to know it, there's one built in to VB.NET too. http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx
MarkJ
A: 

I just couldn't resist seeing if I could make it work in a Python one-liner:

import re
arr = [i.replace("|", ",") for i in re.sub(r'"([^"]*),([^"]*)"', r"\g<1>|\g<2>", str_to_test).split(",")]

Returns ['a', 'string, with', 'various', 'values, and some', 'quoted']

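It works by first replacing the ',' inside quotes with another separator (|), splitting the string on ',' and then replacing the | separator with ',' again. Note that the pattern swaps only one comma per quoted section, so a quoted field containing two or more commas is still split.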
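
A variant that avoids the placeholder character altogether is to split only on commas that are followed by an even number of quotes. This is a sketch of a different regex technique, not the one-liner above; the function name is made up, and it assumes balanced quotes with no escaped quotes inside fields.

```python
import re


def split_with_lookahead(text):
    # Split on a comma only when an even number of quotes follows it,
    # i.e. when the comma is not inside a quoted section.
    parts = re.split(r',(?=(?:[^"]*"[^"]*")*[^"]*$)', text)
    # Remove the quotes that surround quoted fields.
    return [p.strip('"') for p in parts]


print(split_with_lookahead('a,"string, with",various,"values, and some",quoted'))
# ['a', 'string, with', 'various', 'values, and some', 'quoted']
```

Because no placeholder is substituted, a | in the input passes through untouched, and quoted fields may contain any number of commas.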

PabloG
How do you know there aren't any | in the original string? What about escaping quotes inside quoted strings?
MarkJ
+4  A: 

Python:

import csv
with open("some.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)
Mark Harrison
I consider this the best answer. It is exactly what I need!
Alex. S.
+1  A: 

What if an odd number of quotes appear in the original string?

This looks uncannily like CSV parsing, which has some peculiarities in how it handles quoted fields. A field is only treated as quoted if the whole field is delimited with double quotes, so:

field1, "field2, field3", field4, "field5, field6" field7

becomes

field1

field2, field3

field4

"field5

field6" field7

Notice if it doesn't both start and end with a quotation, then it's not a quoted field and the double quotes are simply treated as double quotes.

Incidentally, my code that someone linked to doesn't actually handle this correctly, if I recall correctly.

ICR
A: 

Here's a simple Python implementation based on Pat's pseudocode:

def splitIgnoringSingleQuote(string, split_char, remove_quotes=False):
    string_split = []
    current_word = ""
    inside_quote = False
    for letter in string:
        if letter == "'":
            # keep the quote character unless asked to strip it
            if not remove_quotes:
                current_word += letter
            inside_quote = not inside_quote
        elif letter == split_char and not inside_quote:
            string_split.append(current_word)
            current_word = ""
        else:
            current_word += letter
    string_split.append(current_word)
    return string_split
Greg Bayer