For example, I want to split

str = '"a,b,c",d,e,f'

into

["a,b,c",'d','e','f']

(i.e. don't split the quoted part). In this case, this can be done with

re.findall('".*?"|[^,]+',str)

However, if

str = '"a,,b,c",d,,f'

I want

["a,,b,c",'d','','f']

i.e. I want behavior like Python's split function, which keeps empty fields. Is there any way I can do this in one (small) line, possibly using Python's re library?
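
For reference, here is a quick check of how the first pattern and plain split behave on the second input. The variable names are only for illustration.

```python
import re

s = '"a,,b,c",d,,f'

# The pattern above needs at least one character per unquoted field,
# so the empty field between the two commas is silently dropped.
with_findall = re.findall('".*?"|[^,]+', s)
print(with_findall)   # ['"a,,b,c"', 'd', 'f']

# Plain split keeps the empty field but also cuts inside the quotes.
with_split = s.split(',')
print(with_split)     # ['"a', '', 'b', 'c"', 'd', '', 'f']
```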

Actually, I just realized (on this site) that the csv module is perfect for what I want to do, but I am curious whether there is a regular expression that re can use to do it as well.

+2  A: 

Use the csv module as it is a real parser. Regular expressions are nonoptimal (or completely unsuited) for most things involving matching delimiters in which the rules change (I'm unsure as to whether this particular grammar is regular or not). You might be able to create a regex that would work in this case, but it would be rather complex (especially dealing with cases like "He said, \"How are you\"").
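
A minimal sketch of the csv approach, assuming each input is a single line. The helper name split_csv_line is made up for this example. Note that csv strips the surrounding quotes from quoted fields.

```python
import csv

def split_csv_line(line, **fmtparams):
    """Parse one line with the standard csv module and return its fields."""
    # csv.reader accepts any iterable of lines, so a one-element list works.
    return next(csv.reader([line], **fmtparams))

print(split_csv_line('"a,b,c",d,e,f'))   # ['a,b,c', 'd', 'e', 'f']
print(split_csv_line('"a,,b,c",d,,f'))   # ['a,,b,c', 'd', '', 'f']

# Backslash-escaped quotes are not the default convention; they need the
# escapechar and doublequote format parameters.
escaped = split_csv_line(r'"He said, \"How are you\"",x',
                         escapechar='\\', doublequote=False)
print(escaped)
```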

Ben Hughes
The regular expression at http://www.codeguru.com/columns/dotnettips/article.php/c8153 seems to work for this case, but I haven't tested it extensively. CSV is right about at the point where I start to get uneasy with using regular expressions.
Ben Hughes
+1  A: 

Writing a state machine for this, on the other hand, would be quite straightforward. DFAs and (formal) regular expressions have the same expressive power, but usually one of them is better suited to the problem at hand, and which one often depends on the additional logic you need to implement.
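
One possible shape for such a state machine, as a sketch rather than a reference implementation. The function name, the three state names, and the choice to keep the quote characters in the output are all assumptions made for this example. It also treats a backslash inside quotes as an escape.

```python
def split_fields(text, delimiter=",", quote='"', escape="\\"):
    """Split text on delimiter, ignoring delimiters inside quoted runs."""
    OUTSIDE, QUOTED, ESCAPED = range(3)
    state = OUTSIDE
    fields, current = [], []
    for ch in text:
        if state == OUTSIDE:
            if ch == delimiter:
                # End of a field: emit it, even if it is empty.
                fields.append("".join(current))
                current = []
            else:
                if ch == quote:
                    state = QUOTED
                current.append(ch)  # quote characters are kept in the output
        elif state == QUOTED:
            current.append(ch)
            if ch == escape:
                state = ESCAPED
            elif ch == quote:
                state = OUTSIDE
        else:
            # ESCAPED: the character after a backslash is taken literally.
            current.append(ch)
            state = QUOTED
    fields.append("".join(current))
    return fields

print(split_fields('"a,,b,c",d,,f'))   # ['"a,,b,c"', 'd', '', 'f']
```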

oggy
A: 

You can get close using non-greedy specifiers. The closest I've got is:

>>> re.findall('(".*?"|.*?)(?:,|$)', '"a,,b,c",d,,f')
['"a,,b,c"', 'd', '', 'f', '']

But as you see, you end up with a redundant empty string at the end, which is indistinguishable from the result you get when the string ends with a comma:

>>> re.findall('(".*?"|.*?)(?:,|$)', '"a,,b,c",d,,f,')
['"a,,b,c"', 'd', '', 'f', '']

so you'd need to do some manual tweaking at the end - something like:

regex = re.compile('(".*?"|.*?)(?:,|$)')
matches = regex.findall(s)
if not s.endswith(","): matches.pop()

or

matches = regex.findall(s+",")[:-1]

There's probably a better way.
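
Wrapped up as a function, the second variant might look like the sketch below. The names FIELD and split_quoted are invented for this example. The comparison against str.split holds for input without quotes.

```python
import re

FIELD = re.compile(r'(".*?"|.*?)(?:,|$)')

def split_quoted(s):
    """Split s on commas outside double quotes, keeping empty fields."""
    # Appending a comma makes every real field comma-terminated, so the only
    # spurious match is the final empty one, which the slice removes.
    return FIELD.findall(s + ",")[:-1]

print(split_quoted('"a,,b,c",d,,f'))          # ['"a,,b,c"', 'd', '', 'f']
print(split_quoted('a,b,') == 'a,b,'.split(','))   # True
```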

Brian
A: 

Here's a function that'll accomplish the task:

def smart_split(data, delimiter=","):
    """ Performs splitting with string preservation. This reads both single and
        double quoted strings. The quote characters themselves are dropped.
    """
    result = []
    quote_type = None  # the quote character we are currently inside, if any
    buffer = ""
    for char in data:
        if quote_type is not None:
            if char == quote_type:
                quote_type = None  # closing quote
            else:
                buffer += char  # delimiters inside quotes are kept as text
        elif char in ('"', "'"):
            quote_type = char  # opening quote
        elif char == delimiter:
            result.append(buffer)
            buffer = ""
        else:
            buffer += char
    # Iterating over the characters avoids the IndexError that manual index
    # bookkeeping can raise when the input ends with a closing quote.
    result.append(buffer)
    return result

Example of use:

str = '"a,b,c",d,e,f'
print(smart_split(str))
# Prints: ['a,b,c', 'd', 'e', 'f']
Evan Fosmark
+1  A: 
re.split(',(?=(?:[^"]*"[^"]*")*[^"]*$)', str)

After matching a comma, if there's an odd number of quotation marks up ahead, the comma must be inside a pair of quotation marks, so it doesn't count as a delimiter. Obviously this doesn't take the possibility of escaped quotation marks into account, but that can be handled if need be--it just makes the regex about twice as ugly as it already is. :D
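
The same pattern, spelled out with the verbose flag and run on both inputs from the question. The name DELIMITER is only for illustration.

```python
import re

# Split on a comma only when an even number of double quotes follows it.
DELIMITER = re.compile(r'''
    ,                       # a candidate delimiter
    (?=                     # ...that must be followed by
        (?:[^"]*"[^"]*")*   #   zero or more complete pairs of quotes
        [^"]*$              #   and then no further quotes up to the end
    )
''', re.VERBOSE)

print(DELIMITER.split('"a,b,c",d,e,f'))   # ['"a,b,c"', 'd', 'e', 'f']
print(DELIMITER.split('"a,,b,c",d,,f'))   # ['"a,,b,c"', 'd', '', 'f']
```

Because the lookahead rescans the rest of the string for every comma, this is likely to be slow on very long lines.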

Alan Moore
A: 

Here's a really short function that will do the same thing:

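def split(aString):
    # Pad with commas, then cut at the quotes: even-indexed pieces are outside
    # quotes, odd-indexed pieces are the quoted contents.
    splitByQuotes = (",%s," % aString).split('"')
    # Split only the unquoted pieces on commas, dropping the padding entries.
    splitByQuotes[0::2] = [x.split(",")[1:-1] for x in splitByQuotes[0::2]]
    # Flatten the mix of lists (unquoted) and strings (quoted) into one list.
    return [a.strip()
            for b in splitByQuotes
            for a in (b if isinstance(b, list) else [b])]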

It splits the string where the quotes are, creating a list where every even element is the stuff outside the quotes and every odd element is the stuff that was encapsulated within quotes. The stuff in quotes it leaves alone, the stuff outside it splits where the commas are. Now we have a list of alternating lists and strings, which we then unwrap with the last line. The reason for wrapping the string in commas at the beginning and removing commas in the middle is to prevent spare empty elements in the list. It should be able to handle whitespace - I added a strip() function at the end to make it produce clean output, but that's not necessary.

usage:

>>> print(split('c, , "a,,b,c",d,"moo","f"'))
['c', '', 'a,,b,c', 'd', 'moo', 'f']
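
To make the intermediate values concrete, here is the same sequence of steps on a shorter input, one step at a time. The variable names are chosen for this walkthrough only, and the final strip() is left out.

```python
line = 'c,"a,,b",d'

# Step 1: pad with commas and cut at the quotes.
pieces = (",%s," % line).split('"')
print(pieces)      # [',c,', 'a,,b', ',d,']

# Step 2: split only the even-indexed (unquoted) pieces on commas,
# dropping the empty strings produced by the padding commas.
pieces[0::2] = [p.split(",")[1:-1] for p in pieces[0::2]]
print(pieces)      # [['c'], 'a,,b', ['d']]

# Step 3: flatten the mix of lists and strings into one list of fields.
fields = [f for p in pieces for f in (p if isinstance(p, list) else [p])]
print(fields)      # ['c', 'a,,b', 'd']
```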
Markus
+1  A: 

Page 271 of Friedl's Mastering Regular Expressions has a regular expression for extracting possibly quoted CSV fields, but it requires a bit of postprocessing:

>>> re.findall('(?:^|,)(?:"((?:[^"]|"")*)"|([^",]*))',str)
[('a,b,c', ''), ('', 'd'), ('', 'e'), ('', 'f')]
>>> re.findall('(?:^|,)(?:"((?:[^"]|"")*)"|([^",]*))','"a,b,c",d,,f')
[('a,b,c', ''), ('', 'd'), ('', ''), ('', 'f')]

Same pattern with the verbose flag:

csv = re.compile(r"""
    (?:^|,)
    (?: # now match either a double-quoted field
        # (inside, paired double quotes are allowed)...
        " # (double-quoted field's opening quote)
          (    (?: [^"] | "" )*    )
        " # (double-quoted field's closing quote)
    |
      # ...or some non-quote/non-comma text...
        ( [^",]* )
    )""", re.X)
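
One way the postprocessing could look, as a sketch: pick whichever group matched and collapse doubled quotes in quoted fields. The names FIELD and parse_line are invented here. The pattern is the one above.

```python
import re

FIELD = re.compile(r'(?:^|,)(?:"((?:[^"]|"")*)"|([^",]*))')

def parse_line(line):
    """Return the fields of one CSV line, with quoting removed."""
    fields = []
    for m in FIELD.finditer(line):
        quoted, bare = m.group(1), m.group(2)
        if quoted is not None:
            # Quoted field: a doubled quote stands for one literal quote.
            fields.append(quoted.replace('""', '"'))
        else:
            fields.append(bare)
    return fields

print(parse_line('"a,b,c",d,,f'))      # ['a,b,c', 'd', '', 'f']
print(parse_line('"say ""hi""",x'))    # ['say "hi"', 'x']
```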
Greg Bacon