ansaurus

Question

How to split but ignore separators in quoted strings, in python?

Answer 1

+1 A:

This regex will do that: (?:^|;)("(?:[^"]+|"")*"|[^;]*)
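
As a rough sketch of how this pattern might be applied (the helper name `split_fields` and the sample input are made up for illustration), `re.findall` returns the contents of the single capture group for each match. Note that the pattern only understands double quotes, with `""` as an escaped quote:

```python
import re

# Pattern from the answer above: a field is either a double-quoted string
# (with "" as an escaped quote) or a run of non-semicolon characters.
FIELD = re.compile(r'(?:^|;)("(?:[^"]+|"")*"|[^;]*)')

def split_fields(text):
    # Hypothetical helper: findall returns the one capture group per match.
    return FIELD.findall(text)

fields = split_fields('part 1;"this is ; part 2;";part 3')
print(fields)
```

Single-quoted fields are not protected by this pattern, so a semicolon inside `'...'` would still split the field.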

drewk 2010-05-07 02:18:19

You'll want to add another option for single quotes as well.

Amber 2010-05-07 02:25:36

Which will then break, unless you can use backreferences in python's `re` module (which don't appear documented). The second you support both types of quotes, you could potentially match this `"quoted'` vs `"quoted' single quote"`

xyld 2010-05-07 02:28:03

Also see http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns

killdash10 2010-05-07 02:29:52

@xyld: Python's `re` module does support backreferences. @killdash10: That's irrelevant. The OP is not trying to parse nested patterns.
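
For illustration, a minimal sketch of a backreference in Python's `re` (the sample text and the name `QUOTED` are invented here): the group captures whichever quote opened the string, and `\1` requires the same character to close it, so the other quote type can appear inside.

```python
import re

# (["']) captures the opening quote; \1 requires the same character to close.
QUOTED = re.compile(r'''(["']).*?\1''')

text = '''say "it's" and 'a "b" c' here'''
quoted = [m.group(0) for m in QUOTED.finditer(text)]
print(quoted)
```

This sketch ignores backslash escapes; it only shows that the closing quote is tied to the opening one.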

Max Shawabkeh 2010-05-07 02:31:06

@killdash10 exactly, but with backreferences in Perl, you can do it ;) Breaks the whole pumping lemma, DFA/NFA thing because the regular expression has state, very small/limited state, but state nonetheless

xyld 2010-05-07 02:32:08

That won't work if you have escaped quotes inside a string. Think `"s\"r\\\"g\\\"\""`. I think regex is the wrong approach here because regular expressions can't count and can't recurse. Regular expressions can't jump, if you will.

wilhelmtell 2010-05-07 02:32:35

@max they didn't look documented? Can you post a link?

xyld 2010-05-07 02:32:52

Also: fails on the following string: `'''part 1;"this is ';' part 2;";'this is "part" 3';part 4'''`

Amber 2010-05-07 02:33:48

@xyld: See the explanation of `(...)` here: http://docs.python.org/library/re.html#regular-expression-syntax

Max Shawabkeh 2010-05-07 02:36:27

Well sure enough, then it's possible with a `re.findall()`, but definitely not with **one** regex search across the string... You can search it multiple times with one regex and do it. I don't know of any other way to do this efficiently in Python.

xyld 2010-05-07 02:38:40

Answer 2

+2 A:

While it could be done with PCRE via lookaheads/lookbehinds/backreferences, it's not really a task that regex is designed for, due to the need to match balanced pairs of quotes.

Instead it's probably best to just make a mini state machine and parse through the string like that.

Edit

As it turns out, due to the handy additional feature of Python re.findall which guarantees non-overlapping matches, this can be more straightforward to do with a regex in Python than it might otherwise be. See comments for details.

However, if you're curious about what a non-regex implementation might look like:

x = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""

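results = [[]]
quote = None  # the quote character we are currently inside, or None
for c in x:
  if c == "'" or c == '"':
    if c == quote:
      quote = None  # closing quote
    elif quote is None:
      quote = c  # opening quote
  elif c == ';':
    if quote is None:
      results.append([])  # unquoted separator: start a new field
      continue
  results[-1].append(c)

results = [''.join(chars) for chars in results]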

# results = ['part 1', '"this is ; part 2;"', "'this is ; part 3'",
#            'part 4', 'this "is ; part" 5']
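
The loop above does not treat a backslash specially, which is the escaped-quote case raised in the comments on Answer 1. A possible extension, sketched under the assumption that a backslash escapes the next character everywhere (the function name `split_outside_quotes` is made up):

```python
def split_outside_quotes(text, sep=';'):
    fields, current = [], []
    quote = None     # the quote character we are inside, or None
    escaped = False  # True when the previous character was a backslash
    for ch in text:
        if escaped:
            escaped = False  # this character is taken literally
        elif ch == '\\':
            escaped = True   # assumption: backslash escapes the next char
        elif quote:
            if ch == quote:
                quote = None  # closing quote
        elif ch in '\'"':
            quote = ch        # opening quote
        elif ch == sep:
            fields.append(''.join(current))
            current = []
            continue          # drop the separator itself
        current.append(ch)
    fields.append(''.join(current))
    return fields

print(split_outside_quotes(r'a;"b\";c";d'))
```

Quotes and backslashes are kept in the output, as in the original loop; stripping them would be a separate post-processing step.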

Amber 2010-05-07 02:18:30

The question does not require balancing at all - just enclosing and single-character escaping. It's a pretty straightforward (and actually formally regular) pattern.

Max Shawabkeh 2010-05-07 02:38:49

Actually, the only reason `findall` works is due to the additional restriction implemented in Python that the returned matches be *non-overlapping*. Otherwise, a string like `'''part 1;"this 'is' sparta";part 2'''` would fail due to the pattern also matching a subset of the string.

Amber 2010-05-07 02:45:29

I'm using `findall` because we need to extract the string. Formally, regular expressions only do matching. To match, we can simply use `^mypattern(;mypattern)*$`.
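
A minimal sketch of that matching-only form, assuming a simple field pattern without backslash escapes (the names `FIELD`, `LINE` and `is_well_formed` are illustrative, not from the thread):

```python
import re

# One field: any mix of plain characters and complete quoted strings.
FIELD = r'''(?:[^;'"]|'[^']*'|"[^"]*")*'''
# The whole line: a field, then zero or more ";field" repetitions.
LINE = re.compile('^' + FIELD + '(?:;' + FIELD + ')*$')

def is_well_formed(text):
    # True when every quote in the text is closed.
    return LINE.match(text) is not None

print(is_well_formed('part 1;"this is ; part 2;";part 3'))
print(is_well_formed('part 1;"unterminated'))
```

This only answers whether the string fits the format; it does not hand back the individual fields.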

Max Shawabkeh 2010-05-07 02:48:19

However, doing so gives up, as you point out, the ability to extract the text in a nice manner (though I suppose you could iterate through an indefinite number of captures).

Amber 2010-05-07 02:51:06

Oh, yours is much nicer than mine. :)

Ipsquiggle 2010-05-07 03:12:29

Answer 3

+2 A:

>>> x = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
>>> import re
>>> re.findall(r'''(?:[^;'"]+|'(?:[^']|\\.)*'|"(?:[^"]|\\.)*")+''', x)
['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']

Max Shawabkeh 2010-05-07 02:30:20

Fails on the following string: `'''part 1;"this is ';' part 2;";'this is ; part 3';part 4'''`

Amber 2010-05-07 02:33:15

Right. Fixed. Forgot to swap the single/double quotes in the second part.

Max Shawabkeh 2010-05-07 02:34:36

I'm sorry, I missed something in my test case. See part 5 in my question. Thanks

Sly 2010-05-07 02:51:06

Your 5th test case is probably going to render this solution much less viable.

Amber 2010-05-07 02:55:20

Ok, I really just want to ignore semicolons inside quotes. I don't want quotes to act as separators.

Sly 2010-05-07 02:56:58

@Sly: Updated to support #5.

Max Shawabkeh 2010-05-07 02:58:41

Answer 4

A:

This seemed to me a semi-elegant solution.

New Solution:

import re
reg = re.compile('(\'|").*?\\1')
pp = re.compile('.*?;')
def splitter(string):
    #add a last semicolon
    string += ';'
    replaces = []
    s = string
    i = 1
    #replace the content of each quote for a code
    for quote in reg.finditer(string):
        out = string[quote.start():quote.end()]
        s = s.replace(out, '**' + str(i) + '**')
        replaces.append(out)
        i+=1
    #split the string without quotes
    res = pp.findall(s)

    #add the quotes again
    #TODO this part could be faster
    #(linear instead of quadratic)
    i = 1
    for replace in replaces:
        for x in range(len(res)):
            res[x] = res[x].replace('**' + str(i) + '**', replace)
        i+=1
    return res

Old solution:

I chose to check for an opening quote, wait for it to close, and then match a terminating semicolon. Each "part" you want to match needs to end in a semicolon, so this matches things like:

'foobar;.sska';
"akjshd;asjkdhkj..,";
asdkjhakjhajsd.jhdf;

Code:

mm = re.compile('''((?P<quote>'|")?.*?(?(quote)\\2|);)''')
res = mm.findall('''part 1;"this is ; part 2;";'this is ; part 3';part 4''')

You may have to do some postprocessing on res, but it contains what you want.

noinflection 2010-05-07 02:56:24

Answer 5

A:

Even though I'm certain there is a clean regex solution (so far I like @noinflection's answer), here is a quick-and-dirty non-regex answer.

s = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""

inQuotes = False
current = ""
results = []
currentQuote = ""
for c in s:
    if not inQuotes and c == ";":
        results.append(current)
        current = ""
    elif not inQuotes and (c == '"' or c == "'"):
        currentQuote = c
        inQuotes = True
    elif inQuotes and c == currentQuote:
        currentQuote = ""
        inQuotes = False
    else:
        current += c

results.append(current)

print(results)
# ['part 1', 'this is ; part 2;', 'this is ; part 3', 'part 4', 'this is ; part 5']

(I've never put together something of this sort, feel free to critique my form!)

Ipsquiggle 2010-05-07 03:11:00

Answer 6

+2 A:

You appear to have a semicolon-separated string. Why not use the csv module to do all the hard work?

Off the top of my head, this should work

import csv 
from StringIO import StringIO 

line = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''

data = StringIO(line) 
reader = csv.reader(data, delimiter=';') 
for row in reader: 
    print row

This should give you something like
("part 1", "this is ; part 2;", 'this is ; part 3', "part 4", "this \"is ; part\" 5")

Edit:
Unfortunately, this doesn't quite work (even if you do use StringIO, as I intended), due to the mixed string quotes (both single and double). What you actually get is

['part 1', 'this is ; part 2;', "'this is ", " part 3'", 'part 4', 'this "is ', ' part" 5'].

If you can change the data to only contain single or double quotes at the appropriate places, it should work fine, but that sort of negates the question a bit.

Simon Callan 2010-05-07 06:22:48

+1: csv.reader takes an iterable, so you need to wrap the input string in a list: `csv.reader([data], delimiter=';')`. Apart from that it does exactly what the user wants. This will also handle embedded quotes characters prefixed with a backslash.

Dave Kirby 2010-05-07 06:35:51

Actually, the csv module isn't that smart; it didn't work when I tested it. The data has both single and double quotes, and the csv module cannot handle `this "is ; part" 5` as a single block, which results in `['part 1', 'this is ; part 2;', "'this is ", " part 3'", 'part 4', 'this "is ', ' part" 5']`

S.Mark 2010-05-07 06:38:30

The csv module not only doesn't handle more than one quote type, but it also insists that fields are entirely quoted or not quoted at all. That means part 5 will be split in two, because a double quote in the middle of a field is just a literal and does not quote the content. I'm afraid in this case the options are (a) use an excessively complex regular expression, or (b) get the format of the input data changed to use some recognisable variant of CSV. If it was me I'd go for option (b).
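
To make the two limitations concrete, here is a small sketch (the helper `parse_line` and the shortened "clean" input are made up for illustration). A line whose quoted fields are entirely wrapped in one quote type parses as hoped, while the mixed line from the question comes back in seven pieces:

```python
import csv

def parse_line(line, quotechar='"'):
    # csv.reader accepts any iterable of lines, so a one-element list works.
    return next(csv.reader([line], delimiter=';', quotechar=quotechar))

# Fields are either fully double-quoted or not quoted at all.
clean = parse_line('part 1;"this is ; part 2;";part 3')

# Mixed quote types and a quote in the middle of a field.
mixed = parse_line('''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5''')

print(clean)
print(len(mixed))
```

Passing `quotechar="'"` instead would protect the single-quoted field but not the double-quoted ones, since the reader takes only one quote character.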

Duncan 2010-05-07 07:48:05

Answer 7

A:

My approach is to replace all non-quoted occurrences of the semi-colon with another character which will never appear in the text, then split on that character. The following code uses the re.sub function with a function argument to search and replace all occurrences of a srch string, not enclosed in single or double quotes or parens, brackets or braces, with a repl string:

import re

def srchrepl(srch, repl, string):
    """
    Replace non-bracketed/quoted occurrences of srch with repl in string.
    """
    resrchrepl = re.compile(r"""(?P<lbrkt>[([{])|(?P<quote>['"])|(?P<sep>["""
                          + srch + r"""])|(?P<rbrkt>[)\]}])""")
    return resrchrepl.sub(_subfact(repl), string)


def _subfact(repl):
    """
    Replacement function factory for regex sub method in srchrepl.
    """
    level = 0
    qtflags = 0
    def subf(mo):
        nonlocal level, qtflags
        sepfound = mo.group('sep')
        if  sepfound:
            if level == 0 and qtflags == 0:
                return repl
            else:
                return mo.group(0)
        elif mo.group('lbrkt'):
            if qtflags == 0:
                level += 1
            return mo.group(0)
        elif mo.group('quote') == "'":
            qtflags ^= 1            # toggle bit 1
            return "'"
        elif mo.group('quote') == '"':
            qtflags ^= 2            # toggle bit 2
            return '"'
        elif mo.group('rbrkt'):
            if qtflags == 0:
                level -= 1
            return mo.group(0)
    return subf

If you don't care about the bracketed characters, you can simplify this code a lot.
Say you wanted to use a pipe or vertical bar as the substitute character, you would do:

mylist = srchrepl(';', '|', mytext).split('|')

BTW, this uses nonlocal, which requires Python 3. On Python 2 you would need to make level and qtflags module-level variables and use global instead.
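
For the simplification mentioned above, one possible sketch when brackets don't matter. This is a different technique from the code above: it matches whole quoted strings so that separators inside them are never seen on their own. The name `replace_unquoted` is invented, and the substitute character is assumed not to occur in the text:

```python
import re

def replace_unquoted(sep, repl, text):
    # Match whole quoted strings first so separators inside them are consumed
    # as part of the quoted token; otherwise match a bare separator.
    pattern = re.compile(r'''"[^"]*"|'[^']*'|''' + re.escape(sep))

    def substitute(match):
        token = match.group(0)
        # Only a bare separator is replaced; quoted strings pass through.
        return repl if token == sep else token

    return pattern.sub(substitute, text)

mytext = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
mylist = replace_unquoted(';', '|', mytext).split('|')
print(mylist)
```

Because no state is kept between matches, this version needs neither nonlocal nor global.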

Don O'Donnell 2010-05-07 06:26:59

Answer 8

+5 A:

Most of the answers seem massively overcomplicated. You don't need backreferences. You don't need to depend on whether or not re.findall gives overlapping matches. Given that the input cannot be parsed with the csv module, a regular expression is pretty much the only way to go, and all you need is to call re.split with a pattern that matches a field.

Note that it is much easier here to match a field than it is to match a separator:

import re
data = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
PATTERN = re.compile(r'''((?:[^;"']|"[^"]*"|'[^']*')+)''')
print(PATTERN.split(data)[1::2])

and the output is:

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']

Duncan 2010-05-07 07:59:49

+1, I like this one, quite clean and makes sense to me.

S.Mark 2010-05-07 11:06:14

oh btw, `[^;"']+` would be better than `([^;"']...)+` I think

S.Mark 2010-05-07 11:10:14

I don't think that `[^;"']+` helps. You still need the + outside the group to handle something that is a mix of ordinary characters and quoted elements. Elements which can repeat and themselves contain repeats are a great way to kill regular expression matching so should be avoided when possible.

Duncan 2010-05-07 14:42:33

Answer 9

+1 A:

re.split(''';(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', data)

Alan Moore 2010-05-07 10:57:57

Answer 10

+3 A:

Here is an annotated pyparsing approach:

from pyparsing import (printables, originalTextFor, OneOrMore, 
    quotedString, Word, delimitedList)

# unquoted words can contain anything but a semicolon
printables_less_semicolon = printables.replace(';','')

# capture content between ';'s, and preserve original text
content = originalTextFor(
    OneOrMore(quotedString | Word(printables_less_semicolon)))

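# process the string (test is assumed to be the sample string from the question)
test = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
print(delimitedList(content, ';').parseString(test))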

giving

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 
 'this "is ; part" 5']

By using pyparsing's provided quotedString, you also get support for escaped quotes.

You were also unclear about how to handle whitespace before or after a semicolon delimiter, and none of the fields in your sample text have any. Pyparsing would parse "a; b ; c" as:

['a', 'b', 'c']

Paul McGuire 2010-05-07 12:44:35

+1 I was about to post a pyparsing solution but yours is more elegant

Luper Rouch 2010-05-07 12:56:46

Answer 11

+1 A:

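Since your data does not contain '\n', use it to replace any ';' that is not inside a quoted string:

>>> s = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
>>> new_s = ''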
>>> is_open = False

>>> for c in s:
...     if c == ';' and not is_open:
...         c = '\n'
...     elif c in ('"',"'"):
...         is_open = not is_open
...     new_s += c

>>> result = new_s.split('\n')

>>> result
['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']

remosu 2010-05-07 13:26:29

Clean and simple. Since it's just a simple string, no need to worry about efficiency. To handle nested quotes, may need to tweak the elif statement.

Dingle 2010-05-07 20:52:15

Answer 12

A:

I am new on the stack and hence not able to comment, so I have to use this route:

The top answer fails to parse empty fields (note the two consecutive semicolons): part 1;;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5

Please fix your solution.
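
One way to keep empty fields, sketched here by reusing the separator-matching split from Answer 9 rather than the field-matching pattern from Answer 8 (the variable names are made up; neither pattern handles backslash escapes):

```python
import re

data = '''part 1;;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''

# Field-matching pattern from Answer 8: an empty field never matches,
# so it silently disappears from the result.
FIELD = re.compile(r'''((?:[^;"']|"[^"]*"|'[^']*')+)''')
without_empty = FIELD.split(data)[1::2]

# Separator-matching pattern from Answer 9: split on a ';' only when the rest
# of the string consists of plain characters and complete quoted strings.
SEPARATOR = re.compile(r''';(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''')
with_empty = SEPARATOR.split(data)

print(without_empty)
print(with_empty)
```

The lookahead rescans the remainder of the string for every semicolon, so this is likely to be slow on very long inputs.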

2010-06-14 12:12:00
