I need to split a string like this on semicolons, but I don't want to split on semicolons that are inside a quoted string (' or "). I'm not parsing a file; it's just a simple string with no line breaks.

part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5

Result should be:

  • part 1
  • "this is ; part 2"
  • 'this is ; part 3'
  • part 4
  • this "is ; part" 5

I suppose this can be done with a regex, but if not, I'm open to another approach.

+1  A: 

This regex will do that: (?:^|;)("(?:[^"]+|"")*"|[^;]*)

drewk
You'll want to add another option for single quotes as well.
Amber
Which will then break, unless you can use backreferences in python's `re` module (which don't appear documented). The second you support both types of quotes, you could potentially match this `"quoted'` vs `"quoted' single quote"`
xyld
Also see http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns
killdash10
@xyld: Python's `re` module does support backreferences. @killdash10: That's irrelevant. The OP is not trying to parse nested patterns.
Max Shawabkeh
@killdash10 exactly, but with backreferences in perl, you can do it ;) Breaks the whole pumping lemma, DFA/NFA thing because the regular expression has state, very small/limited state, but state none-the-less
xyld
That won't work if you have escaped quotes inside a string. Think `"s\"r\\\"g\\\"\""`. I think regex is the wrong approach here because regular expressions can't count and can't recurse. Regular expressions can't jump, if you will.
wilhelmtell
@max they didn't look documented? Can you post a link?
xyld
Also: fails on the following string: `'''part 1;"this is ';' part 2;";'this is "part" 3';part 4'''`
Amber
@xyld: See the explanation of `(...)` here: http://docs.python.org/library/re.html#regular-expression-syntax
Max Shawabkeh
Well, sure enough, then it's possible with a `re.findall()`, but definitely not with **one** regex search across the string... You can search it multiple times with one regex and do it. I don't know of a great way to do this any other way in Python and be efficient?
xyld
+2  A: 

While it could be done with PCRE via lookaheads/lookbehinds/backreferences, it's not really a task that regex is designed for, due to the need to match balanced pairs of quotes.

Instead it's probably best to just make a mini state machine and parse through the string like that.

Edit

As it turns out, due to the handy additional feature of Python re.findall which guarantees non-overlapping matches, this can be more straightforward to do with a regex in Python than it might otherwise be. See comments for details.

However, if you're curious about what a non-regex implementation might look like:

x = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""

results = [[]]
quote = None
for c in x:
  if c == "'" or c == '"':
    if c == quote:
      quote = None
    elif quote is None:
      quote = c
  elif c == ';':
    if quote is None:
      results.append([])
      continue
  results[-1].append(c)

results = [''.join(x) for x in results]

# results = ['part 1', '"this is ; part 2;"', "'this is ; part 3'",
#            'part 4', 'this "is ; part" 5']
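
A hedged sketch, not part of the original answer: the same state machine can be extended to honour backslash escapes, which is the case raised in the comments on the first answer. The function name `split_outside_quotes` and the `escaped` flag are illustrative assumptions, and the sketch assumes a backslash always makes the next character literal.

```python
def split_outside_quotes(text, sep=";"):
    """Split text on sep, ignoring separators inside '...' or "..."."""
    fields = [[]]
    quote = None      # the quote character currently open, if any
    escaped = False   # True when the previous character was a backslash
    for ch in text:
        if escaped:
            # The character after a backslash is always kept literally.
            escaped = False
        elif ch == "\\":
            escaped = True
        elif quote is None and ch in "'\"":
            quote = ch            # opening quote
        elif ch == quote:
            quote = None          # matching closing quote
        elif ch == sep and quote is None:
            fields.append([])     # start a new field and drop the separator
            continue
        fields[-1].append(ch)
    return ["".join(field) for field in fields]


sample = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
print(split_outside_quotes(sample))
```

Unlike a field-matching regex with `+`, this version also keeps empty fields, so `a;;b` yields three items.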
Amber
The question does not require balancing at all - just enclosing and single-character escaping. It's a pretty straightforward (and actually formally regular) pattern.
Max Shawabkeh
Actually, the only reason `findall` works is due to the additional restriction implemented in Python that the returned matches be *non-overlapping*. Otherwise, a string like `'''part 1;"this 'is' sparta";part 2'''` would fail due to the pattern also matching a subset of the string.
Amber
I'm using `findall` because we need to extract the string. Formally, regular expressions only do matching. To match, we can simply use `^mypattern(;mypattern)*$`.
Max Shawabkeh
However, doing so gives up, as you point out, the ability to extract the text in a nice manner (though I suppose you could iterate through an indefinite number of captures).
Amber
Oh, yours is much nicer than mine. :)
Ipsquiggle
+2  A: 
>>> x = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
>>> import re
>>> re.findall(r'''(?:[^;'"]+|'(?:[^']|\\.)*'|"(?:[^"]|\\.)*")+''', x)
['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']
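
A possible variant, offered as a sketch and not as the answer's own pattern: if backslash-escaped quotes need to be supported, the backslash has to be excluded from the "ordinary character" classes so that the `\\.` alternative gets a chance to match. The names `FIELD` and `split_fields` are assumptions made for illustration.

```python
import re

FIELD = re.compile(r"""
    (?:
        [^;'"\\]              # ordinary character
      | \\.                   # backslash escape outside quotes
      | '(?:[^'\\]|\\.)*'     # single-quoted run, escapes allowed
      | "(?:[^"\\]|\\.)*"     # double-quoted run, escapes allowed
    )+
""", re.VERBOSE)


def split_fields(text):
    """Return every non-empty field; quotes are kept in the result."""
    return FIELD.findall(text)


x = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
print(split_fields(x))
```

Because the pattern requires at least one character per field, empty fields (two semicolons in a row) are silently skipped.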
Max Shawabkeh
Fails on the following string: `'''part 1;"this is ';' part 2;";'this is ; part 3';part 4'''`
Amber
Right. Fixed. Forgot to swap the single/double quotes in the second part.
Max Shawabkeh
I'm sorry, I missed something in my test case. See part 5 in my question. Thanks
Sly
Your 5th test case is probably going to render this solution much less viable.
Amber
Ok, I really just want to ignore semicolons inside quotes. I don't want quotes to act as separators.
Sly
@Sly: Updated to support #5.
Max Shawabkeh
A: 

This seemed to me a semi-elegant solution.

New Solution:

import re
reg = re.compile('(\'|").*?\\1')
pp = re.compile('.*?;')
def splitter(string):
    #add a last semicolon
    string += ';'
    replaces = []
    s = string
    i = 1
    #replace the content of each quote for a code
    for quote in reg.finditer(string):
        out = string[quote.start():quote.end()]
        s = s.replace(out, '**' + str(i) + '**')
        replaces.append(out)
        i+=1
    #split the string without quotes
    res = pp.findall(s)

    #add the quotes again
    #TODO this part could be faster.
    #(linear instead of quadratic)
    i = 1
    for replace in replaces:
        for x in range(len(res)):
            res[x] = res[x].replace('**' + str(i) + '**', replace)
        i+=1
    return res

Old solution:

I chose to match an optional opening quote, wait for it to close, and then match a terminating semicolon. Each "part" you want to match needs to end in a semicolon, so this matches things like:

  • 'foobar;.sska';
  • "akjshd;asjkdhkj..,";
  • asdkjhakjhajsd.jhdf;

Code:

mm = re.compile('''((?P<quote>'|")?.*?(?(quote)\\2|);)''')
res = mm.findall('''part 1;"this is ; part 2;";'this is ; part 3';part 4''')

You may have to do some postprocessing on res, but it contains what you want.

noinflection
A: 

Even though I'm certain there is a clean regex solution (so far I like @noinflection's answer), here is a quick-and-dirty non-regex answer.

s = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""

inQuotes = False
current = ""
results = []
currentQuote = ""
for c in s:
    if not inQuotes and c == ";":
        results.append(current)
        current = ""
    elif not inQuotes and (c == '"' or c == "'"):
        currentQuote = c
        inQuotes = True
    elif inQuotes and c == currentQuote:
        currentQuote = ""
        inQuotes = False
    else:
        current += c

results.append(current)

print(results)
# ['part 1', 'this is ; part 2;', 'this is ; part 3', 'part 4', 'this is ; part 5']

(I've never put together something of this sort, feel free to critique my form!)

Ipsquiggle
+2  A: 

You appear to have a semicolon-separated string. Why not use the csv module to do all the hard work?

Off the top of my head, this should work

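import csv
from io import StringIO  # on Python 2: from StringIO import StringIO

line = '''part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''

# wrap the string in a file-like object so csv.reader can iterate over it
data = StringIO(line)
reader = csv.reader(data, delimiter=';')
for row in reader:
    print(row)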

This should give you something like
("part 1", "this is ; part 2;", 'this is ; part 3', "part 4", "this \"is ; part\" 5")

Edit:
Unfortunately, this doesn't quite work (even if you do use StringIO, as I intended), because the string mixes single and double quotes. What you actually get is

['part 1', 'this is ; part 2;', "'this is ", " part 3'", 'part 4', 'this "is ', ' part" 5'].

If you can change the data to only contain single or double quotes at the appropriate places, it should work fine, but that sort of negates the question a bit.
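
To illustrate that last point, here is a small sketch with made-up input lines (not the asker's data) in which every quoted field is quoted as a whole and only one quote character is used per line. Under those assumptions `csv.reader` splits as intended, though it also strips the quotes.

```python
import csv

# Only double quotes, each wrapping an entire field.
line = 'part 1;"this is ; part 2;";part 4'
rows = list(csv.reader([line], delimiter=";"))
print(rows)

# Only single quotes: tell the reader which quote character to expect.
single = "part 1;'this is ; part 3';part 4"
rows_single = list(csv.reader([single], delimiter=";", quotechar="'"))
print(rows_single)
```

Passing the string inside a list works because `csv.reader` accepts any iterable of lines.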

Simon Callan
+1: csv.reader takes an iterable, so you need to wrap the input string in a list: `csv.reader([data], delimiter=';')`. Apart from that it does exactly what the user wants. This will also handle embedded quote characters prefixed with a backslash.
Dave Kirby
actually, csv module isn't that smart, doesn't work when I tested. his data has both single quotes and double quotes, and csv module cannot handle `this "is ; part" 5` as single block, which result in `['part 1', 'this is ; part 2;', "'this is ", " part 3'", 'part 4', 'this "is ', ' part" 5']`
S.Mark
The csv module not only doesn't handle more than one quote type, but it also insists that fields are entirely quoted or not quoted at all. That means part 5 will be split in two, because a double quote in the middle of a field is just a literal and does not quote the content. I'm afraid in this case the options are (a) use an excessively complex regular expression, or (b) get the format of the input data changed to use some recognisable variant of CSV. If it were me, I'd go for option (b).
Duncan
A: 

My approach is to replace all non-quoted occurrences of the semi-colon with another character which will never appear in the text, then split on that character. The following code uses the re.sub function with a function argument to search and replace all occurrences of a srch string, not enclosed in single or double quotes or parens, brackets or braces, with a repl string:

import re

def srchrepl(srch, repl, string):
    """
    Replace non-bracketed/quoted occurrences of srch with repl in string.
    """
    # both pattern halves are raw strings so that \] reaches the regex engine intact
    resrchrepl = re.compile(r"""(?P<lbrkt>[([{])|(?P<quote>['"])|(?P<sep>["""
                          + srch + r"""])|(?P<rbrkt>[)\]}])""")
    return resrchrepl.sub(_subfact(repl), string)


def _subfact(repl):
    """
    Replacement function factory for regex sub method in srchrepl.
    """
    level = 0
    qtflags = 0
    def subf(mo):
        nonlocal level, qtflags
        sepfound = mo.group('sep')
        if  sepfound:
            if level == 0 and qtflags == 0:
                return repl
            else:
                return mo.group(0)
        elif mo.group('lbrkt'):
            if qtflags == 0:
                level += 1
            return mo.group(0)
        elif mo.group('quote') == "'":
            qtflags ^= 1            # toggle bit 1
            return "'"
        elif mo.group('quote') == '"':
            qtflags ^= 2            # toggle bit 2
            return '"'
        elif mo.group('rbrkt'):
            if qtflags == 0:
                level -= 1
            return mo.group(0)
    return subf

If you don't care about the bracketed characters, you can simplify this code a lot.
Say you wanted to use a pipe or vertical bar as the substitute character, you would do:

mylist = srchrepl(';', '|', mytext).split('|')

BTW, this uses nonlocal, which requires Python 3. On Python 2 you would need to keep level and qtflags somewhere else, for example as module-level globals.
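
A rough idea of what the simplified, quotes-only version might look like is sketched below. This is an assumption-laden illustration and not the answer's code: `replace_unquoted` is a hypothetical name, and instead of the two toggle bits it remembers which quote character is currently open, so an apostrophe inside a double-quoted run is treated as plain text.

```python
import re


def replace_unquoted(sep, repl, text):
    """Replace occurrences of sep that are outside quotes with repl."""
    pattern = re.compile(r"""(?P<quote>['"])|(?P<sep>%s)""" % re.escape(sep))
    open_quote = None  # quote character currently open, if any

    def substitute(match):
        nonlocal open_quote
        ch = match.group("quote")
        if ch:
            if open_quote is None:
                open_quote = ch        # entering a quoted run
            elif open_quote == ch:
                open_quote = None      # leaving it
            return ch
        # A separator: replace it only when no quote is open.
        return repl if open_quote is None else match.group(0)

    return pattern.sub(substitute, text)


mytext = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
mylist = replace_unquoted(";", "|", mytext).split("|")
print(mylist)
```

As with the original, the substitute character must be one that never occurs in the text.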

Don O'Donnell
+5  A: 

Most of the answers seem massively overcomplicated. You don't need backreferences. You don't need to depend on whether or not re.findall gives overlapping matches. Given that the input cannot be parsed with the csv module, a regular expression is pretty much the only way to go, and all you need is to call re.split with a pattern that matches a field.

Note that it is much easier here to match a field than it is to match a separator:

import re
data = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
PATTERN = re.compile(r'''((?:[^;"']|"[^"]*"|'[^']*')+)''')
print(PATTERN.split(data)[1::2])

and the output is:

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']
Duncan
+1, I like this one, quite clean and make sense for me.
S.Mark
oh btw, `[^;"']+` would be better than `([^;"']...)+` I think
S.Mark
I don't think that `[^;"']+` helps. You still need the + outside the group to handle something that is a mix of ordinary characters and quoted elements. Elements which can repeat and themselves contain repeats are a great way to kill regular expression matching so should be avoided when possible.
Duncan
+1  A: 
re.split(''';(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', data)
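
For readers who find the one-liner dense, the same lookahead can be spelled out with `re.VERBOSE`. This is only an annotated restatement under the assumption that quotes are balanced and never escaped; the name `SEPARATOR` is made up for the example.

```python
import re

SEPARATOR = re.compile(r"""
    ;                  # a semicolon
    (?=                # followed, all the way to the end of the string, by
        (?:
            [^'"]      #   a non-quote character,
          | '[^']*'    #   or a complete single-quoted string,
          | "[^"]*"    #   or a complete double-quoted string
        )*
        $
    )
""", re.VERBOSE)

data = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
print(SEPARATOR.split(data))
```

A semicolon inside quotes is followed by an unmatched closing quote, so the lookahead fails there and no split happens. Because this matches separators instead of fields, empty fields are preserved; the price is that the lookahead rescans the rest of the string for every semicolon.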
Alan Moore
+3  A: 

Here is an annotated pyparsing approach:

from pyparsing import (printables, originalTextFor, OneOrMore, 
    quotedString, Word, delimitedList)

# unquoted words can contain anything but a semicolon
printables_less_semicolon = printables.replace(';','')

# capture content between ';'s, and preserve original text
content = originalTextFor(
    OneOrMore(quotedString | Word(printables_less_semicolon)))

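# process the string (test holds the sample string from the question)
test = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
print(delimitedList(content, ';').parseString(test))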

giving

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 
 'this "is ; part" 5']

By using pyparsing's provided quotedString, you also get support for escaped quotes.

You were also unclear about how to handle whitespace before or after a semicolon delimiter, and none of the fields in your sample text has any. Pyparsing would parse "a; b ; c" as:

['a', 'b', 'c']
Paul McGuire
+1 I was about to post a pyparsing solution but yours is more elegant
Luper Rouch
+1  A: 

Since your string does not contain '\n', you can use it to replace every ';' that is not inside a quoted string, and then split on it:

>>> new_s = ''
>>> is_open = False

>>> for c in s:
...     if c == ';' and not is_open:
...         c = '\n'
...     elif c in ('"',"'"):
...         is_open = not is_open
...     new_s += c

>>> result = new_s.split('\n')

>>> result
['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']
remosu
Clean and simple. Since it's just a simple string, no need to worry about efficiency. To handle nested quotes, may need to tweak the elif statement.
Dingle
A: 

I am new to the site and cannot comment yet, so I have to post this as an answer.

The top answer fails to handle empty fields (note the two consecutive semicolons): part 1;;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5

Please fix your solution.
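
One possible way to keep empty fields, sketched here under the assumption that quotes are balanced and unescaped: anchor each field to either the start of the string or the semicolon before it, and allow the field itself to be empty. The names `FIELD_AFTER_SEP` and `split_keep_empty` are hypothetical.

```python
import re

# Each match is "start of string or a semicolon", then one possibly empty field.
FIELD_AFTER_SEP = re.compile(r"""(?:^|;)((?:[^;"']|"[^"]*"|'[^']*')*)""")


def split_keep_empty(text):
    """Return all fields, including empty ones; quotes are kept."""
    return FIELD_AFTER_SEP.findall(text)


data = '''part 1;;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5'''
print(split_keep_empty(data))
```

An empty field at the very start of the string depends on how `re.findall` treats a non-empty match right after an empty one, which changed in Python 3.7, so that edge case is worth testing on the version in use.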