I have a string which is like this:

this is "a test"

I'm trying to write something in Python to split it up by space while ignoring spaces within quotes. The result I'm looking for is:

['this','is','a test']

PS. I know you are going to ask "what happens if there are quotes within the quotes?" Well, in my application, that will never happen.

+44  A: 

You want split, from the shlex module.

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

It has some idiosyncrasies, but it should do what you want. As noted in the comments, don't use it with unicode strings or you will get bad behaviour.
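To make the "idiosyncrasies" a bit more concrete, here is a small sketch (assuming Python 3; the sample strings are only for illustration) of three behaviours that tend to surprise people: single quotes are honoured too, backslashes act as escape characters in the default POSIX mode, and an unbalanced quote raises an exception.

```python
import shlex

# Single quotes group words just like double quotes, and the quotes are removed.
print(shlex.split("this is 'a test'"))                # ['this', 'is', 'a test']

# posix=False keeps the quote characters as part of the token.
print(shlex.split('this is "a test"', posix=False))   # ['this', 'is', '"a test"']

# In the default POSIX mode a backslash outside quotes escapes the next
# character, so it disappears from the result (watch out for Windows paths).
print(shlex.split(r'C:\temp file'))                   # ['C:temp', 'file']

# An unbalanced quote raises ValueError instead of returning a partial result.
try:
    shlex.split('this is "a test')
except ValueError as exc:
    print("could not split:", exc)
```

If the input comes from users, wrapping the call in a try/except as above is probably worth it.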

Jerub
Brilliant! Exactly what I was looking for. Thanks.
Adam Pierce
It is so true that everything you could possibly want as a programmer is already in the Python libraries.
William
Oh man, in python version 2.5.1 and greater the `shlex.split()` does not work for unicode. E.g. `shlex.split(u"test test")` produces crap such as `'t\x00e\x00s\x00t\x00', '\x00t\x00e\x00s\x00t\x00'`, see the following issue discussion for more details http://bugs.python.org/issue6988
Ciantic
+13  A: 

Have a look at the shlex module, particularly shlex.split.

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']

Allen
same-time identical answer :O
orlandu63
+1  A: 

Try this:

  def adamsplit(s):
    result = []
    inquotes = False
    for substring in s.split('"'):
      if not inquotes:
        result.extend(substring.split())
      else:
        result.append(substring)
      inquotes = not inquotes
    return result
pjz
This won't work with: "This is 'a test'"
Matthew Schinckel
A: 

If you don't care about substrings, then a simple split() will do:

>>> 'a short sized string with spaces '.split()

Performance:

>>> import timeit
>>> s = " ('a short sized string with spaces '*100).split() "
>>> t = timeit.Timer(stmt=s)
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
171.39 usec/pass

Or string module

>>> from string import split as stringsplit
>>> stringsplit('a short sized string with spaces '*100)

Performance: String module seems to perform better than string methods

>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
154.88 usec/pass

Or you can use RE engine

>>> from re import split as resplit
>>> regex = r'\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)

Performance

>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
540.21 usec/pass

For very long strings, you should not load the entire string into memory; instead, either process the input line by line or use an iterative loop.
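A minimal sketch of the line-by-line idea, assuming Python 3 and assuming that no quoted section spans a line break. `iter_tokens` is a hypothetical helper name, and it uses shlex.split rather than str.split so that quoted sections stay together:

```python
import io
import shlex

def iter_tokens(lines):
    """Yield tokens one at a time from an iterable of lines (e.g. an open file)."""
    for line in lines:
        # Only the current line is held in memory and split.
        for token in shlex.split(line):
            yield token

# Usage: io.StringIO stands in for a large file opened with open(...).
sample = io.StringIO('this is "a test"\nand more "stuff here"\n')
for token in iter_tokens(sample):
    print(token)
```

A quote that is opened on one line and closed on the next would make shlex.split raise ValueError here, so this only fits line-oriented input.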

Gregory
You seem to have missed the whole point of the question. There are quoted sections in the string that need to not be split.
rjmunro
+4  A: 

Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quoted parts with \x00, then split by spaces, then replace the \x00 back to spaces in each part.

Both versions do the same thing, but splitter is a bit more readable than splitter2.

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print splitter2(s)
gooli
You should have used re.Scanner instead. It's more reliable (and I have in fact implemented a shlex-like using re.Scanner).
Devin Jeanpierre
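For reference, the re.Scanner suggestion in the comment above could look roughly like the sketch below (Python 3). Note that re.Scanner is an undocumented class in CPython's re module, so relying on it is an assumption rather than a supported API; `scan_split` and the token rules are hypothetical and only cover plain quoted sections without escapes.

```python
import re

# Each rule is (pattern, action); rules are tried in order at every position.
scanner = re.Scanner([
    (r'"[^"]*"', lambda s, tok: tok[1:-1]),   # double-quoted section, quotes dropped
    (r"'[^']*'", lambda s, tok: tok[1:-1]),   # single-quoted section, quotes dropped
    (r'\S+', lambda s, tok: tok),             # bare word
    (r'\s+', None),                           # whitespace is skipped
])

def scan_split(text):
    # scan() returns (tokens, unmatched_remainder); the \S+ and \s+ rules
    # together match any character, so the remainder is always empty here.
    tokens, remainder = scanner.scan(text)
    return tokens

print(scan_split('this is "a test" some text "another test"'))
```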
+6  A: 

I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?:

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]

Explanation:

[\\\"'] = double-quote or single-quote (first version)
.* = anything (greedy); .*? = anything (non-greedy, second version)
( |X) = space or X
.strip() = remove space and empty-string separators

shlex probably provides more features, though.
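The same "whitespace or thing-surrounded-by-quotes" idea can also be written as a match on the tokens rather than a split on the separators, which avoids filtering out empty pieces and drops the surrounding quotes in one step. This is only a sketch in Python 3 using a raw string; `split_quoted` and `TOKEN` are hypothetical names, and it does not handle escaped quotes.

```python
import re

# Double-quoted section, single-quoted section, or a run of non-whitespace.
TOKEN = re.compile(r'"([^"]*)"|\'([^\']*)\'|(\S+)')

def split_quoted(text):
    # Exactly one of the three groups takes part in each match;
    # m.lastindex is the number of that group.
    return [m.group(m.lastindex) for m in TOKEN.finditer(text)]

print(split_quoted('this is "a test"'))            # ['this', 'is', 'a test']
print(split_quoted('He said, "Don\'t do that!"'))  # ['He', 'said,', "Don't do that!"]
```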

I was thinking much the same, but would suggest instead [t.strip('"') for t in re.findall(r'[^\s"]+|"[^"]*"', 'this is "a test"')]
Darius Bacon
What does that split do when there are apostrophes inside the double quotes: He said, "Don't do that!" I think it will treat <"Don'> as one unit, won't it?
Jonathan Leffler
Jonathan: in this case, no, I made two mistakes that cancel each other out in that case: the greedy .* will go to the final ". :-) I should have said "( |\\\".*?\\\"|'.*?')". Nice catch.
+1 I'm using this because it was a heck of a lot faster than shlex.
hanleyp
+1 from me, the 2nd regex (comments) works for my needs whereas the first doesn't. As such I've edited in the second regex but left the first easily visible.
Ninefingers
P.S. this is excellent, I don't need the features of shlex, just a split like argv. I'd give it +2 if I could.
Ninefingers
that code almost looks like perl, haven't you heard of r"raw strings"?
SpliFF
Consider this data: string = r'simple "quot ed" "ignore the escape with quotes\\" "howboutthemapostrophe\'s?" "\"withescapedquotes\"" "\"with unbalanced escaped quotes"' The Jonathan / Kate / Ninefingers update botches the withescapedquotes term, splitting it into three (degenerate-quote-alone, withescapedquotes, another-degenerate). shlex.split(string) is fine. Can that be done via re?
jackr
+2  A: 

Depending on your use case, you may also want to check out the csv module:

import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print row

Output:

['this', 'is', 'a string']
['and', 'more', 'stuff']
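One thing to watch for with this approach: the csv reader treats every single space as a delimiter, so runs of spaces produce empty fields. A small sketch (Python 3; `csv_split` is a hypothetical helper) that drops those empty fields:

```python
import csv

def csv_split(line):
    # csv.reader expects an iterable of lines, so wrap the single line in a list.
    row = next(csv.reader([line], delimiter=" "))
    # Two spaces in a row yield an empty field between them; filter those out.
    return [field for field in row if field]

print(next(csv.reader(['this  is "a string"'], delimiter=" ")))  # ['this', '', 'is', 'a string']
print(csv_split('this  is "a string"'))                          # ['this', 'is', 'a string']
```

The filter also discards a deliberately empty quoted field (""), so it only fits cases where empty tokens are never meaningful.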
Ryan Ginstrom
A: 

Hmm, can't seem to find the "Reply" button... anyway, this answer is based on the approach by Kate, but correctly splits strings with substrings containing escaped quotes and also removes the start and end quotes of the substrings:

  [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).

If the resulting escapes in the strings in the returned list are not wanted, you can use this slightly altered version (Python 2 only, since it relies on the string_escape codec):

[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
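For Python 3, where the string_escape codec is not available, a rough equivalent can be sketched by matching tokens instead of splitting. This is not the code above, just an illustration under the assumption that a backslash inside a quoted section escapes the following character; `split_escaped` and `TOKEN` are hypothetical names.

```python
import re

# Inside quotes, accept either a backslash followed by any character
# or any character that is neither the closing quote nor a backslash.
TOKEN = re.compile(r'''"((?:\\.|[^"\\])*)"|'((?:\\.|[^'\\])*)'|(\S+)''')

def split_escaped(text):
    tokens = []
    for m in TOKEN.finditer(text):
        token = m.group(m.lastindex)
        if m.lastindex in (1, 2):
            # Quoted section: turn \x into x (so \" becomes ").
            token = re.sub(r'\\(.)', r'\1', token)
        tokens.append(token)
    return tokens

print(split_escaped(r'This is "a \"test\" substring"'))
# ['This', 'is', 'a "test" substring']
```

Bare words are left untouched, so a backslash outside quotes is kept as-is, which differs from what shlex.split does.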
Cybolic