As the title says, I need a way to remove all whitespace from a string, except when that whitespace is between quotes.

result = re.sub('".*?"', "", content)

This matches anything between quotes, but what I actually need is to leave those matches alone and strip the whitespace everywhere else.

+1  A: 

You can use shlex.split for a quotation-aware split, and join the result using " ".join. E.g.

import shlex
print(" ".join(shlex.split('Hello "world     this    is" a    test')))
Ivo van der Wijk
Your example gave me 'Hello world this is a test' instead of 'Hello"world this is"atest'
Oli
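A possible tweak to the shlex idea, offered as a sketch rather than a tested fix: `shlex.split` accepts `posix=False`, which keeps the quote characters on quoted tokens, and joining with an empty string instead of a space then drops the whitespace between tokens. The function name `strip_outside_quotes` is made up for illustration. This assumes every quoted segment starts at a word boundary; shlex also treats single quotes as quote characters, so a stray leading apostrophe could trip it up.

```python
import shlex

def strip_outside_quotes(text):
    # posix=False makes shlex keep the surrounding quote characters on
    # quoted tokens, so the quoted text (spaces included) survives intact.
    # Joining with "" removes the whitespace that separated the tokens.
    return "".join(shlex.split(text, posix=False))

print(strip_outside_quotes('Hello "world     this    is" a    test'))
```

Like the original suggestion, this raises ValueError if a quote is never closed.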
+2  A: 

I don't think you're going to be able to do that with a single regex. One way to do it is to split the string on quotes, apply the whitespace-stripping regex to every other item of the resulting list, and then re-join the list.

import re

def stripwhite(text):
    lst = text.split('"')
    for i, item in enumerate(lst):
        # Even indices are outside quotes; odd indices are quoted text.
        if not i % 2:
            lst[i] = re.sub(r"\s+", "", item)
    return '"'.join(lst)

print(stripwhite('This is a string with some "text in quotes."'))
kindall
+1 for a working solution!
jathanism
Someone will be along shortly to replace it with a one-line list comprehension, I am sure. :-)
kindall
@kindall: hahaha - i actually missed the remark on the one-liner till after posting mine. I did build on your idea though. ++
Nas Banov
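As an aside to the split-on-quotes approach above: a single `re.sub` pass may also work if the replacement is a function instead of a string. The sketch below matches either a complete quoted segment or a run of whitespace, and keeps only the former. The names `_TOKEN` and `strip_outside_quotes_re` are hypothetical, and it assumes plain double quotes with no escaped quotes inside them.

```python
import re

# Either a complete double-quoted segment, or a run of whitespace.
# The quoted alternative comes first so it wins at a quote character.
_TOKEN = re.compile(r'"[^"]*"|\s+')

def strip_outside_quotes_re(text):
    # Quoted segments are returned unchanged; whitespace runs matched
    # outside of them are replaced with the empty string.
    return _TOKEN.sub(
        lambda m: m.group(0) if m.group(0).startswith('"') else "",
        text,
    )

print(strip_outside_quotes_re('This is a string with some "text in quotes."'))
```

Note that an unpaired quote is not reported here: it is left in place and the whitespace after it is stripped like any other.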
A: 

Here is a slightly longer version that checks for a quote without a matching pair. It only deals with one style of start and end delimiter at a time, but it is adaptable, for example with start, end = '()'.

start, end = '"', '"'

for test in ('Hello "world this is" atest',
             'This is a string with some " text inside in quotes."',
             'This is without quote.',
             'This is sentence with bad "quote'):
    result = ''

    while start in test:
        clean, _, test = test.partition(start)
        clean = clean.replace(' ', '') + start
        inside, tag, test = test.partition(end)
        if not tag:
            raise SyntaxError('Missing end quote %s' % end)
        else:
            clean += inside + tag  # whitespace inside the quotes is kept
        result += clean
    result += test.replace(' ', '')
    print(result)
Tony Veijalainen
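To illustrate the "adaptable" remark in the answer above, here is a rough sketch of the same partition loop wrapped in a function with the delimiters as parameters. The name `strip_outside` and its signature are invented for this example, and it raises ValueError where the original raises SyntaxError. Like the original, it removes only space characters, not tabs or newlines.

```python
def strip_outside(text, start='"', end='"'):
    """Remove spaces outside start/end delimited segments (sketch)."""
    result = ''
    while start in text:
        # Text before the opening delimiter gets its spaces removed.
        outside, _, text = text.partition(start)
        # Text up to the closing delimiter is copied through untouched.
        inside, tag, text = text.partition(end)
        if not tag:
            raise ValueError('Missing end delimiter %s' % end)
        result += outside.replace(' ', '') + start + inside + tag
    # Whatever follows the last delimited segment is stripped as well.
    return result + text.replace(' ', '')

print(strip_outside('Hello "world this is" atest'))
print(strip_outside('f (a b) g h', '(', ')'))
```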
+1  A: 

Here is a one-liner version, based on @kindall's idea - yet it does not use regex at all! First split on ", then split() every other item and re-join the pieces; that takes care of the whitespace:

stripWS = lambda txt:'"'.join( it if i%2 else ''.join(it.split())
    for i,it in enumerate(txt.split('"'))  )

Usage example:

>>> stripWS('This is a string with some "text in quotes."')
'Thisisastringwithsome"text in quotes."'
Nas Banov
I regret that I have but one upvote to give for your solution.
kindall