If I have a string

"this is   a    string"

How can I shorten it so that there is only one space between the words rather than several? (The amount of whitespace between words varies.)

"this is a string"
+5  A: 
import re
re.sub(r'\s+', ' ', 'this is   a    string')

You can pre-compile and store this for potentially better performance:

MULT_SPACES = re.compile(r'\s+')
MULT_SPACES.sub(' ', 'this is   a    string')
Matthew Flaschen
A compiled Python regex object has all the regex methods itself, so you could simplify the substitution to: MULT_SPACES.sub(' ', 'this is   a    string')
Josh Wright
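One detail that may matter with the regex approach: `\s+` matches tabs and newlines as well as spaces, and leading or trailing whitespace is reduced to a single space rather than removed. A minimal sketch to illustrate (the helper name `collapse_ws` is made up here, not part of the answer):

```python
import re

# Compile once at module level so the pattern object is reused on every call.
MULT_SPACES = re.compile(r'\s+')

def collapse_ws(text):
    """Replace every run of whitespace in text with a single space."""
    return MULT_SPACES.sub(' ', text)

print(repr(collapse_ws('this is   a    string')))  # 'this is a string'
print(repr(collapse_ws('  padded\t\tstring\n')))   # ' padded string '
```

If the leftover padding is unwanted, calling `.strip()` on the result should take care of it.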
+11  A: 

You could use str.split() and " ".join(list) to do this in a reasonably Pythonic way. There are probably more efficient algorithms, but they won't look as nice.

Incidentally, this is a lot faster than using a regex, at least on the sample string:

import re
import timeit

s = "this    is   a     string"

def do_regex():
    for x in xrange(100000):
        a = re.sub(r'\s+', ' ', s)

def do_join():
    for x in xrange(100000):
        a = " ".join(s.split())


if __name__ == '__main__':
    t1 = timeit.Timer(do_regex).timeit(number=5)
    print "Regex: ", t1
    t2 = timeit.Timer(do_join).timeit(number=5)
    print "Join: ", t2


$ python revsjoin.py 
Regex:  2.70868492126
Join:  0.333452224731

Compiling this regex does improve performance, but only if you call sub on the compiled regex object itself, rather than passing the compiled pattern to re.sub as an argument:

def do_regex_compile():
    pattern = re.compile(r'\s+')
    for x in xrange(100000):
        # Don't do this
        # a = re.sub(pattern, ' ', s)
        a = pattern.sub(' ', s)

$ python revsjoin.py  
Regex:  2.72924399376
Compiled Regex:  1.5852200985
Join:  0.33763718605
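
The script above is Python 2 (`xrange`, `print` statement). For anyone wanting to repeat the comparison on Python 3, a rough equivalent might look like the sketch below. It is not the original script: the function names are invented for illustration, and the pattern is compiled once outside the timed code.

```python
import re
import timeit

s = "this    is   a     string"
pattern = re.compile(r'\s+')  # compiled once, outside the timed functions

def with_re_sub():
    # Module-level re.sub: looks the pattern up in re's cache on each call.
    return re.sub(r'\s+', ' ', s)

def with_compiled():
    # Method call on the precompiled pattern object.
    return pattern.sub(' ', s)

def with_join():
    # Plain string methods, no regex involved.
    return " ".join(s.split())

if __name__ == '__main__':
    for func in (with_re_sub, with_compiled, with_join):
        # timeit.timeit calls func `number` times and returns total seconds.
        elapsed = timeit.timeit(func, number=100000)
        print(f"{func.__name__}: {elapsed:.3f}s")
```

Absolute timings depend on the machine and interpreter version, so none are quoted here.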
Nick Bastin
Why would you compile it for every call? The whole point of compiling it is to reuse it. Compiling the regex object once cuts the regex runtime in half on my system (it's still almost 3 times as slow as the string methods) (EDIT: Misread your compiling explanation... interesting that we got different results though)
Josh Wright
Yeah, I only compile it for every "do_regex" call, which is still using the compiled version 100k times for every compile - this ended up being 2.81 seconds on my system.
Nick Bastin
(Added my compiling results to the answer with faster `compiled.sub()` pattern)
Nick Bastin
+1  A: 

Try this:

s = "this is   a    string"
tokens = s.split()
neat_s = " ".join(tokens)

The string's split function returns a list of non-empty tokens split on whitespace. So if you try

"this is   a    string".split()

you will get back

['this', 'is', 'a', 'string']

The string's join function will join a list of tokens together using the string itself as a delimiter. In this case we want a space, so

" ".join("this is   a    string".split())

This splits on runs of whitespace, which yields no empty tokens, then joins the tokens again with single spaces. For more about string operations, check out Python's common string function documentation.

EDIT: I misunderstood what happens when you pass a delimiter to the split function. See markuz's answer for this.

Ben Gartner
+1  A: 

Pretty much the same answer as Ben Gartner's, but this adds the "if this is not an empty string" check.

>>> a = 'this is   a    string'
>>> ' '.join([k for k in a.split(" ") if k])
'this is a string'
>>> 

If you don't check for empty strings, you'll get this:

>>> ' '.join([k for k in a.split(" ")])
'this is   a    string'
>>>
markuz
It seems that if you do a.split(), the empty strings are removed automatically. If you do a.split(" "), they are included. I'll update my answer to reflect that.
Ben Gartner
As per the documentation: "If `sep` is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings"
Nick Bastin
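
The behaviour quoted from the documentation is easy to check on a short string. A minimal sketch (the sample string here is chosen for brevity and is not from the question):

```python
a = 'a  b'  # two spaces between the letters

# No separator: a run of whitespace counts as one delimiter.
no_sep = a.split()
print(no_sep)    # ['a', 'b']

# Explicit ' ': each space is its own delimiter, so the run
# of two spaces produces an empty string between them.
with_sep = a.split(' ')
print(with_sep)  # ['a', '', 'b']

# Joining the unfiltered list puts both spaces back.
print(repr(' '.join(with_sep)))  # 'a  b'
```

This is why the list comprehension with `if k` is needed when splitting on `" "` explicitly, while a bare `split()` needs no filtering.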