ansaurus

Question

Eliminating multiple occurrences of whitespace in a string in Python

Answer 1

+5 A:

import re
re.sub(r'\s+', ' ', 'this is   a    string')

You can pre-compile and store this for potentially better performance:

MULT_SPACES = re.compile(r'\s+')
MULT_SPACES.sub(' ', 'this is   a    string')
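
As a small sketch of what the `\s+` pattern covers (the sample string below is made up for illustration, it is not from the question): `\s` matches tabs and newlines as well as spaces, and leading or trailing runs are collapsed to a single space rather than removed, so a `strip()` is needed if that matters to you.

```python
import re

MULT_SPACES = re.compile(r'\s+')

# Hypothetical sample input mixing spaces, a tab and a newline.
messy = '  this is \t a \n   string  '

collapsed = MULT_SPACES.sub(' ', messy)
print(repr(collapsed))          # ' this is a string '

# Leading/trailing runs become one space each; strip() removes them.
print(repr(collapsed.strip()))  # 'this is a string'
```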

Matthew Flaschen 2010-06-01 15:32:00

A compiled Python regex object provides the same methods as the module-level regex functions. So, you could simplify the substitution to: MULT_SPACES.sub(' ', 'this is   a    string')

Josh Wright 2010-06-01 17:33:53

Answer 2

+11 A:

You could use str.split() and " ".join(list) to make this happen in a reasonably Pythonic way. There are probably more efficient algorithms, but they won't look as nice.

Incidentally, this is a lot faster than using a regex, at least on the sample string:

import re
import timeit

s = "this    is   a     string"

def do_regex():
    # Uncompiled pattern passed to re.sub on every iteration
    for x in range(100000):
        a = re.sub(r'\s+', ' ', s)

def do_join():
    # Split on runs of whitespace, then rejoin with single spaces
    for x in range(100000):
        a = " ".join(s.split())


if __name__ == '__main__':
    t1 = timeit.Timer(do_regex).timeit(number=5)
    print("Regex: ", t1)
    t2 = timeit.Timer(do_join).timeit(number=5)
    print("Join: ", t2)


$ python revsjoin.py 
Regex:  2.70868492126
Join:  0.333452224731

Compiling this regex does improve performance, but only if you call sub on the compiled pattern object itself, rather than passing the compiled pattern to re.sub as an argument:

def do_regex_compile():
    pattern = re.compile(r'\s+')
    for x in range(100000):
        # Don't do this:
        # a = re.sub(pattern, ' ', s)
        a = pattern.sub(' ', s)

$ python revsjoin.py  
Regex:  2.72924399376
Compiled Regex:  1.5852200985
Join:  0.33763718605
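
For anyone wanting to rerun the comparison on Python 3, here is a minimal sketch under a few assumptions: the function names are mine, the pattern is compiled once outside the timed code, and 100000 calls per measurement is an arbitrary choice. Timings depend on the machine and Python version, so no numbers are claimed here.

```python
import re
import timeit

s = "this    is   a     string"
pattern = re.compile(r'\s+')  # compiled once, outside the timed code

def with_re_sub():
    return re.sub(r'\s+', ' ', s)

def with_compiled():
    return pattern.sub(' ', s)

def with_join():
    return " ".join(s.split())

# All three agree on this input (it has no leading/trailing whitespace).
assert with_re_sub() == with_compiled() == with_join() == "this is a string"

# number is the call count per measurement.
for func in (with_re_sub, with_compiled, with_join):
    elapsed = timeit.timeit(func, number=100000)
    print("%-14s %.3f s" % (func.__name__, elapsed))
```

Passing the callables straight to `timeit.timeit` keeps the loop inside timeit rather than inside each function, which makes the per-call cost easier to compare.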

Nick Bastin 2010-06-01 15:32:30

Why would you compile it for every call? The whole point of compiling it is to reuse it. Compiling the regex object once cuts the regex runtime in half on my system (it's still almost 3 times as slow as the string methods) (EDIT: Misread your compiling explanation... interesting that we got different results though)

Josh Wright 2010-06-01 17:38:32

Yeah, I only compile it for every "do_regex" call, which is still using the compiled version 100k times for every compile - this ended up being 2.81 seconds on my system.

Nick Bastin 2010-06-01 17:45:41

(Added my compiling results to the answer with faster `compiled.sub()` pattern)

Nick Bastin 2010-06-01 18:01:57

Answer 3

+1 A:

Try this:

s = "this is   a    string"
tokens = s.split()
neat_s = " ".join(tokens)

The string's split function will return a list of non-empty tokens split by whitespace. So if you try

"this is   a    string".split()

you will get back

['this', 'is', 'a', 'string']

The string's join function will join a list of tokens together using the string itself as a delimiter. In this case we want a space, so

" ".join("this is   a    string".split())

will split on runs of whitespace, discard the empty strings, then join the tokens again, separated by single spaces. For more about string operations, check out Python's common string function documentation.
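
Wrapped up as a reusable helper, the idea might look like the sketch below. The name `normalize_whitespace` and the extra sample strings are my own, not part of the original answer. Note that this approach also trims leading and trailing whitespace.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace to one space and trim both ends."""
    # split() with no argument splits on runs of any whitespace
    # (spaces, tabs, newlines) and never returns empty strings.
    return " ".join(text.split())

print(normalize_whitespace("this is   a    string"))        # this is a string
print(normalize_whitespace("  tabs\tand\nnewlines  too "))  # tabs and newlines too
print(repr(normalize_whitespace("   ")))                    # ''
```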

EDIT: I misunderstood what happens when you pass a delimiter to the split function. See markuz's answer for this.

Ben Gartner 2010-06-01 15:47:20

Answer 4

+1 A:

Pretty much the same answer as Ben Gartner's, but this adds the "if this is not an empty string" check.

>>> a = 'this is   a    string'
>>> ' '.join([k for k in a.split(" ") if k])
'this is a string'
>>>

If you don't check for empty strings, you'll get this:

>>> ' '.join([k for k in a.split(" ")])
'this is   a    string'
>>>
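
To see where the empty strings come from, it can help to look at the intermediate lists. This is a small illustrative sketch using the same sample string:

```python
a = 'this is   a    string'

# With an explicit separator, every single space is a delimiter, so
# runs of spaces produce empty strings between them.
print(a.split(" "))
# ['this', 'is', '', '', 'a', '', '', '', 'string']

# With no argument, a run of whitespace counts as one separator.
print(a.split())
# ['this', 'is', 'a', 'string']

# Joining the unfiltered list simply rebuilds the original string.
assert " ".join(a.split(" ")) == a
```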

markuz 2010-06-01 17:57:52

It seems that if you do a.split(), the empty strings are removed automatically. If you do a.split(" "), they are included. I'll update my answer to reflect this.

Ben Gartner 2010-06-01 18:01:12

As per the documentation: "If `sep` is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings"

Nick Bastin 2010-06-01 18:10:29
