If I have a string like so:

 There                   is             a lot           of           white space.

And I want to remove all the unwanted space using a Ruby regex. How do you identify whitespace and remove it so that there is still at least one space between all the words?

So far I have:

gsub(/\s{2,}/, '')

But as you can see that collapses several words into each other.

+9  A: 

You're close. After trimming whitespace off the left and right,

str.strip.gsub(/\s{2,}/, ' ')

replace any sets of multiple spaces with a single space. This, of course, assumes you're only dealing with actual spaces.

Matchu
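As a quick check of the answer above, here is a minimal sketch comparing the empty-string replacement from the question with the single-space replacement. The variable names (`str`, `squashed`, `collapsed`) are only for illustration and are not from the original posts.

```ruby
# Sample string from the question, including its leading space.
str = ' There                   is             a lot           of           white space.'

# Replacing each run of 2+ whitespace characters with nothing glues words together.
squashed = str.gsub(/\s{2,}/, '')

# Trimming the ends first, then replacing each run with one space, keeps words apart.
collapsed = str.strip.gsub(/\s{2,}/, ' ')

puts squashed.inspect   # => " Thereisa lotofwhite space."
puts collapsed.inspect  # => "There is a lot of white space."
```

Note that the single space in "a lot" and "white space" is never matched by `{2,}`, which is why those pairs survive in both versions.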
Oh man! :D so close!!!
Trip
This answer works. It is clearly superior to the trivial `s.gsub(/\s+/, ' ')`.
Steven Xu
@Steven Xu - though replacing a single space with a single space can be fun, too!
Matchu
I don't understand why this is better than using /\s+/ instead?
MikeD
@MikeD - not an expert on regex internals, but replacing a space with a space seems like a waste of the computer's time, since that regex will match single spaces, as well. Since `{2,}` isn't exactly that much more complicated, and much better specifies what you're trying to do while avoiding frivolous replacements, it seems like a better idea overall.
Matchu
"This, of course, assumes you're only dealing with actual spaces." -- A `\s` will match anything considered whitespace, including tabs/newlines/etc. If the OP is dealing with multiline content and doesn't want that then either just use a literal space, or `(?!\n)\s` to do whitespace without newlines. (With the `{2,}` or `+` or whatever added on - the `\s+` to ` ` method will replace single tabs with single space, so not a waste of time if that's desired.)
Peter Boughton
@Peter Boughton - it's true. If I were doing it from scratch, I probably would have replaced the `\s` with ` `, but since he started with `\s`, I followed suit. If he's only working with spaces, this works fine, but the formatting switch would probably be wise.
Matchu
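To illustrate the point raised in the comments about `\s` versus a literal space, here is a small sketch. The input string `text` is a made-up example mixing spaces, a tab, and a blank line; it is an assumption for demonstration, not something from the question.

```ruby
# Hypothetical input: double space, a lone tab, a blank line, and a triple space.
text = "one  two\tthree\n\nfour   five"

# \s{2,} treats the blank line ("\n\n") as a whitespace run and flattens it,
# but leaves the lone tab untouched because it is a run of length one.
any_ws = text.gsub(/\s{2,}/, ' ')

# A literal space only touches runs of real spaces.
spaces_only = text.gsub(/ {2,}/, ' ')

# (?!\n)\s matches any whitespace character except a newline; with +,
# single tabs are also normalized to a single space.
no_newlines = text.gsub(/(?:(?!\n)\s)+/, ' ')

puts any_ws.inspect       # => "one two\tthree four five"
puts spaces_only.inspect  # => "one two\tthree\n\nfour five"
puts no_newlines.inspect  # => "one two three\n\nfour five"
```

Which variant is appropriate depends on whether tabs and line breaks in the input carry meaning.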
+2  A: 

Back when I was writing Perl code all the time, I used to grab regular expressions for my string manipulations. Then one day I wanted to write some code that searched and parsed string content, and I wrote a benchmark to compare some regex searches with standard string index-based searches. The index-based search blew away the regex. It isn't as sophisticated, but sometimes we don't need sophisticated when dealing with simple problems.

Rather than instantly reaching for a regex, note that String#squeeze(' ') can compress the repeated spaces a lot faster. Consider the output of this benchmark:

#!/usr/bin/env ruby

require 'benchmark'

asdf = 'There                   is             a lot           of           white space.'

asdf.squeeze(' ') # => "There is a lot of white space."
asdf.gsub(/  +/, ' ') # => "There is a lot of white space."
asdf.gsub(/ {2,}/, ' ') # => "There is a lot of white space."
asdf.gsub(/\s\s+/, ' ') # => "There is a lot of white space."
asdf.gsub(/\s{2,}/, ' ') # => "There is a lot of white space."

n = 500000
Benchmark.bm(8) do |x|
  x.report('squeeze:') { n.times{ asdf.squeeze(' ') } }
  x.report('gsub1:') { n.times{ asdf.gsub(/  +/, ' ') } }
  x.report('gsub2:') { n.times{ asdf.gsub(/ {2,}/, ' ') } }
  x.report('gsub3:') { n.times{ asdf.gsub(/\s\s+/, ' ') } }
  x.report('gsub4:') { n.times{ asdf.gsub(/\s{2,}/, ' ') } }
end

puts
puts "long strings"
n     = 1000
str_x = 1000
Benchmark.bm(8) do |x|
  x.report('squeeze:') { n.times{(asdf * str_x).squeeze(' ') }}
  x.report('gsub1:') { n.times{(asdf * str_x).gsub(/  +/, ' ') }}
  x.report('gsub2:') { n.times{(asdf * str_x).gsub(/ {2,}/, ' ') }}
  x.report('gsub3:') { n.times{(asdf * str_x).gsub(/\s\s+/, ' ') }}
  x.report('gsub4:') { n.times{(asdf * str_x).gsub(/\s{2,}/, ' ') }}
end
# >>               user     system      total        real
# >> squeeze:  1.050000   0.000000   1.050000 (  1.055833)
# >> gsub1:    3.700000   0.020000   3.720000 (  3.731957)
# >> gsub2:    3.960000   0.010000   3.970000 (  3.980328)
# >> gsub3:    4.520000   0.020000   4.540000 (  4.549919)
# >> gsub4:    4.840000   0.010000   4.850000 (  4.860474)
# >> 
# >> long strings
# >>               user     system      total        real
# >> squeeze:  0.310000   0.180000   0.490000 (  0.485224)
# >> gsub1:    3.420000   0.130000   3.550000 (  3.554505)
# >> gsub2:    3.850000   0.120000   3.970000 (  3.974213)
# >> gsub3:    4.880000   0.130000   5.010000 (  5.015750)
# >> gsub4:    5.310000   0.150000   5.460000 (  5.461797)

The tests are based on letting squeeze(' ') or gsub() strip the duplicated spaces. As I expected, squeeze(' ') blows away the regex. Regexes using a literal space character are faster than the equivalent patterns using \s.

Of course, regexes are more flexible, but thinking about whether a regex is needed can make a big difference in the processing speed of your code.

Greg
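A short sketch of where `squeeze` stops being enough, under the assumption that the input may mix spaces and tabs. The sample strings are invented for illustration. `String#tr_s` is shown as one non-regex option for mixed whitespace; it is a swapped-in technique, not something the answer above benchmarked.

```ruby
# Hypothetical input: a double space, then a space-tab-space run.
s = "a  b \t c"

# squeeze(' ') only collapses runs of the space character itself,
# so the tab and the spaces around it are left in place.
only_spaces = s.squeeze(' ')

# tr_s translates both spaces and tabs to a space, then squeezes
# the translated regions, so the mixed run becomes one space.
mixed = s.tr_s(" \t", ' ')

# Without an argument, squeeze collapses every repeated character,
# including doubled letters, which is rarely what you want for text.
everything = 'too   many   books'.squeeze

puts only_spaces.inspect  # => "a b \t c"
puts mixed.inspect        # => "a b c"
puts everything.inspect   # => "to many boks"
```

For anything beyond fixed character sets, such as leaving newlines alone or matching Unicode whitespace, a regex is still the more flexible tool.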