Let's say I have a string such as:

"Hello     how are   you           doing?"

I would like a function that turns multiple spaces into one space.

So I would get:

"Hello how are you doing?"

I know I could use regex or call

string s = "Hello     how are   you           doing?".Replace("  "," ");

But I would have to call it multiple times to make sure all sequential whitespaces are replaced with only one.

Is there already a built-in method for this in C#?

+23  A: 

string cleanedString = System.Text.RegularExpressions.Regex.Replace(s,@"\s+"," ");

Tim Hoolihan
Using a regular expression introduces a lot of overhead that isn't necessary.
Scott Dorman
IMO, avoiding regex if you're comfortable with them is premature optimization.
Tim Hoolihan
If your application isn't time critical, it can afford the 1 microsecond of processing overhead.
Daniel
Note that '\s' not only replaces white spaces, but also new line characters.
Bart Kiers
good catch, if you just want spaces switch the pattern to "[ ]+"
Tim Hoolihan
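To make the difference between the two patterns in the comments above concrete, here is a minimal sketch. The class and method names are illustrative only and do not come from the thread. `\s+` folds tabs and newlines into the single replacement space, while `[ ]+` touches only runs of the space character.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper names, for illustration only.
public static class SpaceCollapser
{
    // Collapses any run of whitespace (spaces, tabs, newlines) to one space.
    public static string CollapseWhitespace(string s)
    {
        return Regex.Replace(s, @"\s+", " ");
    }

    // Collapses runs of the space character only; tabs and newlines survive.
    public static string CollapseSpaces(string s)
    {
        return Regex.Replace(s, "[ ]+", " ");
    }
}
```

Which of the two is appropriate depends on whether line breaks in the input are meaningful.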
+3  A: 

A regular expression would be the easiest way. If you write the regex the correct way, you won't need multiple calls.

Change it to this:

s = System.Text.RegularExpressions.Regex.Replace(s, @"\s{2,}", " ");
Brandon
+2  A: 
Regex regex = new Regex(@"\s+");
string outputString = regex.Replace(inputString, " ");
Michael D.
A: 

There is no built-in way to do this. You can try this:

private static readonly char[] whitespace = new char[] { ' ', '\n', '\t', '\r', '\f', '\v' };
public static string Normalize(string source)
{
   return String.Join(" ", source.Split(whitespace, StringSplitOptions.RemoveEmptyEntries));
}

This will remove leading and trailing whitespace as well as collapse any internal whitespace to a single space character. If you really only want to collapse spaces, then the solutions using a regular expression are better; otherwise this solution is better. (See the analysis done by Jon Skeet.)

Scott Dorman
If the regular expression is compiled and cached, I'm not sure that has more overhead than splitting and joining, which could create *loads* of intermediate garbage strings. Have you done careful benchmarks of both approaches before assuming that your way is faster?
Jon Skeet
whitespace is undeclared here
Tim Hoolihan
Speaking of overhead, why on earth are you calling `source.ToCharArray()` and then throwing away the result?
Jon Skeet
*And* calling `ToCharArray()` on the result of string.Join, only to create a new string... wow, for that to be in a post complaining of overhead is just remarkable. -1.
Jon Skeet
Oh, and assuming `whitespace` is `new char[] { ' ' }`, this will give the wrong result if the input string starts or ends with a space.
Jon Skeet
No, I've not done benchmarks, but I know there is higher overhead for RegEx compared to the Split and Join. From what it looks like Split and Join either use character buffers, treat the string as an array of characters or go through unsafe code to do pointer manipulations.
Scott Dorman
grrr...copied from a larger example...updated to reflect the comments.
Scott Dorman
"Knowing" there's a higher overhead for regexes isn't nearly as good as proving it with benchmarks. I'm running benchmarks now, and will post results soon.
Jon Skeet
+3  A: 

While the existing answers are fine, I'd like to point out one approach which doesn't work:

public static string DontUseThisToCollapseSpaces(string text)
{
    while (text.IndexOf("  ") != -1)
    {
        text = text.Replace("  ", " ");
    }
    return text;
}

This can loop forever. Anyone care to guess why? (I only came across this when it was asked as a newsgroup question a few years ago... someone actually ran into it as a problem.)

Jon Skeet
I think I remember this question being asked a while back on SO. IndexOf ignores certain characters that Replace doesn't. So the double space was always there, just never removed.
Brandon
It is because IndexOf ignores some Unicode characters, the specific culprit in this case being some Asian character, IIRC. Hmm, zero-width non-joiner according to Google.
Hawker
And Hawker gets the prize :)
Jon Skeet
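One way to sidestep the mismatch, sketched here on the assumption that the loop itself is kept: make the search ordinal so that IndexOf and Replace agree on what counts as a double space. `string.Replace(string, string)` compares ordinally, whereas `IndexOf(string)` without a comparison argument is culture-sensitive. The class and method names are made up for this example.

```csharp
using System;

public static class OrdinalCollapser
{
    public static string CollapseSpacesOrdinal(string text)
    {
        // Ordinal search: ignorable characters such as U+200C are not skipped,
        // so this check finds exactly what Replace is able to remove.
        while (text.IndexOf("  ", StringComparison.Ordinal) != -1)
        {
            text = text.Replace("  ", " ");
        }
        return text;
    }
}
```

With an ordinal search, a string such as "a \u200C b" contains no double space at all, so the loop exits immediately and the text is returned unchanged.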
+2  A: 

As already pointed out, this is easily done with a regular expression. I'll just add that you might want to add a .Trim() to that to get rid of leading/trailing whitespace.
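
A minimal sketch of that combination, assuming the `\s+` pattern from the earlier answers (the class and method names are illustrative):

```csharp
using System.Text.RegularExpressions;

public static class TrimmedCollapser
{
    // Collapse whitespace runs first, then strip whatever single
    // space is left at either end of the string.
    public static string CollapseAndTrim(string s)
    {
        return Regex.Replace(s, @"\s+", " ").Trim();
    }
}
```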

MAK
+6  A: 

This question isn't as simple as other posters have made it out to be (and as I originally believed it to be), because the question isn't quite as precise as it needs to be.

There's a difference between "space" and "whitespace". If you only mean spaces, then you should use a regex of " {2,}". If you mean any whitespace, that's a different matter. Should all whitespace be converted to spaces? What should happen to space at the start and end?

For the benchmark below, I've assumed that you only care about spaces, and you don't want to do anything to single spaces, even at the start and end.

Note that correctness is almost always more important than performance. The fact that the Split/Join solution removes any leading/trailing whitespace (even just single spaces) is incorrect as far as your specified requirements (which may be incomplete, of course).
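
To illustrate the difference in edge handling, here is a small sketch with hypothetical names: the " {2,}" regex keeps a single leading or trailing space, while Split/Join drops it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EdgeHandling
{
    public static string WithRegex(string input)
    {
        // Only runs of two or more spaces are touched.
        return Regex.Replace(input, " {2,}", " ");
    }

    public static string WithSplitAndJoin(string input)
    {
        // Empty entries at the start and end are discarded,
        // so spaces at the edges of the input vanish.
        string[] parts = input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", parts);
    }
}
```

For the input " a  b ", the regex version returns " a b " and the Split/Join version returns "a b".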

The benchmark uses MiniBench.

using System;
using System.Text.RegularExpressions;
using MiniBench;

internal class Program
{
    public static void Main(string[] args)
    {

        int size = int.Parse(args[0]);
        int gapBetweenExtraSpaces = int.Parse(args[1]);

        char[] chars = new char[size];
        for (int i=0; i < size/2; i += 2)
        {
            // Make sure there actually *is* something to do
            chars[i*2] = (i % gapBetweenExtraSpaces == 1) ? ' ' : 'x';
            chars[i*2 + 1] = ' ';
        }
        // Just to make sure we don't have a \0 at the end
        // for odd sizes
        chars[chars.Length-1] = 'y';

        string bigString = new string(chars);
        // Assume that one form works :)
        string normalized = NormalizeWithSplitAndJoin(bigString);


        var suite = new TestSuite<string, string>("Normalize")
            .Plus(NormalizeWithSplitAndJoin)
            .Plus(NormalizeWithRegex)
            .RunTests(bigString, normalized);

        suite.Display(ResultColumns.All, suite.FindBest());
    }

    private static readonly Regex MultipleSpaces = 
        new Regex(@" {2,}", RegexOptions.Compiled);

    static string NormalizeWithRegex(string input)
    {
        return MultipleSpaces.Replace(input, " ");
    }

    // Guessing as the post doesn't specify what to use
    private static readonly char[] Whitespace =
        new char[] { ' ' };

    static string NormalizeWithSplitAndJoin(string input)
    {
        string[] split = input.Split
            (Whitespace, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", split);
    }
}

A few test runs:

c:\Users\Jon\Test>test 1000 50
============ Normalize ============
NormalizeWithSplitAndJoin  1159091 0:30.258 22.93
NormalizeWithRegex        26378882 0:30.025  1.00

c:\Users\Jon\Test>test 1000 5
============ Normalize ============
NormalizeWithSplitAndJoin  947540 0:30.013 1.07
NormalizeWithRegex        1003862 0:29.610 1.00


c:\Users\Jon\Test>test 1000 1001
============ Normalize ============
NormalizeWithSplitAndJoin  1156299 0:29.898 21.99
NormalizeWithRegex        23243802 0:27.335  1.00

Here the first number is the number of iterations, the second is the time taken, and the third is a scaled score with 1.0 being the best.

That shows that in at least some cases (including this one) a regular expression can outperform the Split/Join solution, sometimes by a very significant margin.

However, if you change to an "all whitespace" requirement, then Split/Join does appear to win. As is so often the case, the devil is in the detail...

Jon Skeet
Great analysis. So it appears that we were both correct to varying degrees. The code in my answer was taken from a larger function which has the ability to normalize all whitespace and/or control characters from within a string and from the beginning and end.
Scott Dorman
With just the whitespace characters you specified, in most of my tests the regex and Split/Join were about equal - S/J had a tiny, tiny benefit, at the cost of correctness and complexity. For those reasons, I'd normally prefer the regex. Don't get me wrong - I'm far from a regex fanboy, but I don't like writing more complex code for the sake of performance without really testing the performance first.
Jon Skeet
A: 

Smallest solution:

var regExp=/\s+/g, newString=oldString.replace(regExp,' ');

sycoted
+1  A: 

I'm sharing what I use, because it appears I've come up with something different. I've been using this for a while and it is fast enough for me. I'm not sure how it stacks up against the others. I use it in a delimited file writer and run large datatables through it one field at a time.

    public static string NormalizeWhiteSpace(string S)
    {
        string s = S.Trim();
        bool iswhite = false;
        // Pre-size the builder; the result is never longer than the trimmed input.
        StringBuilder sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            if(Char.IsWhiteSpace(c))
            {
                if (iswhite)
                {
                    //Continuing whitespace ignore it.
                    continue;
                }
                else
                {
                    //New WhiteSpace

                    //Replace whitespace with a single space.
                    sb.Append(" ");
                    //Set iswhite to True and any following whitespace will be ignored
                    iswhite = true;
                }  
            }
            else
            {
                sb.Append(c);
                //reset iswhite to false
                iswhite = false;
            }
        }
        return sb.ToString();
    }