Hi, all.

I need to collapse every run of whitespace in a document into a single space. It doesn't matter whether the run is made of spaces, tabs or newlines; any combination of any kind of whitespace needs to be reduced to a single space.

Let's say we have the string "Hello,\t \t\n  \t    \n world" (where \t and \n represent tabs and newlines respectively); I'd need it to become "Hello, world".

I'm so completely bewildered by regex more generally that I ended up just asking.

Considerations:

  • I have no control over the document, since it could be any document on the internet.

  • I'm using C#, so if anyone knows how to do this in C# specifically, that would be even more awesome.

  • I don't really have to use regex (before someone asks), but I figured it's probably the best approach, since regex is designed for this sort of thing and my own strpos/str_replace/substr soup would probably not perform as well. Performance matters here: I'm looking for an efficient way to do this to any arbitrary text file on the internet (remember, I can't predict the size!).

Thanks in advance! - Helgi

+3  A: 

You may find this SO answer useful:

http://stackoverflow.com/questions/206717/how-do-i-replace-multiple-spaces-with-a-single-space-in-c

Adapting the answer to replace tabs and newlines as well is relatively straightforward:

RegexOptions options = RegexOptions.None;
Regex regex = new Regex(@"\s+", options);     
tempo = regex.Replace(tempo, @" ");
LBushkin
Check out the answer by Matt in the above link; the accepted solution looks like it replaces only the space character, not newlines and tabs. The '\s' in the pattern is what tells it to match any whitespace character.
TLiebe
I looked before I asked, I swear, I looked! Thanks a bunch, that helped me out. :)
Helgi Hrafn Gunnarsson
TLiebe: Yeah, I did, thanks.
Helgi Hrafn Gunnarsson
+11  A: 
newString = Regex.Replace(oldString, @"\s+", " ");

The "\s" is a regex character class for any whitespace character, and the + means "one or more". Each run of whitespace that matches is replaced with a single space character.
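
In case it helps to see it against the string from the question, here is a minimal sketch. The class name WhitespaceRegexDemo and the method name Collapse are made up for illustration; only Regex, RegexOptions.Compiled and Replace come from System.Text.RegularExpressions. Keeping the Regex in a static field is just one reasonable choice when the same pattern is applied to many documents.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceRegexDemo
{
    // Built once and reused for every call. RegexOptions.Compiled trades a
    // higher construction cost for faster matching on repeated use.
    private static readonly Regex WhitespaceRun =
        new Regex(@"\s+", RegexOptions.Compiled);

    // Hypothetical helper: replaces each run of whitespace with one space.
    public static string Collapse(string input)
    {
        return WhitespaceRun.Replace(input, " ");
    }
}
```

Calling WhitespaceRegexDemo.Collapse("Hello,\t \t\n  \t    \n world") gives "Hello, world". Leading and trailing runs are also reduced to one space rather than removed, so "  a  " becomes " a ".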

womp
A: 
I would suggest you replace your chomp with
 $line =~ s/\s+$//;

which will strip off all trailing whitespace - tabs, spaces, newlines and carriage returns as well.

Taken from: http://www.wellho.net/forum/Perl-Programming/New-line-characters-beware.html

I'm aware it's Perl, but it should be helpful enough for you.
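
For what it's worth, a rough C# counterpart of that Perl line might look like the sketch below. The class and method names are invented for illustration. Keep in mind that, like the Perl version, this only strips trailing whitespace and does not collapse runs in the middle of the string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingWhitespaceDemo
{
    // Regex form, close to the Perl substitution s/\s+$//.
    public static string StripTrailingRegex(string line)
    {
        return Regex.Replace(line, @"\s+$", "");
    }

    // Built-in form: TrimEnd() with no arguments removes trailing whitespace.
    public static string StripTrailingTrimEnd(string line)
    {
        return line.TrimEnd();
    }
}
```

For ordinary spaces, tabs, carriage returns and newlines both methods return the same result, e.g. "abc \t\r\n" becomes "abc".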

Woot4Moo
A: 

As someone who sympathizes with Jamie Zawinski's position on Regex, I'll offer an alternative for what it's worth.

Not wanting to be religious about it, but I'd say it's faster than Regex, though whether you'll ever be processing strings long enough to see the difference is another matter.

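    // Requires: using System.Text;
    public static string CompressWhiteSpace(string value)
    {
        if (value == null) return null;

        // True while we are inside a run of whitespace that has not been emitted yet.
        bool inWhiteSpace = false;
        StringBuilder builder = new StringBuilder(value.Length);

        foreach (char c in value)
        {
            if (Char.IsWhiteSpace(c))
            {
                inWhiteSpace = true;
            }
            else
            {
                // Emit one space for the whole preceding run, then the character itself.
                if (inWhiteSpace) builder.Append(' ');
                inWhiteSpace = false;
                builder.Append(c);
            }
        }
        // Note: a trailing whitespace run is dropped, whereas Regex.Replace would keep one space.
        return builder.ToString();
    }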
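
One thing worth checking before swapping this in for the regex: the two approaches differ at the end of the string. The sketch below puts them side by side; EdgeCaseDemo and RegexCollapse are names made up for the comparison, and CompressWhiteSpace is copied from above only so the snippet compiles on its own.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class EdgeCaseDemo
{
    // Same logic as the loop above, repeated here so the sketch is self-contained.
    public static string CompressWhiteSpace(string value)
    {
        if (value == null) return null;

        bool inWhiteSpace = false;
        StringBuilder builder = new StringBuilder(value.Length);

        foreach (char c in value)
        {
            if (Char.IsWhiteSpace(c))
            {
                inWhiteSpace = true;
            }
            else
            {
                if (inWhiteSpace) builder.Append(' ');
                inWhiteSpace = false;
                builder.Append(c);
            }
        }
        return builder.ToString();
    }

    // The regex approach from the other answers, for comparison.
    public static string RegexCollapse(string value)
    {
        return Regex.Replace(value, @"\s+", " ");
    }
}
```

For the input " a \t b \n", CompressWhiteSpace returns " a b" (the trailing run disappears), while RegexCollapse returns " a b " (the trailing run becomes one space). Whether that matters depends on what you do with the result afterwards.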
Joe
A: 

Actually I think an extension method would probably be more efficient as you don't have the state machine overhead of the regex. Essentially, it becomes a very specialized pattern matcher.

// Extension method: must be declared inside a static class (needs using System.Text;).
public static string Collapse( this string source )
{
    if (string.IsNullOrEmpty( source ))
    {
        return source;
    }

    StringBuilder builder = new StringBuilder();
    bool inWhiteSpace = false;
    bool sawFirst = false;
    foreach (var c in source)
    {
        if (char.IsWhiteSpace(c))
        {
            inWhiteSpace = true;
        }
        else
        {
            // only output a whitespace if followed by non-whitespace
            // except at the beginning of the string
            if (inWhiteSpace && sawFirst)
            {
                builder.Append(" ");
            }
            inWhiteSpace = false;
            sawFirst = true;
            builder.Append(c);
        }
    }
    return builder.ToString();
}
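
Since the question mentions documents of unpredictable size, the same character loop can also be run over a stream instead of a string held in memory. This is only a sketch under that assumption: StreamingCollapse is a hypothetical helper, not part of the answer above, and unlike the Collapse extension method it keeps a single space for leading and trailing runs (matching what the regex would do).

```csharp
using System;
using System.IO;

public static class StreamingCollapse
{
    // Hypothetical helper: copies reader to writer one character at a time,
    // writing a single space for each run of whitespace.
    public static void Collapse(TextReader reader, TextWriter writer)
    {
        bool inWhiteSpace = false;
        int next;
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            if (char.IsWhiteSpace(c))
            {
                // Write the space only for the first character of the run.
                if (!inWhiteSpace) writer.Write(' ');
                inWhiteSpace = true;
            }
            else
            {
                inWhiteSpace = false;
                writer.Write(c);
            }
        }
    }
}
```

With a StringReader over "Hello,\t \t\n  \t    \n world" and a StringWriter as the target, the writer ends up holding "Hello, world". For real files, a StreamReader and StreamWriter can be passed in the same way.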
tvanfosson