views: 8592

answers: 3

How can you strip non-ASCII characters from a string? (in C#)

+24  A: 

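   // Requires: using System.Text.RegularExpressions;
   string s = "søme string";
   // Remove every character outside the ASCII range U+0000..U+007F.
   s = Regex.Replace(s, @"[^\u0000-\u007F]", "");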

philcruz
For those of us who are RegEx challenged, would you mind writing out your RegEx pattern in plain English? In other words, "the ^ does this", etc.
Metro Smurf
@Metro Smurf: The ^ is the not operator. Inside the square brackets it tells the regex to match every character that is not in the listed set, instead of every character that is. The \u####-\u#### part says which characters are in the set. \u0000-\u007F is the equivalent of the first 255 characters in UTF-8 or Unicode, which are always the ASCII characters. So you match every non-ASCII character (because of the not) and do a replace on everything that matches.
Gordon Tucker
Not 255, 127 (\u007F is code point 127, so the range covers the 128 ASCII characters). Sorry about that :)
Gordon Tucker
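
To make the explanation above concrete, here is a minimal sketch that annotates each part of the pattern and wraps it in a reusable method. The class name `AsciiRegexDemo` and the method name `StripNonAscii` are made up for illustration; only the pattern itself comes from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiRegexDemo
{
    // [ ... ]        a character class: matches one character from the set
    // ^              as the first character inside the class, negates the set
    // \u0000-\u007F  the range U+0000..U+007F, i.e. the 128 ASCII characters
    private static readonly Regex NonAscii = new Regex(@"[^\u0000-\u007F]");

    // Hypothetical helper: removes every character outside the ASCII range.
    public static string StripNonAscii(string input)
    {
        return NonAscii.Replace(input, string.Empty);
    }
}
```

Calling `AsciiRegexDemo.StripNonAscii("søme string")` should give `"sme string"`, the same result as the one-liner in the answer.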
+12  A: 

Here is a pure .NET solution that doesn't use regular expressions:

        string inputString = "Räksmörgås";
        string asAscii = Encoding.ASCII.GetString(
            Encoding.Convert(
                Encoding.UTF8,
                Encoding.GetEncoding(
                    Encoding.ASCII.EncodingName,
                    new EncoderReplacementFallback(string.Empty),
                    new DecoderExceptionFallback()
                    ),
                Encoding.UTF8.GetBytes(inputString)
            )
        );

It may look cumbersome, but it should be intuitive. It uses the .NET ASCII encoding to convert the string. UTF8 is used as the source encoding during the conversion because it can represent any of the original characters. An EncoderReplacementFallback converts every non-ASCII character to an empty string.
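
For comparison, a shortened variant of the same idea might look like the sketch below. It is not the code from this answer: it skips the UTF8 round trip and encodes the string directly with an ASCII encoding that carries the same fallbacks. The class name `AsciiEncodingDemo` and the method name `StripNonAscii` are hypothetical, and the encoding is requested by the name `"us-ascii"` rather than through `EncodingName`.

```csharp
using System;
using System.Text;

public static class AsciiEncodingDemo
{
    // Hypothetical helper: builds an ASCII encoding whose encoder fallback
    // replaces every unencodable character with an empty string.
    public static string StripNonAscii(string input)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback());

        // Encoding step: non-ASCII characters hit the fallback and are dropped.
        byte[] bytes = ascii.GetBytes(input);

        // Decoding step: every remaining byte is valid ASCII, so nothing throws.
        return ascii.GetString(bytes);
    }
}
```

With this sketch, `AsciiEncodingDemo.StripNonAscii("Räksmörgås")` should return `"Rksmrgs"`, assuming the input uses precomposed characters such as U+00E4 for ä.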

bzlm
Perfect! I'm using this to clean a string before saving it to an RTF document. Very much appreciated. Much easier to understand than the Regex version.
Nathan Prather
You really find it easier to understand? To me, all the stuff that's not really relevant (fallbacks, conversions to bytes etc) is drawing the attention away from what actually happens.
bzlm
+2  A: 

Inspired by philcruz's regular expression solution, I've made a pure LINQ solution:

    public static string PureAscii(this string source, char nil = ' ')
    {
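        // char is unsigned, so only the upper bound needs checking.
        const char max = '\u007F';
        // Replace every non-ASCII character with nil (a space by default).
        return source.Select(c => c > max ? nil : c).ToText();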
    }

    public static string ToText(this IEnumerable<char> source)
    {
        var buffer = new StringBuilder();
        foreach (var c in source)
            buffer.Append(c);
        return buffer.ToString();
    }

This is untested code.
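
Note that the code above replaces each non-ASCII character with `nil` (a space by default) rather than removing it. If the goal is to strip the characters, as the question asks, a filtering variant along these lines may be closer. This is a sketch, not part of the original answer; the class name `AsciiLinqDemo` and the method name `StripNonAscii` are hypothetical.

```csharp
using System;
using System.Linq;

public static class AsciiLinqDemo
{
    // Hypothetical extension method: keeps only characters in U+0000..U+007F;
    // everything else is dropped rather than replaced.
    public static string StripNonAscii(this string source)
    {
        return new string(source.Where(c => c <= '\u007F').ToArray());
    }
}
```

Used as `"Räksmörgås".StripNonAscii()`, this should yield `"Rksmrgs"`. Building the result with `new string(char[])` also avoids the need for a separate `ToText` helper.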

Bent Rasmussen
For those who didn't catch it, this is a C# 4.0 LINQ-based solution. :)
Ryan Riley