How can you strip non-ASCII characters from a string? (in C#)
+24
A:
string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]", ""); // needs using System.Text.RegularExpressions;
philcruz
2008-09-23 19:46:24
For those of us who are RegEx-challenged, would you mind explaining your RegEx pattern in plain English? In other words, "the ^ does this", etc.
Metro Smurf
2008-09-23 22:45:15
@Metro Smurf: Inside the square brackets, the ^ is the not operator. It tells the regex to match every character that is not in the listed range, instead of every character that is. The \u####-\u#### part gives the range: \u0000-\u007F is the first 128 characters of Unicode (code points 0 to 127), which are exactly the ASCII characters. So the pattern matches every non-ASCII character (because of the not), and the replace removes everything that matches.
Gordon Tucker
2009-12-11 21:11:26
Correction: I originally wrote 255 above; the upper bound is 127. Sorry about that :)
Gordon Tucker
2009-12-11 21:12:13
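To make the explanation above concrete, here is a minimal sketch that wraps the same pattern in a helper and runs it on a couple of inputs. The class name AsciiRegexDemo and the method name StripNonAscii are made up for illustration and are not part of the original answer. The non-ASCII letters are written as \u escapes so the example does not depend on the source file's encoding.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiRegexDemo
{
    // Hypothetical helper: removes every char outside U+0000..U+007F.
    public static string StripNonAscii(string input)
    {
        // [^...] is a negated character class, so this matches any single
        // char that is NOT in the ASCII range; each match is replaced by "".
        return Regex.Replace(input, @"[^\u0000-\u007F]", "");
    }

    public static void Main()
    {
        // "s\u00F8me string" is "søme string"
        Console.WriteLine(StripNonAscii("s\u00F8me string")); // sme string
        Console.WriteLine(StripNonAscii("plain ASCII"));      // plain ASCII
    }
}
```

Note that the regex works on UTF-16 chars, so a character outside the Basic Multilingual Plane (stored as a surrogate pair) is removed as two matches rather than one. The result is the same either way.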
+12
A:
Here is a pure .NET solution that doesn't use regular expressions:
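// Requires: using System.Text;
string inputString = "Räksmörgås";
string asAscii = Encoding.ASCII.GetString(
    Encoding.Convert(
        Encoding.UTF8,
        Encoding.GetEncoding(
            // WebName is the IANA name ("us-ascii") that GetEncoding expects;
            // EncodingName is only a human-readable display name.
            Encoding.ASCII.WebName,
            // Replace every character ASCII cannot encode with nothing.
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback()
        ),
        Encoding.UTF8.GetBytes(inputString)
    )
);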
It may look cumbersome, but it should be intuitive. It uses the .NET ASCII encoding to convert a string. UTF-8 is used during the conversion because it can represent any of the original characters. It uses an EncoderReplacementFallback to convert any non-ASCII character to an empty string.
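The fallback mechanism described above can also be exercised without the round trip through UTF-8 bytes. The following is only a sketch under that assumption, not the answer's own code: it asks for an ASCII encoding with an empty replacement fallback and encodes the string directly. The names AsciiEncodingDemo and StripNonAscii are hypothetical, and the sample word is written with \u escapes so the letters are known to be precomposed.

```csharp
using System;
using System.Text;

public static class AsciiEncodingDemo
{
    // Hypothetical helper: drops every character ASCII cannot encode.
    public static string StripNonAscii(string input)
    {
        // "us-ascii" is the IANA name of the ASCII encoding.
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            // Called for each char that cannot be encoded; emits nothing.
            new EncoderReplacementFallback(string.Empty),
            // Never triggered here, because the bytes we decode are all ASCII.
            new DecoderExceptionFallback());

        byte[] bytes = ascii.GetBytes(input); // non-ASCII chars vanish here
        return ascii.GetString(bytes);
    }

    public static void Main()
    {
        // "R\u00E4ksm\u00F6rg\u00E5s" is "Räksmörgås"
        Console.WriteLine(StripNonAscii("R\u00E4ksm\u00F6rg\u00E5s")); // Rksmrgs
    }
}
```

The stripping happens in the encoder fallback. The decoder fallback only matters when decoding bytes that are not valid ASCII, which cannot occur in this sketch.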
bzlm
2008-09-25 19:32:16
Perfect! I'm using this to clean a string before saving it to an RTF document. Very much appreciated. Much easier to understand than the Regex version.
Nathan Prather
2009-10-06 16:48:26
You really find it easier to understand? To me, all the stuff that's not really relevant (fallbacks, conversions to bytes etc) is drawing the attention away from what actually happens.
bzlm
2009-10-11 15:28:54
+2
A:
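Inspired by philcruz's regular expression solution, I've made a pure LINQ solution:
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class AsciiExtensions
{
    // Replaces every character above U+007F with 'nil' (a space by default).
    public static string PureAscii(this string source, char nil = ' ')
    {
        const char max = '\u007F';
        // A char can never be below '\u0000', so only the upper bound is checked.
        return source.Select(c => c > max ? nil : c).ToText();
    }

    // Concatenates a sequence of characters into a single string.
    public static string ToText(this IEnumerable<char> source)
    {
        var buffer = new StringBuilder();
        foreach (var c in source)
            buffer.Append(c);
        return buffer.ToString();
    }
}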
This is untested code.
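The extension method above substitutes a replacement character (a space by default) rather than removing anything. If the goal is to strip the characters, as the question asks, a filter is one possible variation. This is a sketch only: the names AsciiLinqDemo and RemoveNonAscii are hypothetical and do not come from the answer.

```csharp
using System;
using System.Linq;

public static class AsciiLinqDemo
{
    // Hypothetical variant: keeps only chars in U+0000..U+007F.
    public static string RemoveNonAscii(this string source)
    {
        // Where filters the chars; the string constructor rebuilds the text.
        return new string(source.Where(c => c <= '\u007F').ToArray());
    }

    public static void Main()
    {
        // "R\u00E4ksm\u00F6rg\u00E5s" is "Räksmörgås"
        Console.WriteLine("R\u00E4ksm\u00F6rg\u00E5s".RemoveNonAscii()); // Rksmrgs
    }
}
```

With this variant the result is shorter than the input whenever something was removed, whereas the replacement approach keeps the original length.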
Bent Rasmussen
2010-01-27 19:00:39
For those who didn't catch it, this is a C# 4.0 LINQ-based solution. :)
Ryan Riley
2010-01-28 20:49:59