For the hope-to-have-an-answer-in-30-seconds part of this question, I'm specifically looking for C#.

But in the general case, what's the best way to strip punctuation in any language?

I should add: Ideally, the solutions won't require you to enumerate all the possible punctuation marks.

Related: Strip Punctuation in Python

+3  A: 

The most brain-dead simple way of doing it would be to use string.Replace, once per punctuation mark.

The other way I can imagine is Regex.Replace, with a regular expression that lists all the appropriate punctuation marks.

TheTXI
+2  A: 

Assuming "best" means "simplest" I suggest using something like this:

String stripped = input.replaceAll("\\p{Punct}+", "");

This example is for Java, but all sufficiently modern Regex engines should support this (or something similar).

Edit: the Unicode-Aware version would be this:

String stripped = input.replaceAll("\\p{P}+", "");

The first version only looks at punctuation characters contained in ASCII.
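
To make the difference between the two patterns concrete, here is a minimal Java sketch. The class and method names (RegexStrip, stripAscii, stripUnicode) are made up for illustration and are not from the answer above. It assumes the default behaviour of java.util.regex, where \p{Punct} is the ASCII-only POSIX class and \p{P} covers the Unicode punctuation categories.

```java
class RegexStrip {
    // ASCII-only: without special flags, \p{Punct} is the POSIX punctuation class,
    // which also contains ASCII symbols such as $ + < = > ^ | ~
    static String stripAscii(String input) {
        return input.replaceAll("\\p{Punct}+", "");
    }

    // Unicode-aware: \p{P} matches characters in any Unicode punctuation category,
    // but not characters in the symbol categories (for example $ or +)
    static String stripUnicode(String input) {
        return input.replaceAll("\\p{P}+", "");
    }

    public static void main(String[] args) {
        // The sample string starts with an inverted question mark (U+00BF)
        String spanish = "\u00BFHola?";
        System.out.println(stripAscii(spanish));
        System.out.println(stripUnicode(spanish));
        System.out.println(stripAscii("a$b"));
        System.out.println(stripUnicode("a$b"));
    }
}
```

Note that the two versions are not subsets of each other: the ASCII version leaves the inverted question mark in place but removes the dollar sign, while the Unicode version does the opposite.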

Joachim Sauer
+8  A: 

new string(myCharCollection.Where(c => !char.IsPunctuation(c)).ToArray());

GWLlosa
Yup. It's powering the string operation I posted below.
Tom Ritter
+2  A: 

You can use the regex.replace method:

 replace(YourString, RegularExpressionWithPunctuationMarks, Empty String)

Since this returns a string, your method will look something like this:

 string s = Regex.Replace("Hello!?!?!?!", "[?!]", "");

You can replace "[?!]" with something more sophisticated if you want:

(\p{P})

This should find any punctuation.

Anton
+1  A: 

Based off GWLlosa's idea, I was able to come up with the supremely ugly, but working:

string s = "cat!";
s = s.ToCharArray().ToList<char>()
      .Where<char>(x => !char.IsPunctuation(x))
      .Aggregate<char, string>(string.Empty, new Func<string, char, string>(
             delegate(string acc, char c) { return acc + c; }));
Tom Ritter
I know, right? A hobby of mine is committing sins against code in LINQ. But please, by all means, make it better.
Tom Ritter
+1  A: 

Here's a slightly different approach using LINQ. I like AviewAnew's, but this avoids the Aggregate.

        string myStr = "Hello there..';,]';';., Get rid of Punction";

        var s = from ch in myStr
                where !Char.IsPunctuation(ch)
                select ch;

        // Build the result directly; a round trip through ASCII would mangle non-ASCII characters.
        var stringResult = new string(s.ToArray());
JoshBerke
+4  A: 

Why not simply:

string s = "sxrdct?fvzguh,bij.";
var sb = new StringBuilder();

foreach (char c in s)
{
   if (!char.IsPunctuation(c))
      sb.Append(c);
}

s = sb.ToString();

Regex is normally slower than simple char operations, and those LINQ operations look like overkill to me. You also can't use such code in .NET 2.0.
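
For the "any language" part of the question, the same loop ports fairly directly. Below is a rough Java sketch, not part of the original answer. Java has no built-in counterpart to char.IsPunctuation, so the hypothetical helper isPunctuation checks the Unicode general category via Character.getType, on the assumption that the seven punctuation categories are the ones wanted. Like the C# loop, it works one UTF-16 char at a time.

```java
class PunctuationLoop {
    // Hypothetical helper: true when c belongs to one of the Unicode punctuation categories
    static boolean isPunctuation(char c) {
        switch (Character.getType(c)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Copies every non-punctuation character into a StringBuilder, as in the C# loop
    static String strip(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!isPunctuation(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("sxrdct?fvzguh,bij."));
    }
}
```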

Hades32
A: 
#include <cctype>
#include <iostream>
#include <string>
using namespace std;

int main() {
    string strOne = "H,e.l/l!o W#o@r^l&d!!!";
    int punct_count = 0;

    cout << "before : " << strOne << endl;
    for (string::size_type ix = 0; ix < strOne.size();) {
        // Cast to unsigned char: passing a negative char to ispunct is undefined.
        if (ispunct(static_cast<unsigned char>(strOne[ix]))) {
            ++punct_count;
            strOne.erase(ix, 1);  // the next character slides into ix, so don't advance
        } else {
            ++ix;
        }
    }
    cout << "after : " << strOne << endl;
    return 0;
}
+1  A: 

Fastest and easiest to read (IMHO):

 s.StripPunctuation();

to implement:

public static class StringExtension
{
    public static string StripPunctuation(this string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (!char.IsPunctuation(c))
                sb.Append(c);
        }
        return sb.ToString();
    }
}

I tested several of the ideas posted here. Hades32's solution (the StringBuilder with a foreach loop) was the fastest.

stringbuilder with foreach ( 1059 ms )
stringbuilder with foreach wrapped in extension ( 1056 ms )
stringbuilder with for loop ( 1061 ms )
string concat with foreach ( 2254 ms )
where with new string ( 1333 ms )
where with aggregate ( 2884 ms )
compiled regex ( 2481 ms )

This isn't a very realistic benchmark. Here is the code if you'd like to improve it:

    [Test]
    public void MeasureStripPunctuationTest()
    {
        Measure("stringbuilder with foreach", s =>
        {
            var sb = new StringBuilder();
            foreach (char c in s)
            {
                if (!char.IsPunctuation(c))
                    sb.Append(c);
            }
            return sb.ToString();
        });

        // Calls the StripPunctuation extension method defined above.
        Measure("stringbuilder with foreach wrapped in extension", s => s.StripPunctuation());

        Measure("stringbuilder with for", s =>
        {
            var sb = new StringBuilder();
            for (int i = 0; i < s.Length; i++)
            {
                if (!char.IsPunctuation(s[i]))
                    sb.Append(s[i]);
            }
            return sb.ToString();
        });

        Measure("string concat with foreach", s =>
        {
            var result = "";
            foreach (char c in s)
            {
                if (!char.IsPunctuation(c))
                    result += c;
            }
            return result;
        });

        Measure("where with new string", s => new string(s.Where(item => !char.IsPunctuation(item)).ToArray()));

        Measure("where with aggregate", s => s.Where(item => !char.IsPunctuation(item))
                                             .Aggregate(string.Empty, (result, c) => result + c));

        var stripRegex = new Regex(@"\p{P}+", RegexOptions.Compiled);
        Measure("compiled regex", s => stripRegex.Replace(s, ""));
    }

    private void Measure(string name, Func<string, string> stripPunctuation)
    {
        // PerformanceTimer is a helper that is not shown in this post;
        // presumably it reports the elapsed time when disposed.
        using (new PerformanceTimer(name))
        {
            var s = "a !@#$ short >{}*' string";
            for (int i = 0; i < 1000000; i++)
            {
                var withoutPunctuation = stripPunctuation(s);
            }
        }
    }
Brian
Interesting tidbit: char.IsPunctuation does not treat the following as punctuation: $^+|<>= (Unicode classifies them as symbols, not punctuation).
Brian
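
If those characters should go too, one option is to strip the Unicode symbol categories along with the punctuation ones. The sketch below shows the idea in Java with a regex; the class and method names are invented for illustration. In C#, the corresponding per-character check would presumably combine char.IsPunctuation with char.IsSymbol.

```java
class SymbolStrip {
    // \p{P} covers the punctuation categories, \p{S} covers the symbol categories
    // (currency, math, modifier and other symbols)
    static String stripPunctuationAndSymbols(String input) {
        return input.replaceAll("[\\p{P}\\p{S}]+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripPunctuationAndSymbols("a$b^c+d|e<f>g=h!"));
    }
}
```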