ansaurus

Question

Formatting sentences in a string using C#

Answer 1

+1 A:

Do your work in a StringBuilder.
Lowercase the whole thing.
Loop through and uppercase leading chars.
Call ToString.
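
A minimal sketch of those four steps, not the answerer's own code. The class and method names are hypothetical, and the choice of '.', '!' and '?' as sentence terminators is an assumption:

```csharp
using System.Text;

public static class SentenceCaseSketch
{
    // Hypothetical helper that follows the four steps listed above.
    public static string ToSentenceCase(string text)
    {
        // Steps 1 and 2: copy the lower-cased text into a StringBuilder.
        var sb = new StringBuilder(text.ToLowerInvariant());

        // Step 3: upper-case the first non-whitespace char of each sentence.
        bool atSentenceStart = true;
        for (int i = 0; i < sb.Length; i++)
        {
            char c = sb[i];
            if (c == '.' || c == '!' || c == '?')
            {
                atSentenceStart = true;   // next visible char starts a sentence
            }
            else if (atSentenceStart && !char.IsWhiteSpace(c))
            {
                sb[i] = char.ToUpperInvariant(c);
                atSentenceStart = false;
            }
        }

        // Step 4: materialize the result.
        return sb.ToString();
    }
}
```

Because of step 2, "the code is in C#." comes back as "The code is in c#.", so drop the ToLowerInvariant call if existing capitals must survive.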

Steven Sudit 2010-01-25 21:43:45

This may have the unintended consequence of lower-casing other text that should remain in uppercase, like names, for instance.

LBushkin 2010-01-25 21:45:46

@LBushkin: Then skip step 2 if you're sure it's ok to do so.

Steven Sudit 2010-01-25 21:47:18

Since there aren't many places to capitalize, one probably could use a StringBuilder effectively, but the devil is in the details. Care to paste in code?

Hamish Grubijan 2010-01-25 21:56:23

It would be fun to write this up, but I'm not at liberty to do so at the moment. Sorry.

Steven Sudit 2010-01-25 23:19:05

Answer 2

+4 A:

You have a few different options:

Your approach of splitting the string, capitalizing and then re-joining
Using regular expressions to perform a replace of the expressions (which can be a bit tricky for case)
Write a C# iterator that iterates over each character and yields a new IEnumerable<char> with the first letter after a period in upper case. May offer benefit of a streaming solution.
Loop over each char and upper-case those that appear immediately after a period (whitespace ignored) - a StringBuilder may make this easier.

The code below takes the last approach, looping over the characters with a StringBuilder:

// requires: using System.Text;
public static string ToSentenceCase( string someString )
{
  var sb = new StringBuilder( someString.Length );
  bool wasPeriodLastSeen = true; // We want first letter to be capitalized
  foreach( var c in someString )
  {
      // char.IsWhiteSpace and char.ToUpper are static methods on System.Char
      if( wasPeriodLastSeen && !char.IsWhiteSpace( c ) )
      {
          sb.Append( char.ToUpper( c ) );
          wasPeriodLastSeen = false;
      }
      else
      {
          if( c == '.' )  // you may want to expand this to other punctuation
              wasPeriodLastSeen = true;
          sb.Append( c );
      }
  }

  return sb.ToString();
}

LBushkin 2010-01-25 21:45:09

Is performance a consideration?

Steven Sudit 2010-01-25 21:48:06

LBushkin: ToTitleCase will capitalize first letter of every word of the string. In my case the output will be "This Is Some Code. The Code Is In C#".

Yogendra 2010-01-25 21:51:16

Steven : The performance is an issue because the method is called in a loop.

Yogendra 2010-01-25 21:52:02

You are correct, I reviewed the documentation and it is per-word. I will update my post to reflect a correct implementation.

LBushkin 2010-01-25 21:54:16

Answer 3

+2 A:

I don't know why, but I decided to give yield return a try, based on what LBushkin had suggested. Just for fun.

// requires: using System.Collections.Generic;
static IEnumerable<char> CapitalLetters(string sentence)
{
    // capitalize the first letter
    bool capitalize = true;
    for (int i = 0; i < sentence.Length; i++)
    {
        char current = sentence[i];
        yield return capitalize ? Char.ToUpper(current) : current;

        // leading whitespace keeps the "capitalize next letter" state
        if (Char.IsWhiteSpace(current) && capitalize)
            continue;

        // capitalize again after sentence-ending punctuation
        capitalize = (current == '.' || current == '!'); //etc
    }
}

To use it (ToArray needs using System.Linq):

string sentence = new String(CapitalLetters("this is some code. the code is in C#.").ToArray());

Stan R. 2010-01-25 22:15:18

Answer 4

+3 A:

In my opinion, when it comes to potentially complex rules-based string matching and replacing - you can't get much better than a Regex-based solution (despite the fact that they are so hard to read!). This offers the best performance and memory efficiency, in my opinion - you'll be surprised at just how fast this'll be.

I'd use the Regex.Replace overload that accepts an input string, regex pattern and a MatchEvaluator delegate. A MatchEvaluator is a function that accepts a Match object as input and returns a string replacement.

Here's the code:

public static string Capitalise(string input)
{
  // upper-case any a-z letter at the start of the string or after . ; : (plus optional whitespace)
  return Regex.Replace(input, @"(?<=(^|[.;:])\s*)[a-z]",
    match => match.Value.ToUpper());
}

The regex uses the (?<=) construct (zero-width positive lookbehind) to restrict matches to a-z characters preceded by the start of the string or by one of the punctuation marks you want. You can add extra ones to the [.;:] bit (e.g. [.;:?"] to add the ? and " characters; inside a verbatim @"..." string the quote is written as "").

This means, also, that your MatchEvaluator doesn't have to do any unnecessary string joining (which you want to avoid for performance reasons).

All the other stuff mentioned by one of the other answerers about using the RegexOptions.Compiled is also relevant from a performance point of view. The static Regex.Replace method does offer very similar performance benefits, though (there's just an additional dictionary lookup).

Like I say - I'll be surprised if any of the other non-regex solutions here will work better and be as fast.

EDIT

Have put this solution up against Ahmad's as he quite rightly pointed out that a look-around might be less efficient than doing it his way.

Here's the crude benchmark I did:

public string LowerCaseLipsum
{
  get
  {
    //went to lipsum.com and generated 10 paragraphs of lipsum
    //which I then initialised into the backing field with @"[lipsumtext]".ToLower()
    return _lowerCaseLipsum;
  }
}

[TestMethod]
public void CapitaliseAhmadsWay()
{
  List<string> results = new List<string>();
  DateTime start = DateTime.Now;
  Regex r = new Regex(@"(^|\p{P}\s+)(\w+)", RegexOptions.Compiled);
  for (int f = 0; f < 1000; f++)
  {
    results.Add(r.Replace(LowerCaseLipsum, m => m.Groups[1].Value
                     + m.Groups[2].Value.Substring(0, 1).ToUpper()
                     + m.Groups[2].Value.Substring(1)));
  }
  TimeSpan duration = DateTime.Now - start;
  Console.WriteLine("Operation took {0} seconds", duration.TotalSeconds);
}

[TestMethod]
public void CapitaliseLookAroundWay()
{
  List<string> results = new List<string>();
  DateTime start = DateTime.Now;
  Regex r = new Regex(@"(?<=(^|[.;:])\s*)[a-z]", RegexOptions.Compiled);
  for (int f = 0; f < 1000; f++)
  {
    results.Add(r.Replace(LowerCaseLipsum, m => m.Value.ToUpper()));
  }
  TimeSpan duration = DateTime.Now - start;
  Console.WriteLine("Operation took {0} seconds", duration.TotalSeconds);
}

In a release build, my solution was about 12% faster than Ahmad's (1.48 seconds as opposed to 1.68 seconds).

Interestingly, however, if it was done through the static Regex.Replace method, both were about 80% slower, and my solution was slower than Ahmad's.
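
Given that result, one way to get the compiled-instance behaviour when the method is called in a loop is to build the Regex once and keep it in a static field. This is only a sketch under that assumption. The class and field names are hypothetical, and the pattern is the one from this answer:

```csharp
using System.Text.RegularExpressions;

public static class SentenceCapitaliser
{
    // Constructed once and reused by every call; Regex instances are
    // immutable, so sharing one across calls is safe.
    private static readonly Regex SentenceStart =
        new Regex(@"(?<=(^|[.;:])\s*)[a-z]", RegexOptions.Compiled);

    public static string Capitalise(string input)
    {
        // The lookbehind is zero-width, so the match is just the letter itself.
        return SentenceStart.Replace(input, m => m.Value.ToUpper());
    }
}
```

Calling SentenceCapitaliser.Capitalise(text) inside the loop then pays the compilation cost only on first use.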

Andras Zoltan 2010-01-25 22:16:50

My suspicion here is that, even with a precompiled regexp, it's not going to be as fast as StringBuilder.

Steven Sudit 2010-01-25 23:19:57

Regex uses StringBuilder internally anyway - but I guess the only way to find out is to benchmark the different solutions. Until we do that anything else is pure conjecture :)

Andras Zoltan 2010-01-26 08:15:19

Andras : Thanks for the answer. It will not work in case we have punctuations like '?'. I guess Ahmad's answer below comes close. I am yet to fully evaluate it.

Yogendra 2010-01-26 14:29:27

Yogendra : Sure it will - just add the question mark to the [.;:] bit in the regex - i.e. change it to '[.;:?]'. Indeed, you can add all the individual punctuation marks you need to trap inside those two square brackets. I've edited the answer as well - because the '.' doesn't need the leading '\' when inside the [].

Andras Zoltan 2010-01-26 15:21:12

I should say that you can use this regex the same as Ahmad's - just substitute the [*punctuation_characters*] block with the punctuation character class. Outside of that, the structure of this regex is better because it doesn't require 'a + b + c' and Substring operations in the MatchEvaluator and therefore will be much faster.

Andras Zoltan 2010-01-26 15:26:35

@Andras: did you benchmark them? :) I wonder how they would compare given the look-around usage. Nice to see another regex solution though!

Ahmad Mageed 2010-01-26 15:42:01

+1 sorry I didn't have time to set up a benchmark myself right now, but thanks for going the extra mile.

Ahmad Mageed 2010-01-26 16:26:47

Thanks, too - for making me think more about the look-around solution; also for now having discovered that the static Regex.Replace - despite being compiled and cached - is significantly slower compared to a compiled Regex instance!

Andras Zoltan 2010-01-26 16:48:25

Answer 5

+2 A:

Here's a regex solution that uses the punctuation category to avoid having to specify .!?" etc. although you should certainly check if it covers your needs or set them explicitly. Read up on the "P" category under the "Supported Unicode General Categories" section located on the MSDN Character Classes page.

string input = @"this is some code. the code is in C#? it's great! In ""quotes."" after quotes.";
string pattern = @"(^|\p{P}\s+)(\w+)";

// compiled for performance (might want to benchmark it for your loop)
Regex rx = new Regex(pattern, RegexOptions.Compiled);

string result = rx.Replace(input, m => m.Groups[1].Value
                                + m.Groups[2].Value.Substring(0, 1).ToUpper()
                                + m.Groups[2].Value.Substring(1));

If you decide not to use the \p{P} class you would have to specify the characters yourself, similar to:

string pattern = @"(^|[.?!""]\s+)(\w+)";

EDIT: below is an updated example to demonstrate 3 patterns. The first shows how all punctuations affect casing. The second shows how to pick and choose certain punctuation categories by using class subtraction. It uses all punctuations while removing specific punctuation groups. The third is similar to the 2nd but using different groups.

The MSDN link doesn't spell out what some of the punctuation categories refer to, so here's a breakdown:

P: all punctuations (comprises all of the categories below)
Pc: underscore _
Pd: dash -
Ps: open parenthesis, brackets and braces ( [ {
Pe: closing parenthesis, brackets and braces ) ] }
Pi: initial single/double quotes (MSDN says it "may behave like Ps/Pe depending on usage")
Pf: final single/double quotes (MSDN Pi note applies)
Po: other punctuation such as commas, colons, semi-colons and slashes: , : ; \ / (the period, ! and ? also fall in this category)
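
If it is unclear which category a given character falls into, char.GetUnicodeCategory can be asked directly. The helper below is a hypothetical sketch that maps the UnicodeCategory enum values onto the short names used in the list above:

```csharp
using System.Globalization;

public static class PunctuationCategories
{
    // Hypothetical helper: returns the regex category name (Pc, Pd, ...) for a
    // punctuation character, or null when the character is not punctuation.
    public static string RegexCategory(char c)
    {
        switch (char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.ConnectorPunctuation:    return "Pc";
            case UnicodeCategory.DashPunctuation:         return "Pd";
            case UnicodeCategory.OpenPunctuation:         return "Ps";
            case UnicodeCategory.ClosePunctuation:        return "Pe";
            case UnicodeCategory.InitialQuotePunctuation: return "Pi";
            case UnicodeCategory.FinalQuotePunctuation:   return "Pf";
            case UnicodeCategory.OtherPunctuation:        return "Po";
            default:                                      return null;
        }
    }
}
```

For example, RegexCategory('_') gives "Pc" and RegexCategory('.') gives "Po", which is why a pattern that subtracts Po no longer treats a period as a sentence boundary.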

Carefully compare how the results are affected by these groups. This should grant you a great degree of flexibility. If this doesn't seem desirable then you may use specific characters in a character class as shown earlier.

string input = @"foo ( parens ) bar { braces } foo [ brackets ] bar. single ' quote & "" double "" quote.
dash - test. Connector _ test. Comma, test. Semicolon; test. Colon: test. Slash / test. Slash \ test.";

string[] patterns = { 
    @"(^|\p{P}\s+)(\w+)", // all punctuation chars
    @"(^|[\p{P}-[\p{Pc}\p{Pd}\p{Ps}\p{Pe}]]\s+)(\w+)", // all punctuation chars except Pc/Pd/Ps/Pe
    @"(^|[\p{P}-[\p{Po}]]\s+)(\w+)" // all punctuation chars except Po
};

// static Regex.Replace is used here to keep the example short (no RegexOptions.Compiled)
foreach (string pattern in patterns)
{
    Console.WriteLine("*** Current pattern: {0}", pattern);
    string result = Regex.Replace(input, pattern,
                            m => m.Groups[1].Value
                                 + m.Groups[2].Value.Substring(0, 1).ToUpper()
                                 + m.Groups[2].Value.Substring(1));
    Console.WriteLine(result);
    Console.WriteLine();
}

Notice that "Dash" is not capitalized using the last pattern and it's on a new line. One way to make it capitalized is to use the RegexOptions.Multiline option. Try the above snippet with that to see if it meets your desired result.

Also, for the sake of example, I didn't use RegexOptions.Compiled in the above loop. To use both options OR them together: RegexOptions.Compiled | RegexOptions.Multiline.
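
As a rough sketch of combining the two options, the last pattern from the array above can be held in a Regex instance built with both flags. The class and field names here are hypothetical:

```csharp
using System.Text.RegularExpressions;

public static class MultilineSketch
{
    // With Multiline, ^ also matches at the start of each line, so a word that
    // begins a new line is capitalized even though the period before the line
    // break belongs to the excluded Po category.
    private static readonly Regex Rx = new Regex(
        @"(^|[\p{P}-[\p{Po}]]\s+)(\w+)",
        RegexOptions.Compiled | RegexOptions.Multiline);

    public static string Capitalise(string input)
    {
        // Group 1 is the boundary (start of line or punctuation plus whitespace),
        // group 2 is the word whose first letter gets upper-cased.
        return Rx.Replace(input, m => m.Groups[1].Value
            + m.Groups[2].Value.Substring(0, 1).ToUpper()
            + m.Groups[2].Value.Substring(1));
    }
}
```

Note that ^ only sees a new line after a \n character, so input with other line-break conventions may need normalizing first.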

Ahmad Mageed 2010-01-25 22:24:07

+1 - Good catch with the punctuation character class - but having all those string additions and substrings in the MatchEvaluator is not getting best out of the StringBuilder that the Regex.Replace operation is going to be using. My solution uses zero-width captures for the bits that identify the 'first' character, meaning the OP simply returns match.Value.ToUpper().

Andras Zoltan 2010-01-26 15:30:24

Ahmad - as you suggested - have added a crude, but fair I think, benchmark to my answer. My one is faster when both regexes are compiled into Regex instances with RegexOptions.Compiled. Your one is faster when using the Regex.Replace static method - but the performance hit on both when doing it that way is also quite heavy (I'll never use the static method again!) :)

Andras Zoltan 2010-01-26 16:10:58

@Andras thanks for the followup!

Ahmad Mageed 2010-01-26 16:25:25
