Does anyone have a trusted Proper Case or PCase algorithm (similar to a UCase or Upper)? I'm looking for something that takes a value such as "GEORGE BURDELL" or "george burdell" and turns it into "George Burdell".

I have a simple one that handles the simple cases. The ideal would be to have something that can handle things such as "O'REILLY" and turn it into "O'Reilly", but I know that is tougher.

I'm mainly focused on the English language if that simplifies things.

Update: I'm using C# as the language, but I can convert from almost anything (assuming like functionality exists).

I agree that the McDonald's scenario is a tough one. I meant to mention it along with my O'Reilly example, but left it out of the original post.

+1  A: 

What programming language do you use? Many languages allow callback functions for regular expression matches. These can be used to propercase the match easily. The regular expression is quite simple; you just have to match all word characters, like so:

/\w+/

Alternatively, you can capture the first character as a separate group:

/(\w)(\w*)/

Now you can access the first character and successive characters in the match separately. The callback function can then simply return a concatenation of the hits. In pseudo Python (I don't actually know Python):

def make_proper(match):
    # Upper-case the first character, lower-case the rest of the word
    return match.group(1).upper() + match.group(2).lower()

Incidentally, this would also handle the case of “O'Reilly” because “O” and “Reilly” would be matched separately and both propercased. There are, however, other special cases that the algorithm does not handle well, e.g. “McDonald's” or, more generally, any word containing an apostrophe. The algorithm would produce “Mcdonald'S” for the latter. Special handling for apostrophes could be implemented, but that would interfere with the first case. A theoretically perfect solution isn't possible. In practice, it might help to consider the length of the part after the apostrophe.
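
A minimal runnable sketch of the callback approach, this time in real Python. It assumes the rest of each word should be lower-cased (which the “Mcdonald'S” result implies); the name `proper_case` is made up for illustration.

```python
import re

def proper_case(text):
    # Callback: group 1 is the first word character, group 2 the remainder
    def make_proper(match):
        return match.group(1).upper() + match.group(2).lower()
    # Every run of word characters is passed to the callback
    return re.sub(r"(\w)(\w*)", make_proper, text)

print(proper_case("GEORGE BURDELL"))  # George Burdell
print(proper_case("o'reilly"))        # O'Reilly
print(proper_case("McDonald's"))      # Mcdonald'S
```

The last line shows the limitation described above: the apostrophe splits the word, so the trailing “s” is treated as a word of its own.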

Konrad Rudolph
A: 

A simple way to capitalise the first letter of each word (separated by a space):

$words = explode(" ", $string);
$result = "";
for ($i = 0; $i < count($words); $i++) {
    // Lower-case the whole word, then upper-case its first character
    $s = strtolower($words[$i]);
    $s = substr_replace($s, strtoupper(substr($s, 0, 1)), 0, 1);
    $result .= "$s ";
}
$string = trim($result);

As for catching the "O'REILLY" example you gave, splitting the string on both spaces and apostrophes would not work, as it would capitalise any letter that appeared after an apostrophe, i.e. the s in Fred's.

So I would probably try something like:

$words = explode(" ", $string);
$result = "";
for ($i = 0; $i < count($words); $i++) {
    $s = strtolower($words[$i]);

    if (substr($s, 0, 2) === "o'") {
        // Upper-case the "o" and the letter after the apostrophe
        $s = substr_replace($s, strtoupper(substr($s, 0, 3)), 0, 3);
    } else {
        $s = substr_replace($s, strtoupper(substr($s, 0, 1)), 0, 1);
    }
    $result .= "$s ";
}
$string = trim($result);

This should catch O'Reilly, O'Clock, O'Donnell, etc. Hope it helps.

Please note this code is untested.
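
For comparison, here is the same idea as a short Python sketch that can be run as-is. The function name `proper_case_words` is hypothetical, and it only special-cases the "o'" prefix, exactly as the PHP version does.

```python
def proper_case_words(text):
    words = []
    for word in text.split(" "):
        w = word.lower()
        if w.startswith("o'"):
            # Upper-case the "o" and the letter after the apostrophe
            w = w[:3].upper() + w[3:]
        else:
            # Upper-case only the first character
            w = w[:1].upper() + w[1:]
        words.append(w)
    return " ".join(words)

print(proper_case_words("O'REILLY"))        # O'Reilly
print(proper_case_words("fred's o'clock"))  # Fred's O'Clock
```

Because only the "o'" prefix is special-cased, "Fred's" keeps its lower-case "s", but names such as "D'Angelo" would still come out as "D'angelo".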

JimmyJ
A: 

You do not mention which language you would like the solution in so here is some pseudo code.

Loop through each character
    If the previous character was an alphabet letter
        Make the character lower case
    Otherwise
        Make the character upper case
End loop
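
One way to write that loop in Python, as a rough sketch (the name `proper_case_chars` is an assumption, and "alphabet letter" is taken to mean whatever `str.isalpha()` accepts):

```python
def proper_case_chars(text):
    out = []
    prev_is_letter = False  # start of string counts as "no previous letter"
    for ch in text:
        # Lower-case inside a word, upper-case at the start of one
        out.append(ch.lower() if prev_is_letter else ch.upper())
        prev_is_letter = ch.isalpha()
    return "".join(out)

print(proper_case_chars("GEORGE BURDELL"))  # George Burdell
print(proper_case_chars("o'reilly"))        # O'Reilly
print(proper_case_chars("fred's"))          # Fred'S
```

Any non-letter starts a new word here, so the apostrophe helps with O'Reilly but also capitalises the "s" in "Fred's".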
GateKiller
+2  A: 

There's also this neat Perl script for title-casing text.

http://daringfireball.net/2008/08/title_case_update

But it sounds like by proper case you mean... for people's names only.

Jeff Atwood
+15  A: 

Unless I've misunderstood your question I don't think you need to roll your own, the TextInfo class can do it for you.

CultureInfo.InvariantCulture.TextInfo.ToTitleCase("GeOrGE bUrdEll")

Will return "George Burdell". And you can use your own culture if there are special rules involved.

Update: Michael (in a comment to this answer) pointed out that this will not work if the input is all caps since the method will assume that it is an acronym. The naive workaround for this is to .ToLower() the text before submitting it to ToTitleCase.

Markus Olsson
Actually, this is incorrect. Your example will return "GEORGE BURDELL". From the docs: "Generally, title casing converts the first character of a word to uppercase and the rest of the characters to lowercase. However, a word that is entirely uppercase, such as an acronym, is not converted."
Michael Wolfenden
@Michael: Right you are... I guess the simple way of avoiding that would be to ensure that the input is lower-cased to begin with. I will update my answer to reflect this.
Markus Olsson
The InvariantCulture is used for operations that require a cultural component but which do not match any actual human culture. Since the original poster is focused on an actual human language (English), it is necessary to use a culture object that is set to English.
Windows programmer
A: 

Here's a perhaps naive C# implementation:-

public class ProperCaseHelper {
  public string ToProperCase(string input) {
    string ret = string.Empty;

    var words = input.Split(' ');

    for (int i = 0; i < words.Length; ++i) {
      ret += wordToProperCase(words[i]);
      if (i < words.Length - 1) ret += " ";
    }

    return ret;
  }

  private string wordToProperCase(string word) {
    if (string.IsNullOrEmpty(word)) return word;

    // Standard case
    string ret = capitaliseFirstLetter(word);

    // Special cases:
    ret = properSuffix(ret, "'");
    ret = properSuffix(ret, ".");
    ret = properSuffix(ret, "Mc");
    ret = properSuffix(ret, "Mac");

    return ret;
  }

  private string properSuffix(string word, string prefix) {
    if(string.IsNullOrEmpty(word)) return word;

    string lowerWord = word.ToLower(), lowerPrefix = prefix.ToLower();
    if (!lowerWord.Contains(lowerPrefix)) return word;

    int index = lowerWord.IndexOf(lowerPrefix);

    // If the search string is at the end of the word ignore.
    if (index + prefix.Length == word.Length) return word;

    return word.Substring(0, index) + prefix +
      capitaliseFirstLetter(word.Substring(index + prefix.Length));
  }

  private string capitaliseFirstLetter(string word) {
    return char.ToUpper(word[0]) + word.Substring(1).ToLower();
  }
}
kronoz
A: 

Kronoz, thank you. I found in your function that the line:

if (!lowerWord.Contains(lowerPrefix)) return word;

must say

if (!lowerWord.StartsWith(lowerPrefix)) return word;

so "información" is not changed to "InforMacIón"
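
One thing to watch, as far as I can tell from the code: if StartsWith is applied to all four rules, the "'" rule no longer fires on "O'Reilly", because the word does not start with an apostrophe. Below is a rough Python sketch of one way around that, anchoring only "Mc" and "Mac" to the start of the word. The function name `word_to_proper_case` mirrors the C# helper but the code is an illustration, not a port, and it leaves out the "." rule.

```python
def word_to_proper_case(word):
    if not word:
        return word
    # Standard case: first letter upper, rest lower
    ret = word[0].upper() + word[1:].lower()

    # Apostrophe rule: may occur anywhere except at the very end
    idx = ret.find("'")
    if 0 <= idx < len(ret) - 1:
        ret = ret[:idx + 1] + ret[idx + 1].upper() + ret[idx + 2:]

    # Mc/Mac rules: only when the word starts with the prefix
    for prefix in ("Mc", "Mac"):
        if ret.startswith(prefix) and len(ret) > len(prefix):
            rest = ret[len(prefix):]
            ret = prefix + rest[0].upper() + rest[1:]
    return ret

print(word_to_proper_case("información"))  # Información
print(word_to_proper_case("O'REILLY"))     # O'Reilly
print(word_to_proper_case("MCDONALD"))     # McDonald
```

This still has false positives: a name like "Macy" becomes "MacY", and "Fred's" becomes "Fred'S", just as in the original.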

best,

Enrique

A: 

I use this as the TextChanged event handler of text boxes. It supports entry of "McDonald".

Public Shared Function DoProperCaseConvert(ByVal str As String, Optional ByVal allowCapital As Boolean = True) As String
    Dim strCon As String = ""
    Dim wordbreak As String = " ,.1234567890;/\-()#$%^&*€!~+=@"
    Dim nextShouldBeCapital As Boolean = True

    'Improve to recognize all caps input
    'If str.Equals(str.ToUpper) Then
    '    str = str.ToLower
    'End If

    For Each s As Char In str.ToCharArray

        If allowCapital Then
            strCon = strCon & If(nextShouldBeCapital, s.ToString.ToUpper, s)
        Else
            strCon = strCon & If(nextShouldBeCapital, s.ToString.ToUpper, s.ToString.ToLower)
        End If

        If wordbreak.Contains(s.ToString) Then
            nextShouldBeCapital = True
        Else
            nextShouldBeCapital = False
        End If
    Next

    Return strCon
End Function
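
The same state machine as a small Python sketch, for readers who want to try it without a VB project. The names `WORD_BREAKS` and `do_proper_case_convert` are adapted from the function above and the word-break characters are copied from it; treat it as an approximation rather than an exact port.

```python
# Characters after which the next character is forced to upper case
WORD_BREAKS = " ,.1234567890;/\\-()#$%^&*€!~+=@"

def do_proper_case_convert(text, allow_capital=True):
    out = []
    next_should_be_capital = True
    for ch in text:
        if next_should_be_capital:
            out.append(ch.upper())
        else:
            # With allow_capital, capitals typed by the user are kept
            out.append(ch if allow_capital else ch.lower())
        next_should_be_capital = ch in WORD_BREAKS
    return "".join(out)

print(do_proper_case_convert("mcDonald"))         # McDonald
print(do_proper_case_convert("mcDonald", False))  # Mcdonald
```

Keeping user-typed capitals is what lets "McDonald" through, at the cost of leaving all-caps input such as "GEORGE" unchanged unless `allow_capital` is False.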
Dasiths
Is there a reason for word breaks to include Mexican pesos, American dollars, and Irish euro, but not English pounds? Is there a reason for word breaks not to include underscores?
Windows programmer
simply NO. You can put any of those characters there in the array. Although if it's sarcasm you are after I don't think it can be put in there.
Dasiths