Hi,

Say I have the string "HET1200 text string" and I need to change it to "HET1200 Text String". The encoding is UTF-8.

How can I do that? Currently, I use mb_convert_case($string, MB_CASE_TITLE, "UTF-8"); but that changes "HET1200" to "Het1200".

I could specify exceptions, but the list would never be exhaustive. So I'd rather have all uppercase words remain uppercase.

Thanks :)

A: 

You can use ucwords() instead.
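
As a quick illustration of this suggestion (a minimal sketch, not a full solution): ucwords() uppercases only the first character after whitespace and leaves the rest of each word alone, so the all-caps prefix survives. It works on bytes rather than on UTF-8 characters, so non-ASCII first letters are typically left unchanged.

```php
<?php
// ucwords() touches only the first byte of each whitespace-separated word.
$out = ucwords("HET1200 text string");
echo $out, "\n"; // HET1200 Text String
```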

NullUserException
I've considered that, but I need to be able to support Unicode characters.
Lyon
+2  A: 

OK, let's try to recreate mb_convert_case as closely as possible, but changing only the first character of every word.

The relevant part of mb_convert_case implementation is this:

int mode = 0; 

for (i = 0; i < unicode_len; i+=4) {
    int res = php_unicode_is_prop(
        BE_ARY_TO_UINT32(&unicode_ptr[i]),
        UC_MN|UC_ME|UC_CF|UC_LM|UC_SK|UC_LU|UC_LL|UC_LT|UC_PO|UC_OS, 0);
    if (mode) {
        if (res) {
            UINT32_TO_BE_ARY(&unicode_ptr[i],
                php_unicode_tolower(BE_ARY_TO_UINT32(&unicode_ptr[i]),
                    _src_encoding TSRMLS_CC));
        } else {
            mode = 0;
        }   
    } else {
        if (res) {
            mode = 1;
            UINT32_TO_BE_ARY(&unicode_ptr[i],
                php_unicode_totitle(BE_ARY_TO_UINT32(&unicode_ptr[i]),
                    _src_encoding TSRMLS_CC));
        }
    }
}

Basically, this does the following:

  • Set mode to 0. mode tracks whether we are at the first character of a word: if it's 0, we are; otherwise, we're not.
  • Iterate through the characters of string.
    • Determine what kind of character it is.
      • Set res to 1 if it's a word character. More specifically, set it to 1 if it has the property "Mark, Non-Spacing", "Mark, Enclosing", "Other, Format", "Letter, Modifier", "Symbol, Modifier", "Letter, Uppercase", "Letter, Lowercase", "Letter, Titlecase", "Punctuation, Other" or "Other, Surrogate". Oddly, "Letter, Other" is not included.
    • If we're not at the beginning of a word
      • If we're at a word character, convert it to lowercase; this is the step we don't want.
      • Otherwise, we're not at a word character, so we set mode to 0 to signal that the next word character begins a new word.
    • If we're at the beginning of a word and we do have a word character
      • Convert this character to title case
      • Signal we're no longer at the beginning of a word.
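
The effect of the lowercasing step is easy to see with the string from the question. This is only a small demonstration, assuming the mbstring extension is loaded:

```php
<?php
// Built-in title casing: the first letter of each word is title-cased
// and the remaining letters of the word are lowercased.
$out = mb_convert_case("HET1200 text string", MB_CASE_TITLE, "UTF-8");
echo $out, "\n"; // Het1200 Text String
```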

The mbstring extension does not seem to expose the character properties. This leaves us with a problem, because we don't have a good way to determine if a character has any of the 10 properties for which mb_convert_case tests.

Fortunately, Unicode character properties in regular expressions (the \p{..} escapes, used with the /u modifier) can save us here.
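
For instance, a single character can be tested against a general category like this. The helper name is_cased_letter is made up for illustration, and it checks only three of the ten categories:

```php
<?php
// Hypothetical helper: true if $char is a single UTF-8 character in one of
// the categories Lu (uppercase), Ll (lowercase) or Lt (titlecase).
function is_cased_letter(string $char): bool {
    return preg_match('/^[\p{Lu}\p{Ll}\p{Lt}]$/u', $char) === 1;
}

var_dump(is_cased_letter("Á")); // bool(true)
var_dump(is_cased_letter("1")); // bool(false)
var_dump(is_cased_letter(" ")); // bool(false)
```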

A faithful reproduction of mb_convert_case, minus the problematic conversion to lowercase, becomes:

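function mb_convert_case_utf8_variation($s) {
    // Split the UTF-8 string into an array of single characters.
    $arr = preg_split("//u", $s, -1, PREG_SPLIT_NO_EMPTY);
    $result = "";
    $mode = false; // false: the next word character starts a word
    foreach ($arr as $char) {
        // Same categories mb_convert_case tests for (UC_PO maps to \p{Po}).
        $res = preg_match(
            '/\\p{Mn}|\\p{Me}|\\p{Cf}|\\p{Lm}|\\p{Sk}|\\p{Lu}|\\p{Ll}|'.
            '\\p{Lt}|\\p{Po}|\\p{Cs}/u', $char) == 1;
        if ($mode) {
            // Inside a word: leave the character as is (no lowercasing).
            if (!$res)
                $mode = false;
        }
        elseif ($res) {
            // First character of a word: convert it to title case.
            $mode = true;
            $char = mb_convert_case($char, MB_CASE_TITLE, "UTF-8");
        }
        $result .= $char;
    }

    return $result;
}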

Test:

echo mb_convert_case_utf8_variation("HETÁ1200 Ááxt ítring uii");

gives:

HETÁ1200 Ááxt Ítring Uii
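
The same idea can probably be written more compactly with preg_replace_callback, matching only word characters that are not preceded by another word character. This is a sketch rather than a drop-in replacement: the name title_case_keep_upper is made up, and \p{Cs} is left out on the assumption that surrogates cannot occur in valid UTF-8 input.

```php
<?php
// Hypothetical compact variant: title-case the first character of each word
// and leave every other character untouched.
function title_case_keep_upper(string $s): string {
    // Categories treated as word characters (Cs omitted, see above).
    $word = '\p{Mn}\p{Me}\p{Cf}\p{Lm}\p{Sk}\p{Lu}\p{Ll}\p{Lt}\p{Po}';
    $out = preg_replace_callback(
        // A word character that is not preceded by another word character.
        '/(?<![' . $word . '])[' . $word . ']/u',
        function (array $m): string {
            return mb_convert_case($m[0], MB_CASE_TITLE, 'UTF-8');
        },
        $s
    );
    // preg_replace_callback returns null on error (e.g. invalid UTF-8).
    return $out ?? $s;
}

echo title_case_keep_upper("HETÁ1200 Ááxt ítring uii"), "\n";
// HETÁ1200 Ááxt Ítring Uii
```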
Artefacto
Thank you, this is ingenious! I really appreciate your explanation too. :)
Lyon