views:

440

answers:

7

How would you write ToUpper() if it didn't exist? Bonus points for i18n and L10n

Curiosity sparked by this: http://thedailywtf.com/Articles/The-Long-Way-toUpper.aspx

+6  A: 

I don't think SO can handle the size of the Unicode tables in a single posting :)

Unfortunately, it is not as easy as just calling char.ToUpper() on every character.

Example:

(string-upcase "Straße")    ⇒ "STRASSE"
(string-downcase "Straße")  ⇒ "straße"
(string-upcase "ΧΑΟΣ")      ⇒ "ΧΑΟΣ"
(string-downcase "ΧΑΟΣ")    ⇒ "χαος"
(string-downcase "ΧΑΟΣΣ")   ⇒ "χαοσς"
(string-downcase "ΧΑΟΣ Σ")  ⇒ "χαος σ"
(string-upcase "χαος")      ⇒ "ΧΑΟΣ"
(string-upcase "χαοσ")      ⇒ "ΧΑΟΣ"
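For readers who want to try the same cases outside Scheme, here is a rough Python sketch. It assumes a CPython 3 interpreter, whose built-in str.upper() and str.lower() apply the locale-independent Unicode full case mappings, including the final-sigma rule. The `cases` list is only an illustrative name.

```python
# Each entry: (conversion, input, expected result from the R6RS examples above).
cases = [
    (str.upper, "Straße", "STRASSE"),  # one character becomes two
    (str.lower, "Straße", "straße"),
    (str.lower, "ΧΑΟΣ", "χαος"),       # word-final sigma becomes ς
    (str.lower, "ΧΑΟΣΣ", "χαοσς"),     # only the last sigma is final
    (str.lower, "ΧΑΟΣ Σ", "χαος σ"),   # an isolated sigma is not final
    (str.upper, "χαος", "ΧΑΟΣ"),
    (str.upper, "χαοσ", "ΧΑΟΣ"),       # σ and ς share one capital
]

for convert, text, expected in cases:
    result = convert(text)
    print(f"{convert.__name__}({text!r}) -> {result!r}")
    assert result == expected
```

The point is that the result can change length and can depend on a character's position in the word, so a plain per-character lookup cannot reproduce it.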
leppie
(string-upcase "Straße") ⇒ "STRAẞE"
hangy
Hangy, sorry, that does not render. Also my conversions are locale-independent (guess I should have mentioned that ;p).
leppie
And I simply pasted from the R6RS Scheme spec, it could be a typo, will check the tests.
leppie
Seems to be correct. These Scheme guys are really pedantic, I will take their word for it :)
leppie
The upper case ß was just added to the Unicode standard by updating some ISO standard back in April, so font support is really rare. :) Also, the Duden has not accepted it into the standard language yet, so yours *is* correct. :) Just wanted to point out another future possibility.
hangy
Thanks for the clarification, will reference your post :)
leppie
+1  A: 
e.James
That's pretty much exactly the macro as it used to be in strings.h.
Paul Tomblin
@Paul Tomblin: Nice! I was hoping to come close :)
e.James
What about the upper 128 chars? Did you mean 7-bit?
leppie
Come to think of it, if I remember correctly, I think the macro actually added ('A'-'a'). And yes, @leppie, it only worked for ASCII, which by definition is 7 bit.
Paul Tomblin
The check for (c < 'a') || (c > 'z') takes care of 128..255 (or -128..-1 if a signed char is provided). Bottom line is that only the 26 characters from 'a' to 'z' are modified.
e.James
eJames: the nitpick was that ASCII is only 7 bit. The eighth bit is always 0 or you're not really using ASCII.
Joachim Sauer
Ah, OK. Fair enough! I shall modify my answer.
e.James
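The ASCII-only technique discussed in these comments can be sketched roughly as follows. This is not the original answer's code or the historical macro, just an illustration in Python of the range check and the ('A' - 'a') offset mentioned above. The names `ascii_to_upper` and `ascii_upper` are made up for this example.

```python
def ascii_to_upper(c: str) -> str:
    """Uppercase a single character, touching only ASCII 'a'..'z'."""
    # Anything outside 'a'..'z' (digits, punctuation, non-ASCII) passes through.
    if c < "a" or c > "z":
        return c
    # 'A' - 'a' is -32 in ASCII, so adding it shifts into the capital range.
    return chr(ord(c) + (ord("A") - ord("a")))


def ascii_upper(s: str) -> str:
    """Apply ascii_to_upper to every character of a string."""
    return "".join(ascii_to_upper(c) for c in s)


print(ascii_upper("Hello, wörld!"))
```

Only the 26 letters are changed; the "ö" is left alone, which is exactly the limitation the comments point out.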
+1  A: 

I would buy a server and a domain www.toupper.com and put a neat web service there to convert text to upper. And on the server side this function would be implemented by leppie. Leppie would figure out in most languages how to convert text to upper.

tomaszs
I guess your solution doesn't aim for performance. Let's say you would like to uppercase 100k words; that would take so long, haha
Daok
Who voted up this rubbish?
Daniel Cassidy
Daniel, should I take it as offensive? The question seems to be funny, so I see no contradiction in answering in a funny way.
tomaszs
A: 

In Python:

toupper_map = {"a": "A", "b": "B"}  # ...and so on: a massive dictionary to handle all cases in all languages
def to_upper( c ):
   return toupper_map.get( c, c )

or, if you want to do it the "wrong way"

def to_upper( c ):
  for k,v in toupper_map.items():
     if k == c: return v
  return c
hasen j
A: 

Let me suggest even more bonus points for languages such as Hebrew, Arabic, Georgian and others that just do not have capital (upper case) letters. :-)

bgbg
for those languages it would be extremely simple ... anyway Arabic and Hebrew have their own set of string manipulation functionality they require.
hasen j
+2  A: 

No static table is going to be sufficient because you need to know the language before you know the correct transforms.

e.g. In Turkish, i needs to go to İ (U+0130), whereas in most other languages it needs to go to I (U+0049). And the i is the same character, U+0069.
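A minimal sketch of what "knowing the language" means in code, in Python. Python's built-in str.upper() takes no language argument, so the helper `upper_for_language` and the set `DOTTED_I_LANGUAGES` are hypothetical names invented here, and only the dotted/dotless i rule is handled. A real i18n library covers far more.

```python
# Assumption: these language codes use the dotted capital İ (Turkish, Azerbaijani).
DOTTED_I_LANGUAGES = {"tr", "az"}


def upper_for_language(text: str, lang: str) -> str:
    """Uppercase text, special-casing 'i' for languages with a dotted capital."""
    if lang in DOTTED_I_LANGUAGES:
        # Map i (U+0069) to İ (U+0130) before the generic conversion.
        # Dotless ı (U+0131) already uppercases to I under the default rules.
        text = text.replace("i", "\u0130")
    return text.upper()


print(upper_for_language("istanbul", "tr"))
print(upper_for_language("istanbul", "en"))
```

The same input string gives two different answers, which is why a single static table keyed only by character is not enough.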

Douglas Leeder
Uff. I guess that's why a proper i18n library takes up >10MB. Crazy people. Why couldn't our ancestors just settle for a nice simple SINGLE writing system? :P
Vilx-
+7  A: 
  1. I download the Unicode tables
  2. I import the tables into a database
  3. I write a method upper().

Here is a sample implementation ;)

public static String upper(String s) throws SQLException {
    if (s == null) {
        return null;
    }

    final int N = s.length(); // Mind the optimization!
    Connection conn = null;
    PreparedStatement stmtName = null;
    PreparedStatement stmtSmall = null;
    ResultSet rsName = null;
    ResultSet rsSmall = null;
    StringBuilder buffer = new StringBuilder (N); // Much faster than StringBuffer!
    try {
        conn = DBFactory.getConnection();
        stmtName = conn.prepareStatement("select name from unicode.chart where codepoint = ?");
        // TODO Optimization: Maybe move this in the if() so we don't create this
        // unless there are lowercase characters in the string.
        stmtSmall = conn.prepareStatement("select codepoint from unicode.chart where name = ?");
        for (int i=0; i<N; i++) {
            int c = s.charAt(i);
            stmtName.setInt(1, c);
            rsName = stmtName.executeQuery();
            if (rsName.next()) {
                String name = rsName.getString(1);
                if (name.contains(" SMALL ")) {
                    name = name.replaceAll(" SMALL ", " CAPITAL ");

                    stmtSmall.setString(1, name);
                    rsSmall = stmtSmall.executeQuery();
                    if (rsSmall.next()) {
                        c = rsSmall.getInt(1);
                    }

                    rsSmall = DBUtil.close(rsSmall);
                }
            }
            rsName = DBUtil.close(rsName);
            // Append the (possibly converted) character to the result
            buffer.appendCodePoint(c);
        }
    }
    finally {
        // Always clean up
        rsSmall = DBUtil.close(rsSmall);
        rsName = DBUtil.close(rsName);
        stmtSmall = DBUtil.close(stmtSmall);
        stmtName = DBUtil.close(stmtName);
    }

    // TODO Optimization: Maybe read the table once into RAM at the start
    // Would waste a lot of memory, though :/
    return buffer.toString();
}

;)

Note: The Unicode charts which you can find on unicode.org contain the name of the character/code point. This string will contain " SMALL " for characters which are lowercase (mind the blanks or it might match "SMALLER" and the like). Now, you can search for a similar name with "SMALL" replaced with "CAPITAL". If you find it, you've found the capital version.
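The same name-swapping trick can be tried without a database. The following is only a sketch in Python, using the standard unicodedata module in place of the imported tables. `upper_by_name` is a made-up name, and the approach is as tongue-in-cheek as the Java above.

```python
import unicodedata


def upper_by_name(s: str) -> str:
    """Uppercase by swapping ' SMALL ' for ' CAPITAL ' in Unicode character names."""
    out = []
    for ch in s:
        # Unnamed characters (e.g. control codes) yield an empty name.
        name = unicodedata.name(ch, "")
        if " SMALL " in name:
            try:
                ch = unicodedata.lookup(name.replace(" SMALL ", " CAPITAL "))
            except KeyError:
                # No character with the swapped name: keep the original.
                pass
        out.append(ch)
    return "".join(out)


print(upper_by_name("hello world"))
print(upper_by_name("straße"))
```

Note that this maps ß to the single capital ẞ (LATIN CAPITAL LETTER SHARP S) rather than to "SS", and it leaves characters such as the Greek final sigma untouched because no "CAPITAL" counterpart with that name exists. So the name trick is a per-character approximation, not the full Unicode case mapping.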

Aaron Digulla
4. PROFIT :) +1 nice answer
leppie