ansaurus

Question

Answer 1

+6 A:

I don't think SO can handle the size of the Unicode tables in a single posting :)

Unfortunately, it is not as easy as just calling char.ToUpper() on every character.

Example:

(string-upcase "Straße")    ⇒ "STRASSE"
(string-downcase "Straße")  ⇒ "straße"
(string-upcase "ΧΑΟΣ")      ⇒ "ΧΑΟΣ"
(string-downcase "ΧΑΟΣ")    ⇒ "χαος"
(string-downcase "ΧΑΟΣΣ")   ⇒ "χαοσς"
(string-downcase "ΧΑΟΣ Σ")  ⇒ "χαος σ"
(string-upcase "χαος")      ⇒ "ΧΑΟΣ"
(string-upcase "χαοσ")      ⇒ "ΧΑΟΣ"
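For comparison, here is a minimal Python sketch of the same effects (not part of the R6RS spec quoted above, and assuming Python 3.3 or later, where the built-in str methods apply the full Unicode case mappings). The helper name lower_per_char is made up for illustration. It shows that upper-casing can change the length of a string, and that lower-casing one character at a time loses the final-sigma context:

```python
def lower_per_char(text):
    """Lower-case each character in isolation (loses word context)."""
    return "".join(ch.lower() for ch in text)


# Upper-casing can change the length: one 'ß' becomes two letters 'SS'.
upper_strasse = "Straße".upper()

# Lower-casing the whole string lets Python see that the last sigma
# ends a word, so it becomes the final form 'ς'.
lower_whole = "ΧΑΟΣ".lower()

# Lower-casing character by character cannot know the position,
# so every 'Σ' becomes the non-final form 'σ'.
lower_chars = lower_per_char("ΧΑΟΣ")

print(upper_strasse, len("Straße"), len(upper_strasse))
print(lower_whole, lower_chars, lower_whole == lower_chars)
```

The whole-string and per-character results differ only in the last letter, which is exactly the case a naive character loop gets wrong.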

leppie 2008-12-02 13:49:51

(string-upcase "Straße") ⇒ "STRAẞE"

hangy 2008-12-02 15:17:47

Hangy, sorry, that does not render. Also, my conversions are locale-independent (I guess I should have mentioned that ;p).

leppie 2008-12-02 16:04:46

And I simply pasted from the R6RS Scheme spec; it could be a typo, so I will check the tests.

leppie 2008-12-02 16:05:51

Seems to be correct. These Scheme guys are really pedantic, I will take their word for it :)

leppie 2008-12-02 16:07:18

The upper case ß was just added to the Unicode standard by updating some ISO standard back in April, so font support is really rare. :) Also, the Duden has not accepted it into the standard language yet, so yours *is* correct. :) I just wanted to point out another future possibility.

hangy 2008-12-03 08:15:11

Thanks for the clarification, will reference your post :)

leppie 2008-12-03 18:44:25

Answer 2

+1 A:

e.James 2008-12-02 13:50:06

That's pretty much exactly the macro as it used to be in strings.h.

Paul Tomblin 2008-12-02 13:52:08

@Paul Tomblin: Nice! I was hoping to come close :)

e.James 2008-12-02 13:54:17

What about the upper 128 chars? Did you mean 7-bit?

leppie 2008-12-02 13:57:23

Come to think of it, if I remember correctly, I think the macro actually added ('A'-'a'). And yes, @leppie, it only worked for ASCII, which by definition is 7 bit.

Paul Tomblin 2008-12-02 14:01:29

The check for (c < 'a') || (c > 'z') takes care of 128..255 (or -128..-1 if a signed char is provided). Bottom line: only the 26 characters from 'a' to 'z' are modified.
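A minimal Python sketch of the ASCII-only approach discussed in these comments. This is an illustration, not the original macro from strings.h; the function name ascii_upper is made up. It works on bytes so that the range check behaves like the C version:

```python
def ascii_upper(data: bytes) -> bytes:
    """Upper-case only the 26 ASCII letters 'a'..'z'; leave all else alone."""
    offset = ord("A") - ord("a")  # -32, the value the comments say was added
    out = bytearray(data)
    for i, b in enumerate(out):
        # Anything outside 'a'..'z' fails this check, including 128..255.
        if ord("a") <= b <= ord("z"):
            out[i] = b + offset
    return bytes(out)


print(ascii_upper(b"hello, World!"))
print(ascii_upper("Straße".encode("latin-1")))
```

In the second call the byte 0xDF (ß in Latin-1) passes through unchanged, which is the limitation the thread is pointing at.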

e.James 2008-12-02 14:26:39

eJames: the nitpick was that ASCII is only 7 bit. The eighth bit is always 0 or you're not really using ASCII.

Joachim Sauer 2008-12-02 14:48:23

Ah, OK. Fair enough! I shall modify my answer.

e.James 2008-12-02 15:05:19

Answer 3

+1 A:

I would buy a server and a domain www.toupper.com and put a neat web service there to convert text to upper case. And on the server side this function would be implemented by leppie. Leppie would figure out how to convert text to upper case in most languages.

tomaszs 2008-12-02 13:55:17

I guess your solution doesn't aim for performance. Let's say you would like to upper-case 100k words, that would take so long haha

Daok 2008-12-02 14:28:55

Who voted up this rubbish?

Daniel Cassidy 2008-12-02 14:44:02

Daniel, should I take that as offensive? The question seems to be funny, so I see no contradiction in answering in a funny way.

tomaszs 2008-12-02 16:27:29

Answer 4

A:

in python ..

toupper_map = {}  # massive dictionary to handle all cases in all languages
def to_upper( c ):
   return toupper_map.get( c, c )

or, if you want to do it the "wrong way"

def to_upper( c ):
  for k,v in toupper_map.items():
     if k == c: return v
  return c

hasen j 2008-12-02 14:02:00

Answer 5

A:

Let me suggest even more bonus points for languages such as Hebrew, Arabic, Georgian and others that just do not have capital (upper case) letters. :-)

bgbg 2008-12-02 14:16:21

For those languages it would be extremely simple ... anyway, Arabic and Hebrew require their own set of string manipulation functionality.

hasen j 2008-12-02 14:21:15

Answer 6

+2 A:

No static table is going to be sufficient because you need to know the language before you know the correct transforms.

E.g. in Turkish, i needs to go to İ (U+0130), whereas in any other language it needs to go to I (U+0049). And the i is the same character, U+0069, in both cases.
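A rough Python sketch of why the language has to be an input. Python's str.upper() is locale-independent, so the Turkish rule must be applied separately. The function name upper_for_language and the set of language codes are assumptions for illustration, not a library API:

```python
# Hypothetical set of language codes that use the dotted/dotless i pair.
DOTTED_I_LANGUAGES = {"tr", "az"}


def upper_for_language(text: str, lang: str) -> str:
    """Upper-case text, applying the Turkish-style rule for 'i' if needed."""
    if lang in DOTTED_I_LANGUAGES:
        # Map i (U+0069) to İ (U+0130) before the generic mapping runs;
        # dotless ı (U+0131) already upper-cases to I (U+0049) by default.
        text = text.replace("i", "\u0130")
    return text.upper()


print(upper_for_language("istanbul", "tr"))
print(upper_for_language("istanbul", "en"))
```

The same input character U+0069 yields two different results depending only on the language argument, which no single static table keyed on the character can express.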

Douglas Leeder 2008-12-02 14:25:35

Uff. I guess that's why a proper i18n library takes up >10MB. Crazy people. Why couldn't our ancestors just settle for a nice simple SINGLE writing system? :P

Vilx- 2008-12-02 14:31:50

Answer 7

+7 A:

1. I download the Unicode tables
2. I import the tables into a database
3. I write a method upper().

Here is a sample implementation ;)

public static String upper(String s) {
    if (s == null) {
        return null;
    }

    final int N = s.length(); // Mind the optimization!
    PreparedStatement stmtName = null;
    PreparedStatement stmtSmall = null;
    ResultSet rsName = null;
    ResultSet rsSmall = null;
    StringBuilder buffer = new StringBuilder (N); // Much faster than StringBuffer!
    try {
        Connection conn = DBFactory.getConnection();
        stmtName = conn.prepareStatement("select name from unicode.chart where codepoint = ?");
        // TODO Optimization: Maybe move this in the if() so we don't create this
        // unless there are lowercase characters in the string.
        stmtSmall = conn.prepareStatement("select codepoint from unicode.chart where name = ?");
        for (int i=0; i<N; i++) {
            int c = s.charAt(i);
            stmtName.setInt(1, c);
            rsName = stmtName.executeQuery(); // execute() only returns a boolean
            if (rsName.next()) {
                String name = rsName.getString(1);
                if (name.contains(" SMALL ")) {
                    name = name.replaceAll(" SMALL ", " CAPITAL ");

                    stmtSmall.setString(1, name);
                    rsSmall = stmtSmall.executeQuery();
                    if (rsSmall.next()) {
                        c = rsSmall.getInt(1);
                    }

                    rsSmall = DBUtil.close(rsSmall);
                }
            }
            rsName = DBUtil.close(rsName);
            buffer.append((char) c); // Keep the (possibly converted) character
        }
    }
    finally {
        // Always clean up
        rsSmall = DBUtil.close(rsSmall);
        rsName = DBUtil.close(rsName);
        stmtSmall = DBUtil.close(stmtSmall);
        stmtName = DBUtil.close(stmtName);
    }

    // TODO Optimization: Maybe read the table once into RAM at the start
    // Would waste a lot of memory, though :/
    return buffer.toString();
}

;)

Note: The Unicode charts which you can find on unicode.org contain the name of the character/code point. This string will contain " SMALL " for characters which are lowercase (mind the blanks or it might match "SMALLER" and the like). Now, you can search for a similar name with "SMALL" replaced with "CAPITAL". If you find it, you've found the capital version.
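The same name-based trick can be tried without a database, as a sketch in Python using the standard unicodedata module (this is not the Java code above, and the function name upper_by_name is made up for illustration):

```python
import unicodedata


def upper_by_name(text: str) -> str:
    """Upper-case by swapping ' SMALL ' for ' CAPITAL ' in character names."""
    result = []
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for characters without a name
        if " SMALL " in name:
            try:
                ch = unicodedata.lookup(name.replace(" SMALL ", " CAPITAL "))
            except KeyError:
                pass  # No character with the CAPITAL name: keep the original
        result.append(ch)
    return "".join(result)


print(upper_by_name("abc é 123"))
```

Characters whose capital counterpart has no matching name are left unchanged: for example ς (GREEK SMALL LETTER FINAL SIGMA) stays as it is. Conversely, ß maps to the single capital sharp s ẞ rather than to "SS", so this does not reproduce the full case mappings.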

Aaron Digulla 2008-12-02 14:41:54

4. PROFIT :) +1 nice answer

leppie 2008-12-02 14:44:06

Reimplementing ToUpper()
