How would you write ToUpper() if it didn't exist? Bonus points for i18n and L10n
Curiosity sparked by this: http://thedailywtf.com/Articles/The-Long-Way-toUpper.aspx
I don't think SO can handle the size of the Unicode tables in a single posting :)
Unfortunately, it is not as easy as calling char.ToUpper() on every character.
Example:
(string-upcase "Straße") ⇒ "STRASSE"
(string-downcase "Straße") ⇒ "straße"
(string-upcase "ΧΑΟΣ") ⇒ "ΧΑΟΣ"
(string-downcase "ΧΑΟΣ") ⇒ "χαος"
(string-downcase "ΧΑΟΣΣ") ⇒ "χαοσς"
(string-downcase "ΧΑΟΣ Σ") ⇒ "χαος σ"
(string-upcase "χαος") ⇒ "ΧΑΟΣ"
(string-upcase "χαοσ") ⇒ "ΧΑΟΣ"
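To make the examples above concrete, here is a minimal Python sketch of a hand-rolled conversion. It only covers ASCII and basic Greek, and the names `SPECIAL_UPPER`, `naive_upper` and `lower_sigma_aware` are made up for illustration. The final-sigma check is a simplification (it only looks at the neighbouring characters), not the full Unicode rule.

```python
# Tiny illustrative table; a real one would be generated from the Unicode data files.
SPECIAL_UPPER = {"ß": "SS", "ς": "Σ"}


def naive_upper(text):
    """Uppercase ASCII and basic Greek; one input character may become several."""
    out = []
    for ch in text:
        if ch in SPECIAL_UPPER:
            # Must be checked first: "ς" sits inside the α..ω range below.
            out.append(SPECIAL_UPPER[ch])
        elif "a" <= ch <= "z" or "α" <= ch <= "ω":
            # In both blocks the capital letter is 32 code points lower.
            out.append(chr(ord(ch) - 32))
        else:
            out.append(ch)
    return "".join(out)


def lower_sigma_aware(text):
    """Lowercase ASCII and basic Greek, choosing ς or σ from the context."""
    out = []
    for i, ch in enumerate(text):
        if ch == "Σ":
            # Simplified rule: final sigma if a letter precedes and none follows.
            before = i > 0 and text[i - 1].isalpha()
            after = i + 1 < len(text) and text[i + 1].isalpha()
            out.append("ς" if before and not after else "σ")
        elif "A" <= ch <= "Z" or "Α" <= ch <= "Ω":
            out.append(chr(ord(ch) + 32))
        else:
            out.append(ch)
    return "".join(out)


print(naive_upper("Straße"))        # STRASSE
print(lower_sigma_aware("ΧΑΟΣ Σ"))  # χαος σ
```

The point of the sketch is that the result can be longer than the input and that lowercasing Σ needs to look at its neighbours, so a plain character-by-character table is not enough.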
I would buy a server and the domain www.toupper.com and put a neat web service there to convert text to upper case. On the server side, this function would be implemented by leppie. Leppie would figure out how to convert text to upper case in most languages.
In Python:
toupper_map = {}  # stand-in for a massive dictionary that handles all cases in all languages

def to_upper(c):
    return toupper_map.get(c, c)

Or, if you want to do it the "wrong way":
def to_upper(c):
    for k, v in toupper_map.items():
        if k == c:
            return v
    return c
Let me suggest even more bonus points for languages such as Hebrew, Arabic, Georgian and others that just do not have capital (upper case) letters. :-)
No static table is going to be sufficient because you need to know the language before you know the correct transforms.
For example, in Turkish i needs to go to İ (U+0130), whereas in most other languages it needs to go to I (U+0049). And the i is the same character, U+0069, in both cases.
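A rough Python sketch of what "knowing the language" could look like: a small per-language override table consulted before the language-neutral mapping. The names `TURKIC_UPPER` and `upper_for_language` and the `lang` codes are assumptions for illustration, and the built-in `str.upper()` merely stands in for whatever default table you would write yourself.

```python
# Overrides for Turkish and Azerbaijani: dotted i -> dotted capital,
# dotless i -> plain capital I.
TURKIC_UPPER = {
    "i": "\u0130",  # LATIN CAPITAL LETTER I WITH DOT ABOVE
    "\u0131": "I",  # LATIN SMALL LETTER DOTLESS I
}


def upper_for_language(text, lang="en"):
    """Uppercase text, applying language-specific overrides first."""
    overrides = TURKIC_UPPER if lang in ("tr", "az") else {}
    # str.upper() is only a placeholder for the language-neutral mapping.
    return "".join(overrides.get(ch, ch.upper()) for ch in text)


print(upper_for_language("istanbul", "tr"))  # İSTANBUL
print(upper_for_language("istanbul", "en"))  # ISTANBUL
```

The same input character U+0069 yields two different results, which is why the language has to be an explicit parameter (or come from the locale).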
Here is a sample implementation ;)
// Needs java.sql.Connection, PreparedStatement, ResultSet and SQLException.
// DBFactory and DBUtil are project helpers that are not shown here.
public static String upper(String s) throws SQLException {
    if (s == null) {
        return null;
    }
    final int N = s.length(); // Mind the optimization!
    Connection conn = null;
    PreparedStatement stmtName = null;
    PreparedStatement stmtSmall = null;
    ResultSet rsName = null;
    ResultSet rsSmall = null;
    StringBuilder buffer = new StringBuilder(N); // Much faster than StringBuffer!
    try {
        conn = DBFactory.getConnection();
        stmtName = conn.prepareStatement("select name from unicode.chart where codepoint = ?");
        // TODO Optimization: Maybe move this in the if() so we don't create this
        // unless there are lowercase characters in the string.
        stmtSmall = conn.prepareStatement("select codepoint from unicode.chart where name = ?");
        for (int i = 0; i < N; i++) {
            int c = s.charAt(i);
            stmtName.setInt(1, c);
            rsName = stmtName.executeQuery();
            if (rsName.next()) {
                String name = rsName.getString(1);
                if (name.contains(" SMALL ")) {
                    // Plain string replacement; no regex needed
                    name = name.replace(" SMALL ", " CAPITAL ");
                    stmtSmall.setString(1, name);
                    rsSmall = stmtSmall.executeQuery();
                    if (rsSmall.next()) {
                        c = rsSmall.getInt(1);
                    }
                    rsSmall = DBUtil.close(rsSmall);
                }
            }
            rsName = DBUtil.close(rsName);
            // Collect the (possibly converted) character
            buffer.appendCodePoint(c);
        }
    }
    finally {
        // Always clean up
        rsSmall = DBUtil.close(rsSmall);
        rsName = DBUtil.close(rsName);
        stmtSmall = DBUtil.close(stmtSmall);
        stmtName = DBUtil.close(stmtName);
        conn = DBUtil.close(conn);
    }
    // TODO Optimization: Maybe read the table once into RAM at the start
    // Would waste a lot of memory, though :/
    return buffer.toString();
}
;)
Note: The Unicode charts which you can find on unicode.org contain the name of each character/code point. This name will contain " SMALL " for characters which are lowercase (mind the blanks, or it might match "SMALLER" and the like). Now you can search for the same name with "SMALL" replaced by "CAPITAL". If you find it, you've found the capital version.
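The same name-based lookup can be tried without a database. Here is a small Python sketch using the standard `unicodedata` module, which exposes the character names; the function names `upper_by_name` and `upper_string_by_name` are made up for this example, and this is a toy illustration of the trick, not a correct case-mapping algorithm.

```python
import unicodedata


def upper_by_name(ch):
    """Map one character to its capital form via its Unicode character name."""
    name = unicodedata.name(ch, "")
    if " SMALL " not in name:  # mind the blanks, so "SMALLER" does not match
        return ch
    try:
        return unicodedata.lookup(name.replace(" SMALL ", " CAPITAL "))
    except KeyError:
        # No character carries the rewritten name; leave the input unchanged.
        return ch


def upper_string_by_name(text):
    return "".join(upper_by_name(ch) for ch in text)


print(upper_string_by_name("hello, χαοσ"))  # HELLO, ΧΑΟΣ
```

The limits show up quickly: "ς" is named GREEK SMALL LETTER FINAL SIGMA, and no "CAPITAL" counterpart of that name exists, so it comes back unchanged. "ß" maps to the single character U+1E9E (LATIN CAPITAL LETTER SHARP S) rather than "SS".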