ansaurus

Question

How do I get the set of all letters in Java/Clojure?

Answer 1

+1 A:

I'm pretty sure the letters aren't available in the standard library, so you're probably left with the manual approach.

Bozhidar Batsov 2010-04-05 12:10:24

Answer 2

+4 A:

No, because that just prints the ASCII letters rather than the full set. Of course, it's trivial to print the 26 lowercase and 26 uppercase letters using two for loops, but there are many more "letters" outside the first 128 code points. Java's Character.isLetter method returns true for these and many others.
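
As a rough illustration of that difference, the sketch below compares Character.isLetter with a plain ASCII range check. The class name IsLetterDemo, the helper isAsciiLetter and the sample characters are made up for this example.

```java
// Illustrative sketch only: IsLetterDemo and isAsciiLetter are hypothetical names.
class IsLetterDemo {

    // True only for the 52 ASCII letters a-z and A-Z.
    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    public static void main(String[] args) {
        // Two ASCII letters, three non-ASCII letters (e-acute, sharp s, Cyrillic ya), one digit.
        char[] samples = {'a', 'Z', '\u00E9', '\u00DF', '\u044F', '1'};
        for (char c : samples) {
            System.out.printf("U+%04X isLetter=%b asciiLetter=%b%n",
                    (int) c, Character.isLetter(c), isAsciiLetter(c));
        }
    }
}
```

The non-ASCII samples are letters according to Character.isLetter but fall outside the a-z/A-Z ranges.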

AlBlue 2010-04-05 12:11:57

That's an excellent point, but I'm not terribly worried about Unicode right now. That said, I suppose I could just use the manual approach. It's not like the alphabet is in danger of changing soon. :-)

Jason Baker 2010-04-05 12:22:12

@Jason: The letter "Capital ß" entered the Unicode standard in 2008! And that's a letter from the Latin alphabet! (Granted, it's used very rarely, but still: not even alphabets are safe from change.)

Joachim Sauer 2010-04-07 12:39:28

Answer 3

+1 A:

The following code gives the same result as the one mentioned in your question. In contrast to the Python solution, the string has to be built manually:

public class Letters {

    public static String asString() {
        StringBuilder buffer = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++)
            buffer.append(c);
        for (char c = 'A'; c <= 'Z'; c++)
            buffer.append(c);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(Letters.asString());
    }

}

codescape 2010-04-05 12:23:37

Answer 4

+8 A:

A properly non-ASCII-centric implementation:

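import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Collects every letter in the Basic Multilingual Plane that the named charset can encode.
private static String allLetters(String charsetName)
{
    CharsetEncoder ce = Charset.forName(charsetName).newEncoder();
    StringBuilder result = new StringBuilder();
    for(char c=0; c<Character.MAX_VALUE; c++)
    {
        if(ce.canEncode(c) && Character.isLetter(c))
        {
            result.append(c);
        }
    }
    return result.toString();
}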

Call this with "US-ASCII" and you'll get the desired result (except that uppercase letters come first). You could call it with Charset.defaultCharset().name(), but I suspect that you'd get far more than the ASCII letters on most systems, even in the USA.

Caveat: only considers the basic multilingual plane. Wouldn't be too hard to extend to the supplementary planes, but it would take a lot longer, and the utility is questionable.
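
One way the extension to the supplementary planes might look is sketched below. It iterates over int code points instead of char values; the class name AllCodePointLetters is hypothetical, and this is an untuned illustration rather than a drop-in replacement for the method above.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Illustrative sketch only: AllCodePointLetters is a hypothetical name.
class AllCodePointLetters {

    // Collects every letter code point (all planes) that the named charset can encode.
    static String allLetters(String charsetName) {
        CharsetEncoder ce = Charset.forName(charsetName).newEncoder();
        StringBuilder result = new StringBuilder();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            // Cheap check first: most code points are not letters.
            if (!Character.isLetter(cp)) {
                continue;
            }
            // A supplementary code point needs two chars (a surrogate pair),
            // so test encodability on the String form, not on a single char.
            String s = new String(Character.toChars(cp));
            if (ce.canEncode(s)) {
                result.appendCodePoint(cp);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(allLetters("US-ASCII"));
        // Count code points, not chars, because of surrogate pairs.
        String utf8 = allLetters("UTF-8");
        System.out.println(utf8.codePointCount(0, utf8.length()));
    }
}
```

The loop visits a little over a million code points, so it is noticeably slower than the BMP-only version.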

Michael Borgwardt 2010-04-05 12:26:10

Character.isLetter(char) covers more than uppercase and lowercase letters. From the API documentation: "A character is considered to be a letter if its general category type, provided by Character.getType(ch), is any of the following: UPPERCASE_LETTER, LOWERCASE_LETTER, TITLECASE_LETTER, MODIFIER_LETTER, OTHER_LETTER. Not all letters have case. Many characters are letters but are neither uppercase nor lowercase nor titlecase."

Michael Konietzka 2010-04-05 13:34:30

Answer 5

+2 A:

string.letters: The concatenation of the strings lowercase and uppercase described below. The specific value is locale-dependent, and will be updated when locale.setlocale() is called.

I modified the answer from Michael Borgwardt. In my implementation there are two lists lowerCases and upperCases for two reasons:

string.letters is the lowercase letters followed by the uppercase letters.
Java's Character.isLetter(char) covers more than just uppercase and lowercase letters, so using Character.isLetter(char) returns too many results under some charsets, for example "windows-1252".

From the API documentation for Character.isLetter(char):

A character is considered to be a letter if its general category type, provided by Character.getType(ch), is any of the following:
* UPPERCASE_LETTER
* LOWERCASE_LETTER
* TITLECASE_LETTER
* MODIFIER_LETTER
* OTHER_LETTER 
Not all letters have case. Many characters are letters but are neither uppercase nor lowercase nor titlecase.

So if string.letters should only return lowercase and uppercase letters, the TITLECASE_LETTER, MODIFIER_LETTER and OTHER_LETTER chars have to be ignored.

public static String allLetters(final Charset charset) {
    final CharsetEncoder encoder = charset.newEncoder();
    final StringBuilder lowerCases = new StringBuilder();
    final StringBuilder upperCases = new StringBuilder();
    for (char c = 0; c < Character.MAX_VALUE; c++) {
        if (encoder.canEncode(c)) {
            if (Character.isUpperCase(c)) {
                upperCases.append(c);
            } else if (Character.isLowerCase(c)) {
                lowerCases.append(c);
            }
        }
    }
    return lowerCases.append(upperCases).toString();
}
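
To make the difference between Character.isLetter and the two case checks used above more concrete, here is a small sketch that maps a code point to the letter category named in the API documentation. The class name LetterCategoryDemo, its helper methods and the sample code points are assumptions chosen for illustration.

```java
// Illustrative sketch only: LetterCategoryDemo and its methods are hypothetical names.
class LetterCategoryDemo {

    // Names the letter category of a code point, or NOT_A_LETTER for anything else.
    static String category(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER: return "UPPERCASE_LETTER";
            case Character.LOWERCASE_LETTER: return "LOWERCASE_LETTER";
            case Character.TITLECASE_LETTER: return "TITLECASE_LETTER";
            case Character.MODIFIER_LETTER:  return "MODIFIER_LETTER";
            case Character.OTHER_LETTER:     return "OTHER_LETTER";
            default:                         return "NOT_A_LETTER";
        }
    }

    // The filter applied by the two-list implementation: keep only cased letters.
    static boolean isUpperOrLower(int codePoint) {
        return Character.isUpperCase(codePoint) || Character.isLowerCase(codePoint);
    }

    public static void main(String[] args) {
        // 'A', 'a', U+01C5 (a titlecase digraph) and U+05D0 (Hebrew alef).
        int[] samples = {'A', 'a', 0x01C5, 0x05D0};
        for (int cp : samples) {
            System.out.printf("U+%04X %s isLetter=%b upperOrLower=%b%n",
                    cp, category(cp), Character.isLetter(cp), isUpperOrLower(cp));
        }
    }
}
```

Hebrew alef, for instance, is a letter with no case, so isLetter accepts it while the upper/lower filter skips it.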

Additionally, the behaviour of string.letters changes when the locale changes. This probably does not carry over to my solution, because changing the default locale does not change the default charset. From the API documentation for Charset.defaultCharset():

The default charset is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system.

I guess the default charset cannot be changed once the JVM has started. So the "change locale" behaviour of string.letters cannot be reproduced with just Locale.setDefault(Locale). But changing the default locale is a bad idea anyway:

Since changing the default locale may affect many different areas of functionality, this method should only be used if the caller is prepared to reinitialize locale-sensitive code running within the same Java Virtual Machine.
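
The guess about the default charset can be checked with a small experiment like the sketch below. The class name LocaleCharsetDemo and the method charsetSurvivesLocaleChange are made up for this example, and the original default locale is restored afterwards precisely because of the warning quoted above.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Illustrative sketch only: LocaleCharsetDemo is a hypothetical name.
class LocaleCharsetDemo {

    // Temporarily switches the default locale and reports whether
    // Charset.defaultCharset() still returns the same charset.
    static boolean charsetSurvivesLocaleChange(Locale temporary) {
        Locale original = Locale.getDefault();
        Charset before = Charset.defaultCharset();
        try {
            Locale.setDefault(temporary);
            return before.equals(Charset.defaultCharset());
        } finally {
            // Always restore the original default locale.
            Locale.setDefault(original);
        }
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Unchanged after Locale.setDefault: "
                + charsetSurvivesLocaleChange(Locale.GERMANY));
    }
}
```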

Michael Konietzka 2010-04-05 13:28:23

Answer 6

+9 A:

If you just want ASCII characters,

(map char (concat (range 65 91) (range 97 123)))

will yield,

(\A \B \C \D \E \F \G \H \I \J \K \L \M \N \O \P \Q \R \S \T \U \V \W \X \Y \Z 
 \a \b \c \d \e \f \g \h \i \j \k \l \m \n \o \p \q \r \s \t \u \v \w \x \y \z)

Hamza Yerlikaya 2010-04-05 14:17:56

+1 No need to wrap the call to char in an anonymous function, `(map char (concat (range 65 91) (range 97 123)))` will work just fine.

Jonas 2010-04-06 04:12:26

Answer 7

+3 A:

Based on Michael's imperative Java solution, this is an idiomatic (lazy sequences) Clojure solution:

(ns stackoverflow
  (:import (java.nio.charset Charset CharsetEncoder)))

(defn all-letters [charset]
  (let [encoder (. (Charset/forName charset) newEncoder)]
    (letfn [(valid-char? [c]
              (and (.canEncode encoder (char c)) (Character/isLetter c)))
            (all-letters-lazy [c]
              (when (<= c (int Character/MAX_VALUE))
                (if (valid-char? c)
                  (lazy-seq
                   (cons (char c) (all-letters-lazy (inc c))))
                  (recur (inc c)))))]
      (all-letters-lazy 0))))

Update: Thanks cgrand for this preferable high-level solution:

(defn letters [charset-name]
  (let [ce (-> charset-name java.nio.charset.Charset/forName .newEncoder)]
    (->> (range 0 (int Character/MAX_VALUE)) (map char)
         (filter #(and (.canEncode ce %) (Character/isLetter %))))))

But the performance comparison between my first approach

user> (time (doall (stackoverflow/all-letters "ascii"))) 
"Elapsed time: 33.333336 msecs"                                                  
(\A \B \C \D \E \F \G \H \I \J \K \L \M \N \O \P \Q \R \S \T \U \V \W \X \Y \Z
 \a \b \c \d \e \f \g \h \i \j \k \l \m \n \o \p \q \r \s \t \u \v \w \x \y \z)

and your solution

user> (time (doall (stackoverflow/letters "ascii"))) 
"Elapsed time: 666.666654 msecs"                                                 
(\A \B \C \D \E \F \G \H \I \J \K \L \M \N \O \P \Q \R \S \T \U \V \W \X \Y \Z
 \a \b \c \d \e \f \g \h \i \j \k \l \m \n \o \p \q \r \s \t \u \v \w \x \y \z)

is quite interesting.

Jürgen Hötzel 2010-04-05 18:49:53

Idiomatic lazy seq fns rarely use lazy-seq directly: lazy-seq is low-level. The core of your code is better written as: (->> (range 0 (int Character/MAX_VALUE)) (map char) (filter #(and (.canEncode ce %) (Character/isLetter %)))) see http://gist.github.com/357407. Another thing: . and .. are somewhat legacy, so don't use them.

cgrand 2010-04-06 09:35:14

Thanks! Why are "." and ".." considered legacy? Any resources?

Jürgen Hötzel 2010-04-08 20:11:38

-> is a better .. because it lets you mix fns and methods (in .method notation), so .. offers nothing except saving you a dot per method call (and it makes the calls harder to spot when you start type-hinting). And (.method obj) is more Lispy because it puts the method in function position. Similarly, prefer Foo. to (new Foo). Give the sugared forms (.foo, Foo. and Foo/BAR) a try and you'll see they are much nicer to use (and allow for easier factoring later on).

cgrand 2010-04-12 09:28:03

Answer 8

A:

In case you don't remember the code point ranges, here is the brute-force way :-P :

user> (require '[clojure.contrib.str-utils2 :as stru2])
nil
user> (set (stru2/replace (apply str (map char (range 0 256))) #"[^A-Za-z]" ""))
#{\A \a \B \b \C \c \D \d \E \e \F \f \G \g \H \h \I \i \J \j \K \k \L \l \M \m \N \n \O \o \P \p \Q \q \R \r \S \s \T \t \U \u \V \v \W \w \X \x \Y \y \Z \z}
user>

nipra 2010-04-07 12:24:34
