ansaurus

Question

How do I use unicode (UTF-8) characters in Clojure regular expressions?

Answer 1

+2 A:

for international characters you need to use Java Character classes, something like [\p{javaLowerCase}\p{javaUpperCase}]+ to match any word character... \w is used for ASCII - see java.util.Regex documentation

Alex Ott 2010-06-23 12:39:04

Answer 2

+7 A:

Can't help with swank or Emacs, I'm afraid. I'm using Enclojure on NetBeans and it works well there.

On matching: As Alex said, \w doesn't work for non-English characters, not even the extended Latin charsets for Western Europe:

(re-seq #"\w+" "prøve")  =>("pr" "ve")   ; Norwegian
(re-seq #"\w+" "mañana") => ("ma" "ana") ; Spanish
(re-seq #"\w+" "große")  => ("gro" "e")  ; German
(re-seq #"\w+" "plaît")  => ("pla" "t")  ; French

The \w skips the extended chars. Using [(?u)\w]+ instead makes no difference, same with the Japanese.

But see this regex reference: \p{L} matches any Unicode character in category Letter, so it actually works for Norwegian

(re-seq #"\p{L}+" "prøve")
=> ("prøve")

as well as for Japanese (at least I suppose so, I can't read it but it seems to be in the ballpark):

(re-seq #"\p{L}+" "日本語 の 文章 に は スペース が 必要 ない って、 本当？")
=> ("日本語" "の" "文章" "に" "は" "スペース" "が" "必要" "ない" "って" "本当")

There are lots of other options, like matching on combining diacritical marks and whatnot, check out the reference.

Edit: More on Unicode in Java

A quick reference to other points of potential interest when working with Unicode.

Fortunately, Java generally does a very good job of reading and writing text in the correct encodings for the location and platform, but occasionally you need to override it.

This is all Java, most of this stuff does not have a Clojure wrapper (at least not yet).

java.nio.charset.Charset - represents a charset like US-ASCII, ISO-8859-1, UTF-8
java.io.InputStreamReader - lets you specify a charset to translate from bytes to strings when reading. There is a corresponding OutputStreamWriter.
java.lang.String - lets you specify a charset when creating a String from an array of bytes.
java.lang.Character - has methods for getting the Unicode category of a character and converting between Java chars and Unicode code points.
java.util.regex.Pattern - specification of regexp patterns, including Unicode blocks and categories.

Java characters/strings are UTF-16 internally. The char type (and its wrapper Character) is 16 bits, which is not enough to represent all of Unicode, so many non-Latin scripts need two chars to represent one symbol.

When dealing with non-Latin Unicode it's often better to use code points rather than characters. A code point is one Unicode character/symbol represented as an int. The String and Character classes have methods for converting between Java chars and Unicode code points.

unicode.org - the Unicode standard and code charts.

I'm putting this here since I occasionally need this stuff, but not often enough to actually remember the details from one time to the next. Sort of a note to my future self, and it might be useful to others starting out with international languages and encodings as well.

j-g-faustus 2010-06-23 13:06:20

Nice! Yes, the \p{L} switch was exactly what I wanted, and I'm actually a little surprised that it worked for Japanese. Thanks for the link, too.

ivar 2010-06-24 03:59:29

You're welcome. Good luck with the word breaks :)

j-g-faustus 2010-06-24 07:38:22

Answer 3

+1 A:

For katakana, Wikipedia shows you the Unicode ordering. So if you wanted to use a regex character class that caught all the katakana, I suppose you could do something like this:

user> (re-seq #"[\u30a0-\u30ff]+" "日本語の文章にはスペースが必要ないって、本当？")
("スペース")

Hiragana, for what it's worth:

user> (re-seq #"[\u3040-\u309f]+" "日本語の文章にはスペースが必要ないって、本当？")
("の" "には" "が" "ないって")

I'd be pretty amazed if any regex could detect Japanese word breaks.

Brian Carper 2010-06-23 17:37:53

Thanks for the help, Mr. Japanese-reading Cow. ^_^ Agreed, a regex that could detect Japanese word breaks would be absolutely incredible.

ivar 2010-06-24 03:57:25

Answer 4

+2 A:

I'll answer half a question here:

How do I set emacs/slime/swank to use UTF-8 when talking with Clojure, or use UTF-8 at the command-line REPL?

A more interactive way:

M-x customize-group
"slime-lisp"
Find the option for slime coding system, and select utf-8-unix. Save this so Emacs picks it up in your next session.

Or place this in your .emacs:

(custom-set-variables '(slime-net-coding-system (quote utf-8-unix)))

That's what the interactive menu will do anyway.

Works on Emacs 23 and works on my machine

Leonel 2010-07-23 12:33:05

ansaurus

tags:

views:

answers:

How do I use unicode (UTF-8) characters in Clojure regular expressions?

related questions