ligature

Searching Unicode text using regex

Searching a file written in Hindi (Devanagari, UTF-16) gave rise to the following problem. The file contains: त्रास ततत जुग नींद ना हा बु. Note that the first character, 'त्र', is made up of multiple code points: त + ् + र. Now, while searching for 'त', I get 4 matches, including the त of the first character. I am using Java. How can I go abo...
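One possible way to skip the त inside 'त्र' is a negative lookahead that rejects a match when the letter is followed by a combining mark (the virama ् is U+094D, general category Mn). This is only a sketch: the class and method names are made up for illustration, and it assumes that a letter followed by any mark, including vowel signs such as ा, should not count as a standalone match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StandaloneLetterSearch {

    // Hypothetical helper: returns the start index of every occurrence of
    // `letter` that is NOT immediately followed by a combining mark (\p{M}).
    static List<Integer> findStandalone(String text, String letter) {
        Pattern p = Pattern.compile(Pattern.quote(letter) + "(?!\\p{M})");
        Matcher m = p.matcher(text);
        List<Integer> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(m.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        // TA + VIRAMA + RA + AA + SA, a space, then TA TA TA
        // (the first two words of the sample text, written as escapes
        // so the source file encoding does not matter)
        String text = "\u0924\u094D\u0930\u093E\u0938 \u0924\u0924\u0924";
        System.out.println(findStandalone(text, "\u0924")); // [6, 7, 8]
    }
}
```

The TA at index 0 is skipped because it is followed by the virama, so only the three letters of the second word are reported. If vowel signs should still count as matches, the lookahead could be narrowed to `(?!\\u094D)` instead.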

Detecting Unicode text ligatures in Clojure/Java

Ligatures are Unicode characters that are represented by more than one code point. For example, in Devanagari, त्र is a ligature that consists of the code points त + ् + र. When seen in simple text file editors like Notepad, त्र is shown as त् + र and is stored as three Unicode characters. However, when the same file is opened in Fire...
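A rough way to detect such sequences in Java is to look for the stored pattern itself: a consonant, followed by one or more (virama + consonant) pairs. The sketch below assumes the basic Devanagari consonant range U+0915 to U+0939 and the virama U+094D. It ignores nukta forms and joiner characters, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConjunctFinder {

    // consonant, then one or more (virama + consonant) pairs
    private static final Pattern CONJUNCT =
            Pattern.compile("[\\u0915-\\u0939](?:\\u094D[\\u0915-\\u0939])+");

    // Hypothetical helper: returns every conjunct candidate found in `text`.
    static List<String> findConjuncts(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CONJUNCT.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // TA + VIRAMA + RA + AA + SA, a space, then TA TA TA
        String text = "\u0924\u094D\u0930\u093E\u0938 \u0924\u0924\u0924";
        for (String c : findConjuncts(text)) {
            // Report how many code points make up each candidate
            System.out.println(c + " : " + c.codePointCount(0, c.length()) + " code points");
        }
    }
}
```

This only tells you that the text contains a sequence that a font may render as a ligature. Whether a ligature is actually drawn depends on the font and the rendering engine, which is why the same file can look different in different programs.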