ansaurus

Question

Detecting Unicode text ligatures in Clojure/Java

Answer 1

+2 A:

The Wikipedia page on computer typesetting says:

The Computer Modern Roman typeface provided with TeX includes the five common ligatures ff, fi, fl, ffi, and ffl. When TeX finds these combinations in a text it substitutes the appropriate ligature, unless overridden by the typesetter.

This indicates that the substitution is done by the software rendering the text, not stored in the text itself. Moreover,

Unicode maintains that ligaturing is a presentation issue rather than a character definition issue, and that, for example, "if a modern font is asked to display 'h' followed by 'r', and the font has an 'hr' ligature in it, it can display the ligature."

As far as I can see (I became interested in this topic and have only just started reading a few articles), the instructions for ligature substitution are embedded in the font. Digging further, I found these for you: GSUB - The Glyph Substitution Table and the Ligature Substitution Subtable, both from the OpenType file format specification.

Next, you need to find a library that lets you peek inside OpenType font files, i.e. a file parser for quick access. Reading the following two discussions may give you some direction on how to do these substitutions:

Chromium bug http://code.google.com/p/chromium/issues/detail?id=22240
Firefox bug https://bugs.launchpad.net/firefox/+bug/37828
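To give a sense of what peeking inside an OpenType file involves, here is a minimal Java sketch that only locates a table such as GSUB through the table directory at the start of the file. The class and method names are hypothetical, it assumes a single-font file (not a font collection), and it does not parse the ligature subtables themselves.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of any library: finds a table in a single-font OpenType/TrueType file. */
final class OpenTypeTables {
    /**
     * Returns the byte offset of the table with the given 4-character tag
     * (for example "GSUB"), or -1 if the font has no such table.
     */
    static long findTableOffset(byte[] fontBytes, String tag) {
        ByteBuffer buf = ByteBuffer.wrap(fontBytes);  // big-endian by default, as OpenType requires
        buf.getInt();                                 // sfntVersion (ignored here)
        int numTables = buf.getShort() & 0xFFFF;      // uint16 numTables
        buf.position(12);                             // skip searchRange, entrySelector, rangeShift
        byte[] tagBytes = new byte[4];
        for (int i = 0; i < numTables; i++) {
            buf.get(tagBytes);                        // 4-byte table tag
            buf.getInt();                             // checksum (ignored)
            long offset = buf.getInt() & 0xFFFFFFFFL; // uint32 offset from the start of the file
            buf.getInt();                             // table length (ignored)
            if (tag.equals(new String(tagBytes, StandardCharsets.US_ASCII))) {
                return offset;
            }
        }
        return -1;
    }
}
```

Calling `OpenTypeTables.findTableOffset(bytes, "GSUB")` on the bytes of a font file would give the position where a real parser would start reading the lookup list that contains the ligature substitutions.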

Ankit Jain 2010-08-12 10:55:35

Looks good. I'll go through the articles and bug patch codes and try to find a solution.

abhin4v 2010-08-12 11:13:06

Answer 2

+2 A:

What you are talking about are not ligatures (at least not in Unicode parlance) but grapheme clusters. There is a standard annex that is concerned with discovering text boundaries, including grapheme cluster boundaries:

http://www.unicode.org/reports/tr29/tr29-15.html#Grapheme_Cluster_Boundaries

Also see the description of tailored grapheme clusters in regular expressions:

http://unicode.org/reports/tr18/#Tailored_Graphemes_Clusters

And the definition of collation graphemes:

http://www.unicode.org/reports/tr10/#Collation_Graphemes

I think that these are starting points. The harder part will probably be to find a Java implementation of the Unicode collation algorithm that works for Devanagari locales. If you find one, you can analyze strings without resorting to OpenType features. This would be a bit cleaner since OpenType is concerned with purely presentational details and not with character or grapheme cluster semantics, but the collation algorithm and the tailored grapheme cluster boundary finding algorithm look as if they can be implemented independently of fonts.
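As a font-independent starting point, the JDK's java.text.BreakIterator can split a string at character (grapheme-like) boundaries. The sketch below is only an illustration with a hypothetical class name; whether Devanagari conjuncts stay together depends on the rules shipped with the JDK in use, which is not verified here, and it is not the tailored or collation-based segmentation described in the annexes above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits text at the JDK's default character boundaries. */
final class GraphemeClusters {
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> clusters = new ArrayList<>();
        int start = it.first();
        // Walk from boundary to boundary; each [start, end) range is one cluster.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            clusters.add(text.substring(start, end));
        }
        return clusters;
    }
}
```

For example, a base letter followed by a combining accent is returned as a single element rather than as two separate chars.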

Philipp 2010-08-12 11:16:14

Answer 3

+1 A:

You may be able to get this information from the GlyphVector class.

For a given String a Font instance can create a GlyphVector that can provide information about the rendering of the text.

The layoutGlyphVector() method on the Font can provide this.

The FLAG_COMPLEX_GLYPHS bit in the value returned by the GlyphVector's getLayoutFlags() method tells you whether the glyphs do not map one-to-one to the input characters.

The following code shows an example of this:

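import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.GlyphVector;
import javax.swing.JTextField;

JTextField textField = new JTextField();
Font font = textField.getFont(); // or any Font that covers the script being tested
String textToTest = "abcdefg";
FontRenderContext fontRenderContext = textField.getFontMetrics(font).getFontRenderContext();

// Lay out the whole string (the limit argument is an end index, not a count).
char[] chars = textToTest.toCharArray();
GlyphVector glyphVector = font.layoutGlyphVector(fontRenderContext, chars, 0, chars.length, Font.LAYOUT_LEFT_TO_RIGHT);
int layoutFlags = glyphVector.getLayoutFlags();
boolean hasComplexGlyphs = (layoutFlags & GlyphVector.FLAG_COMPLEX_GLYPHS) != 0;
int numberOfGlyphs = glyphVector.getNumGlyphs();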

numberOfGlyphs should represent the number of glyphs used to display the input text.

The example above takes the FontRenderContext from a Java GUI component, although the class also has public constructors.
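Without a Swing component, the context can be built with the public FontRenderContext(AffineTransform, boolean, boolean) constructor. The sketch below assumes an identity transform with anti-aliasing and fractional metrics switched on; the class and method names are hypothetical, and the resulting metrics may differ slightly from those of an on-screen component.

```java
import java.awt.Font;
import java.awt.font.FontRenderContext;
import java.awt.font.GlyphVector;

/** Hypothetical helper: glyph layout without creating any GUI component. */
final class HeadlessLayout {
    /** Render context with an identity transform, anti-aliasing on, fractional metrics on. */
    static FontRenderContext context() {
        return new FontRenderContext(null, true, true);
    }

    /** Lays out the whole string with the given font; the font choice is left to the caller. */
    static GlyphVector layout(Font font, String text) {
        char[] chars = text.toCharArray();
        return font.layoutGlyphVector(context(), chars, 0, chars.length, Font.LAYOUT_LEFT_TO_RIGHT);
    }
}
```

A call such as `HeadlessLayout.layout(someFont, textToTest)` would then replace the JTextField-based setup, where `someFont` is whatever Font covers the script being tested.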

Aaron 2010-08-12 11:26:23

Does not work. `hasComplexGlyphs` comes out true, but `numberOfGlyphs` equals the length of the Unicode text.

abhin4v 2010-08-12 13:20:59

Answer 4

A:

I think that what you are really looking for is Unicode Normalization.

For Java you should check http://download.oracle.com/javase/6/docs/api/java/text/Normalizer.html

By choosing the proper normalization form you can obtain what you are looking for.

Sorin Sbarnea 2010-08-12 12:33:36

Does not work. All normalization modes return the same text as the input.

abhin4v 2010-08-12 13:21:43

Normalization works on the level of code points and Unicode equivalence relations and has no notion of grapheme clusters.

Philipp 2010-08-12 14:06:34
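A short illustration of that point, using java.text.Normalizer with a hypothetical wrapper class: normalization only maps between equivalent code point sequences, so it rewrites the compatibility character U+FB01 and composes a letter with a combining accent, but it leaves a Devanagari consonant + virama + consonant sequence untouched because no equivalent sequence is defined for it.

```java
import java.text.Normalizer;

/** Hypothetical wrapper around java.text.Normalizer for the two composed forms. */
final class NormalizationDemo {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }
}
```

Here `nfkc("\uFB01")` yields the two letters "fi" and `nfc("e\u0301")` yields the single code point U+00E9, while the sequence U+0915 U+094D U+0937 comes back unchanged from both methods, which matches the observation in the comment above.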

Answer 5

+1 A:

While Aaron's answer is not exactly correct, it pushed me in the right direction. After reading through the Java API docs of java.awt.font.GlyphVector and experimenting a lot in the Clojure REPL, I was able to write a function which does what I want.

The idea is to find the width of each glyph in the GlyphVector and to merge every zero-width glyph into the nearest preceding glyph with non-zero width. The solution is in Clojure, but it should be translatable to Java if required.

(ns net.abhinavsarkar.unicode
  (:import [java.awt.font TextAttribute GlyphVector]
           [java.awt Font]
           [javax.swing JTextArea]))

(let [^java.util.Map text-attrs {
        TextAttribute/FAMILY "Arial Unicode MS"
        TextAttribute/SIZE 25
        TextAttribute/LIGATURES TextAttribute/LIGATURES_ON}
      font (Font/getFont text-attrs)
      ta (doto (JTextArea.) (.setFont font))
      frc (.getFontRenderContext (.getFontMetrics ta font))]
  (defn unicode-partition
    "Takes a Unicode string and returns a vector of strings by partitioning
    the input string in such a way that multiple code points of a single
    ligature are in the same partition in the output vector."
    [^String text]
    (let [glyph-vector 
            (.layoutGlyphVector
              font, frc, (.toCharArray text),
              0, (.length text), Font/LAYOUT_LEFT_TO_RIGHT)
          glyph-num (.getNumGlyphs glyph-vector)
          glyph-positions
            (map first (partition 2
                          (.getGlyphPositions glyph-vector 0 glyph-num nil)))
          glyph-widths
            (map -
              (concat (next glyph-positions)
                      [(.. glyph-vector getLogicalBounds width)])
              glyph-positions)
          glyph-indices 
            (seq (.getGlyphCharIndices glyph-vector 0 glyph-num nil))
          glyph-index-width-map (zipmap glyph-indices glyph-widths)
          corrected-glyph-widths
            (vec (reduce
                    (fn [acc [k v]] (do (aset acc k v) acc))
                    (make-array Float (count glyph-index-width-map))
                    glyph-index-width-map))]
      (loop [idx 0 pidx 0 char-seq text acc []]
        (if (nil? char-seq)
          acc
          (if-not (zero? (nth corrected-glyph-widths idx))
            (recur (inc idx) (inc pidx) (next char-seq)
              (conj acc (str (first char-seq))))
            (recur (inc idx) pidx (next char-seq)
              (assoc acc (dec pidx)
                (str (nth acc (dec pidx)) (first char-seq))))))))))
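For readers who want the Java side, the final merging loop of the function above can be sketched as follows. This is an illustration rather than a translation of the whole function: the class name is hypothetical, the per-character widths are assumed to have been computed from the GlyphVector already, and, unlike the Clojure loop, a zero-width character at the very start simply opens the first partition.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups chars so that zero-width ones join the preceding partition. */
final class ZeroWidthMerge {
    /**
     * widths[i] is the advance width attributed to text.charAt(i);
     * a width of 0 means the char is drawn as part of an earlier glyph.
     */
    static List<String> partition(String text, float[] widths) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (widths[i] != 0f || parts.isEmpty()) {
                parts.add(String.valueOf(c));         // non-zero width: start a new partition
            } else {
                int last = parts.size() - 1;
                parts.set(last, parts.get(last) + c); // zero width: append to the previous partition
            }
        }
        return parts;
    }
}
```

With widths of 10, 0, 5, 0 for the four chars of "abcd", the result is the two partitions "ab" and "cd".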

Also posted on Gist.

abhin4v 2010-08-13 10:44:50
