When writing interpreters for PDF, HTML and other document formats we need to deal with a variety of whitespace characters and other non-printing characters. The ANSI ones are well defined, but how many others are likely to be found in practice? A typical example is this cluster from ISO 10646 (I think):

U+2002  en space
U+2003  em space
U+2009  thin space
U+200C  zero width non-joiner
U+200D  zero width joiner
U+200E  left-to-right mark
U+200F  right-to-left mark

(For obvious reasons the characters do not appear above!).
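
Since the characters themselves cannot be seen, one way to inspect them is to print their code points and names. The following is only a minimal sketch, assuming Python and its standard `unicodedata` module; the function name `describe_invisibles` and the choice of general categories to report are illustrative assumptions, not a complete definition of "non-printing".

```python
import unicodedata

# General categories treated as "invisible" in this sketch (an assumption):
# Zs = space separator, Zl/Zp = line/paragraph separator,
# Cf = format character (ZWJ, ZWNJ, LRM, RLM), Cc = control character.
INVISIBLE_CATEGORIES = ("Zs", "Zl", "Zp", "Cf", "Cc")

def describe_invisibles(text):
    """Return (code point, name) pairs for characters in the categories above."""
    found = []
    for ch in text:
        if unicodedata.category(ch) in INVISIBLE_CATEGORIES:
            # Control characters have no formal name, so supply a fallback.
            name = unicodedata.name(ch, "<unnamed>")
            found.append((f"U+{ord(ch):04X}", name))
    return found

# Hypothetical sample: en space, zero width non-joiner, left-to-right mark.
sample = "a\u2002b\u200cc\u200ed"
for code, name in describe_invisibles(sample):
    print(code, name)
```

Running this over a decoded document gives a quick inventory of which invisible characters actually occur in practice.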

+1  A: 

In the development world there's at least one more (most often used in web development):

   &nbsp;  // non-breaking space

But the further you go into the design world, the more space and invisible characters you see. Publishing software normally has:

  • space - the regular SPACE
  • en space
  • em space
  • thin space
  • hair space
  • non-breaking space
  • non-breaking fixed width space
  • sixth space
  • quarter space
  • third space
  • punctuation space
  • flush space
  • figure space
  • ...
Robert Koritnik
Yes, 0xA0; see http://en.wikipedia.org/wiki/Non-breaking_space
peter.murray.rust
@Robert could you please list the numbers?
peter.murray.rust
#fail. I've just written the ones I see in my InDesign. I'm not sure if all of them are actual Unicode standard ones. Sorry. Some are rather design-oriented (like flush space) and maybe exist only in software.
Robert Koritnik
+1  A: 

Unicode will be with us, in increasing quantity, for a long time. If an HTML or XML document is written in UTF-8 encoded Unicode, then you should expect any and all of these to appear.

In Unicode (Unicode Character Database) the following codepoints are defined as whitespace:

U+0009–U+000D (control characters, containing Tab, CR and LF)
U+0020 SPACE
U+0085 NEL (control character next line)
U+00A0 NBSP (NO-BREAK SPACE)
U+1680 OGHAM SPACE MARK
U+180E MONGOLIAN VOWEL SEPARATOR (no longer classed as whitespace as of Unicode 6.3)
U+2000–U+200A (different sorts of spaces)
U+2028 LS (LINE SEPARATOR)
U+2029 PS (PARAGRAPH SEPARATOR)
U+202F NNBSP (NARROW NO-BREAK SPACE)
U+205F MMSP (MEDIUM MATHEMATICAL SPACE)
U+3000 IDEOGRAPHIC SPACE
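
The list above can be turned into a lookup table directly. This is a rough sketch in Python, not a definitive implementation: the names `WHITESPACE_RANGES`, `is_listed_whitespace` and `collapse_whitespace` are made up for illustration, and the ranges are transcribed from the list rather than read from the Unicode Character Database.

```python
# Inclusive (first, last) code point ranges, transcribed from the list above.
# U+180E is kept because the list includes it; newer Unicode versions no
# longer give it the White_Space property, so drop that entry if needed.
WHITESPACE_RANGES = [
    (0x0009, 0x000D),  # Tab, LF, VT, FF, CR
    (0x0020, 0x0020),  # SPACE
    (0x0085, 0x0085),  # NEL
    (0x00A0, 0x00A0),  # NO-BREAK SPACE
    (0x1680, 0x1680),  # OGHAM SPACE MARK
    (0x180E, 0x180E),  # MONGOLIAN VOWEL SEPARATOR
    (0x2000, 0x200A),  # en quad through hair space
    (0x2028, 0x2029),  # LINE SEPARATOR, PARAGRAPH SEPARATOR
    (0x202F, 0x202F),  # NARROW NO-BREAK SPACE
    (0x205F, 0x205F),  # MEDIUM MATHEMATICAL SPACE
    (0x3000, 0x3000),  # IDEOGRAPHIC SPACE
]

def is_listed_whitespace(ch):
    """True if the single character ch falls in one of the ranges above."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in WHITESPACE_RANGES)

def collapse_whitespace(text):
    """Replace each run of listed whitespace with one ASCII space."""
    out = []
    in_run = False
    for ch in text:
        if is_listed_whitespace(ch):
            if not in_run:
                out.append(" ")
            in_run = True
        else:
            out.append(ch)
            in_run = False
    return "".join(out)

print(repr(collapse_whitespace("a\u2003\u00a0b\tc")))
```

Note that the zero width joiner, zero width non-joiner and the directional marks (U+200C to U+200F) are not in this list, so an interpreter that wants to handle them has to treat them separately.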
Michael Dillon
@Michael thanks - useful. Doesn't overlap with the ones I listed.
peter.murray.rust