In different Unicode encodings, for example UTF-16LE or UTF-8, a character may occupy a different number of bytes; a Chinese character, for instance, takes 2 bytes in UTF-16LE and 3 bytes in UTF-8. Many Unicode-aware applications don't take the display width of characters into account and treat them all as if they were Latin letters. For example, an 80-column line should hold 40 Chinese characters or 80 Latin letters, but most applications (Eclipse, Notepad++, and all well-known text editors; I doubt there is any good exception) count each Chinese character as width 1, the same as a Latin letter. This makes the resulting layout ugly and misaligned.
For example, with a tab width of 8, you get the following ugly result (every Unicode character counted as display width 1):
apple   10
banana  7
苹果      6
猕猴桃     31
pear    16
However, the expected layout is (each Chinese character counted as width 2):
apple   10
banana  7
苹果    6
猕猴桃  31
pear    16
This incorrect calculation of display width makes these editors useless for tab alignment, line wrapping, and paragraph reformatting.
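To make the difference concrete, here is a minimal sketch of tab expansion with a pluggable width function. The names `TabAlign`, `expandTabs`, and `hanAwareWidth` are made up for illustration, and the rule "Han ideographs are width 2, everything else is width 1" is only an assumption, not a complete width table.

```java
import java.util.function.IntUnaryOperator;

public class TabAlign {
    // Assumption for illustration: Han ideographs take 2 columns, everything else 1.
    static int hanAwareWidth(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN ? 2 : 1;
    }

    // Expands tabs to spaces; the column advances by width.applyAsInt(cp) per code point.
    static String expandTabs(String line, int tabWidth, IntUnaryOperator width) {
        StringBuilder out = new StringBuilder();
        int column = 0;
        for (int i = 0; i < line.length(); ) {
            int cp = line.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '\t') {
                // Pad up to the next multiple of tabWidth.
                int pad = tabWidth - (column % tabWidth);
                for (int k = 0; k < pad; k++) {
                    out.append(' ');
                }
                column += pad;
            } else {
                out.appendCodePoint(cp);
                column += width.applyAsInt(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "苹果\t6";
        // What most editors do: every character counts as one column.
        System.out.println(expandTabs(line, 8, cp -> 1));
        // What I would expect: Chinese characters count as two columns.
        System.out.println(expandTabs(line, 8, TabAlign::hanAwareWidth));
    }
}
```

With the count-as-1 rule, the tab after 苹果 becomes 6 spaces; with the width-2 rule it becomes 4 spaces, which is what lines the number up with the ASCII rows in a fixed-width font.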
The width of a character may vary between fonts, but in every fixed-width terminal font a Chinese character is double width. That is to say, regardless of font, each Chinese character should preferably be displayed with width 2.
One solution is to convert the text to GB2312, where each Chinese character takes 2 bytes, and use the byte count as the width. However, some Unicode characters don't exist in the GB2312 charset (or the GBK charset). And in general, it's not a good idea to compute the display width from the encoded size in bytes.
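A rough sketch of that workaround, assuming the JRE ships the `GBK` charset (it lives in the extended charsets and may be missing from a stripped-down runtime). The class and method names are hypothetical; the method returns -1 when the string contains a character GBK cannot represent, which is exactly where the approach breaks down.

```java
import java.nio.charset.Charset;

public class EncodedWidth {
    // Assumption: the GBK charset is available in this runtime.
    private static final Charset GBK = Charset.forName("GBK");

    // Uses the GBK byte count as a stand-in for display width.
    // Returns -1 if any character cannot be encoded in GBK.
    static int widthViaGbk(String s) {
        if (!GBK.newEncoder().canEncode(s)) {
            return -1;
        }
        return s.getBytes(GBK).length;
    }

    public static void main(String[] args) {
        System.out.println(widthViaGbk("apple"));
        System.out.println(widthViaGbk("苹果"));
    }
}
```

ASCII letters encode to one byte each and Chinese characters to two, so the byte count happens to match the expected column count for these inputs, but only as a side effect of the encoding.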
Simply counting every Unicode character in the range (\u0080..\uFFFF) as width 2 is also incorrect, because many width-1 characters are scattered throughout that range.
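For illustration, the naive range rule could look like the sketch below (`NaiveRangeWidth` and `naiveWidth` are made-up names). It gets Chinese characters right but also reports width 2 for accented Latin letters such as é (U+00E9), which fixed-width Western fonts normally draw in a single column.

```java
public class NaiveRangeWidth {
    // Naive rule: anything in U+0080..U+FFFF is treated as double width.
    static int naiveWidth(int codePoint) {
        return (codePoint >= 0x0080 && codePoint <= 0xFFFF) ? 2 : 1;
    }

    public static void main(String[] args) {
        System.out.println(naiveWidth('a'));     // plain Latin letter
        System.out.println(naiveWidth(0x82F9));  // 苹, a Chinese character
        System.out.println(naiveWidth(0x00E9));  // é, also lands in the "wide" range
    }
}
```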
It is also difficult to calculate the display width of Arabic and Korean letters, because they build a word/character from an arbitrary number of Unicode code points.
So the display width of a single Unicode code point may not be an integer. I think that is OK; it can be rounded to an integer in practice, which is at least better than nothing.
So, is there any attribute in the Unicode standard related to the preferred display width of a character? Or any Java library function to calculate the display width?