Commonly used, of course. Klingon doesn't count :-)
Thanks, guys, let me run the willItFit() test cases.
OK, now I've figured out that saving bytes with UTF-8 is causing more problems than it solves. Thanks again.
Characters requiring 3 bytes run from U+0800 through U+FFFF, the rest of the Basic Multilingual Plane; characters from U+10000 upward need 4 bytes. That's a HUGE number of potential characters. It includes East Asian scripts such as Japanese, Chinese, and Korean, as well as Thai.
For a complete list of script ranges, you can refer to Unicode's block data. Only the following blocks can be represented with 1 or 2 bytes; characters from all other blocks require 3 or 4 bytes:
0000..007F Basic Latin
0080..00FF Latin-1 Supplement
0100..017F Latin Extended-A
0180..024F Latin Extended-B
0250..02AF IPA Extensions
02B0..02FF Spacing Modifier Letters
0300..036F Combining Diacritical Marks
0370..03FF Greek and Coptic
0400..04FF Cyrillic
0500..052F Cyrillic Supplement
0530..058F Armenian
0590..05FF Hebrew
0600..06FF Arabic
0700..074F Syriac
0750..077F Arabic Supplement
0780..07BF Thaana
07C0..07FF NKo
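As a rough illustration of that boundary, here is a minimal Python sketch that picks out the characters of a string lying beyond the last block listed above. The function name and the sample string are made up for this example; the only fact it relies on is that U+07FF is the highest code point UTF-8 can encode in 2 bytes.

```python
def chars_needing_3_or_more_bytes(text):
    """Return the characters of text whose UTF-8 form is longer than 2 bytes."""
    # U+07FF (the end of the NKo block) is the last code point that fits in 2 bytes.
    return [ch for ch in text if ord(ch) > 0x07FF]


# Hypothetical sample: Latin-1, Cyrillic, Hiragana and Thai characters, written
# as escapes so the example does not depend on the file's own encoding.
sample = "caf\u00e9 \u0416 \u3042 \u0e17"
for ch in chars_needing_3_or_more_bytes(sample):
    # Cross-check the result against Python's own UTF-8 encoder.
    print(f"U+{ord(ch):04X} takes {len(ch.encode('utf-8'))} bytes")
```

Only the Hiragana and Thai characters should be reported here, since the Latin-1 and Cyrillic ones fall inside the blocks listed above.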
Many Asian scripts take more than 2 bytes per character in UTF-8. Legacy encodings such as Shift_JIS and EUC-KR show that they don't inherently need to, but in UTF-8 Japanese and Korean characters (at least) always take 3 bytes or more.
Here we go:
So the first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This includes Latin letters with diacritics and characters from Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.
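The thresholds in that description can be written down directly. The following is a sketch, not a reference implementation: `utf8_length` and `TWO_BYTE_COUNT` are names invented for this example, and the function assumes it is given a valid code point that is not a surrogate (U+D800..U+DFFF), since those cannot be encoded in UTF-8 at all.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for one code point (surrogates excluded)."""
    if code_point < 0x80:        # U+0000..U+007F: US-ASCII
        return 1
    if code_point < 0x800:       # U+0080..U+07FF: Latin with diacritics, Greek, Cyrillic, ...
        return 2
    if code_point < 0x10000:     # U+0800..U+FFFF: rest of the Basic Multilingual Plane
        return 3
    return 4                     # U+10000..U+10FFFF: the other planes


# The "next 1,920 characters" are exactly the code points in the 2-byte range.
TWO_BYTE_COUNT = 0x800 - 0x80

# Cross-check a few characters against Python's own encoder.
for ch in ["A", "\u00e9", "\u0416", "\u3042", "\ud55c", "\U0001F600"]:
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {utf8_length(ord(ch))} byte(s)")
```

Comparing against `str.encode` keeps the sketch honest: if one of the thresholds were off, the assertion inside the loop would fail.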
More details:
http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes (see Basic Multilingual Plane, codes from 0x0800).
Some examples: Indic scripts, Thai, Philippine scripts, Hiragana, Katakana. So all East Asian scripts and some others.