how are non-english programming/scripting languages developed ?
do you need to be a computer scientist ?
how are non-english programming/scripting languages developed ?
do you need to be a computer scientist ?
You need to understand how Unicode works to build a parser in an international language, and yes you do need to be a CS major, or possess the ability to self-teach yourself compiler design.
You use a character set capable of encoding extended characters, such as UTF8. Unicode sets above the 8 bit are written in double byte notation for UTF16 or quadruple byte notation for UTF32. The problem that arises is with regard to dibi, bidirectional notation, where language using different bidi notations may read the bytes in different orders. The solution to the bidi problem was through specification of the byte order prior to the character encoding, but the problem remains of what is before with regard to differences of bidi. So the byte order is clearly stated through a more specific subset of the Unicode character sets. UTF16BE, for big endian, mandates the byte order specification comes prior to the character encoding in a right to left interpretation. The opposite would be UTF16LE, or little endian.
There is also the UCS, Universal Character Set. This term is still used, but it is deprecated as it is not specific enough in concern for the problem mentioned above about characters whose mapping takes more than one byte. For information about the differences between UCS and Unicode please read this: http://en.wikipedia.org/wiki/Universal_Character_Set#Differences_between_ISO_10646_and_Unicode
Some examples are the following:
IRI - RFC 3987 - http://www.ietf.org/rfc/rfc3987.txt - mandates UTF8 encoding
Mail Markup Language - http://mailmarkup.org/ - mandates UTF16BE encoding