How are non-English programming/scripting languages developed?

Do you need to be a computer scientist?

+3  A: 

You need to understand how Unicode works to build a parser for a language written in a non-English script, and yes, you do need to be a CS major or be able to teach yourself compiler design.

  1. Study Unicode -- learn to use ICU, or use a language with good Unicode support.
  2. Decide on and build a VM (or use an existing one).
  3. Write a lexer/parser, or use something like ANTLR (Java-based).
  4. Decide on an AST.
  5. Generate the instruction stream for the VM.
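
To make the steps above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the tiny expression language, the function names (`lex`, `parse`, `compile_ast`, `run`), and the instruction names are all made up, and Python's built-in Unicode support stands in for ICU. It is nowhere near a real compiler, but it walks through lexing Unicode source, building an AST, and generating an instruction stream for a small stack VM.

```python
import re
import unicodedata

# In Python 3, \d and \w are Unicode-aware for str patterns, so identifiers
# and digits from non-Latin scripts are matched without extra work.
TOKEN = re.compile(r"\s*(?:(?P<num>\d+)|(?P<name>\w+)|(?P<op>[-+*()]))")

def lex(source):
    # Step 1: normalize so that visually identical identifiers compare equal.
    source = unicodedata.normalize("NFC", source).rstrip()
    pos, tokens = 0, []
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if not m:
            raise SyntaxError(f"unexpected character near position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def parse(tokens):
    # Steps 3 and 4: recursive descent parser; the AST is nested tuples.
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else ("eof", "")
    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok
    def atom():
        kind, text = advance()
        if kind == "num":
            return ("num", int(text))  # int() accepts Unicode decimal digits
        if kind == "name":
            return ("var", text)
        if text == "(":
            node = expr()
            if advance()[1] != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {text!r}")
    def term():
        node = atom()
        while peek() == ("op", "*"):
            advance()
            node = ("bin", "*", node, atom())
        return node
    def expr():
        node = term()
        while peek() in (("op", "+"), ("op", "-")):
            op = advance()[1]
            node = ("bin", op, node, term())
        return node
    tree = expr()
    if peek()[0] != "eof":
        raise SyntaxError(f"unexpected token {peek()[1]!r}")
    return tree

def compile_ast(node, code=None):
    # Step 5: post-order walk that emits instructions for a stack machine.
    code = [] if code is None else code
    if node[0] == "num":
        code.append(("PUSH", node[1]))
    elif node[0] == "var":
        code.append(("LOAD", node[1]))
    else:
        _, op, left, right = node
        compile_ast(left, code)
        compile_ast(right, code)
        code.append(({"+": "ADD", "-": "SUB", "*": "MUL"}[op], None))
    return code

def run(code, env):
    # Step 2: a very small stack VM that executes the instruction stream.
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        elif op == "LOAD":
            stack.append(env[arg])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append({"ADD": a + b, "SUB": a - b, "MUL": a * b}[op])
    return stack.pop()
```

For example, `run(compile_ast(parse(lex("цена * (2 + ३)"))), {"цена": 10})` evaluates a Cyrillic variable name and a Devanagari digit and returns 50. The NFC normalization in `lex` matters because the same visible identifier can be typed with precomposed or combining characters.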
Hassan Syed
+2  A: 

check out "Principles of Compiler Design"

stillstanding
already provided the link to that :D
Hassan Syed
+1  A: 

You use a character encoding capable of representing extended characters, such as UTF-8. UTF-8 encodes each code point in one to four bytes. UTF-16 uses one 16-bit code unit for most characters and two (a surrogate pair) for code points above U+FFFF, and UTF-32 always uses four bytes. The problem that arises with the multi-byte code units of UTF-16 and UTF-32 is byte order (endianness): machines with different byte orders store the bytes of each code unit in opposite orders. This is separate from bidi, the bidirectional text algorithm, which governs how left-to-right and right-to-left scripts are displayed, not how bytes are stored. The solution to the byte order problem is to state the order up front, either with a byte order mark (BOM, U+FEFF) at the start of the stream or by naming a more specific encoding scheme. UTF-16BE, for big endian, puts the most significant byte of each code unit first. The opposite is UTF-16LE, or little endian.
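
A quick way to see the encoded sizes and the byte order issue is to encode a few characters with Python's built-in codecs. This is only an illustrative sketch; the helper name `widths` is made up, and the codec names are the ones Python's standard library uses.

```python
import codecs

def widths(ch):
    # Number of bytes one character occupies in each encoding.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

ascii_a = "A"              # U+0041
e_acute = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
emoji = "\U0001F600"       # U+1F600, needs a surrogate pair in UTF-16

for ch in (ascii_a, e_acute, emoji):
    print(f"U+{ord(ch):04X}", widths(ch))

# Same character, same encoding form, opposite byte orders.
big = ascii_a.encode("utf-16-be")      # b"\x00A"
little = ascii_a.encode("utf-16-le")   # b"A\x00"

# With a BOM in front, the generic "utf-16" decoder works out the order itself.
with_bom = codecs.BOM_UTF16_BE + big
decoded = with_bom.decode("utf-16")
```

Here `big` and `little` hold the same two bytes in reverse order, and `decoded` comes back as `"A"` because the decoder reads the BOM first and then drops it.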

There is also the UCS, Universal Character Set, defined by ISO/IEC 10646. The term is still used, and its code points are kept in sync with Unicode, but the old fixed-width UCS-2 encoding is obsolete because it cannot represent characters that need more than one 16-bit unit. For information about the differences between UCS and Unicode please read this: http://en.wikipedia.org/wiki/Universal_Character_Set#Differences_between_ISO_10646_and_Unicode

Some examples of specifications that name an encoding are the following:
IRI - RFC 3987 - http://www.ietf.org/rfc/rfc3987.txt - mandates UTF-8 encoding
Mail Markup Language - http://mailmarkup.org/ - mandates UTF-16BE encoding