how are non-english programming/scripting languages developed ? | ansaurus

views:

132

answers:

3

Q:

How are non-English programming/scripting languages developed?

Do you need to be a computer scientist?

+3 A:

You need to understand how Unicode works to build a parser for a language whose source is written in a non-English script. And yes, you need either a computer science background or the ability to teach yourself compiler design.

Study Unicode: learn to use ICU, or pick an implementation language with good Unicode support.
Decide on and build a VM (or use an existing one).
Write a lexer/parser, or use a generator such as ANTLR (Java-based).
Decide on an AST.
Generate the instruction stream for the VM.
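
The steps above can be sketched end to end in a few dozen lines. This is only a toy under stated assumptions, not how any real non-English language is built: the two Russian keywords (`пусть` for "let", `печать` for "print"), the tuple-based AST, and the tiny stack VM are all made up for illustration. It relies on Python 3 `str` patterns, where `\w` already matches Unicode letters.

```python
import re

# Hypothetical keyword table: source-language words mapped to token kinds.
KEYWORDS = {"пусть": "LET", "печать": "PRINT"}
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")

def lex(line):
    """Split one line into (kind, value) tokens; \\w covers non-ASCII letters."""
    tokens = []
    for number, word, symbol in TOKEN_RE.findall(line):
        if number:
            tokens.append(("NUM", int(number)))
        elif word:
            tokens.append((KEYWORDS.get(word, "NAME"), word))
        elif symbol.strip():  # skip stray whitespace matched by '.'
            tokens.append(("OP", symbol))
    return tokens

def parse_atom(tokens, pos):
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of line")
    kind, value = tokens[pos]
    if kind == "NUM":
        return ("num", value), pos + 1
    if kind == "NAME":
        return ("var", value), pos + 1
    raise SyntaxError(f"unexpected token {value!r}")

def parse_expr(tokens, pos):
    """expr := atom (('+' | '-') atom)* ; returns (ast, next_pos)."""
    node, pos = parse_atom(tokens, pos)
    while pos < len(tokens) and tokens[pos] in (("OP", "+"), ("OP", "-")):
        op = tokens[pos][1]
        right, pos = parse_atom(tokens, pos + 1)
        node = ("binop", op, node, right)
    return node, pos

def parse_statement(tokens):
    """Build an AST for 'пусть NAME = expr' or 'печать expr'."""
    kind = tokens[0][0]
    if kind == "LET":
        if len(tokens) < 3 or tokens[1][0] != "NAME" or tokens[2] != ("OP", "="):
            raise SyntaxError("expected: пусть <name> = <expr>")
        expr, pos = parse_expr(tokens, 3)
        node = ("let", tokens[1][1], expr)
    elif kind == "PRINT":
        expr, pos = parse_expr(tokens, 1)
        node = ("print", expr)
    else:
        raise SyntaxError(f"unknown statement starting with {tokens[0][1]!r}")
    if pos != len(tokens):
        raise SyntaxError("trailing tokens after statement")
    return node

def compile_node(node, code):
    """Append stack-machine instructions for an AST node to `code`."""
    tag = node[0]
    if tag == "num":
        code.append(("PUSH", node[1]))
    elif tag == "var":
        code.append(("LOAD", node[1]))
    elif tag == "binop":
        compile_node(node[2], code)
        compile_node(node[3], code)
        code.append(("ADD" if node[1] == "+" else "SUB", None))
    elif tag == "let":
        compile_node(node[2], code)
        code.append(("STORE", node[1]))
    elif tag == "print":
        compile_node(node[1], code)
        code.append(("PRINT", None))
    return code

def run(code):
    """Execute the instruction stream; returns the list of printed values."""
    stack, env, output = [], {}, []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        elif op == "LOAD":
            stack.append(env[arg])
        elif op == "STORE":
            env[arg] = stack.pop()
        elif op in ("ADD", "SUB"):
            right, left = stack.pop(), stack.pop()
            stack.append(left + right if op == "ADD" else left - right)
        elif op == "PRINT":
            output.append(stack.pop())
    return output

def execute(source):
    code = []
    for line in source.splitlines():
        tokens = lex(line)
        if tokens:
            compile_node(parse_statement(tokens), code)
    return run(code)

print(execute("пусть икс = 2 + 3\nпечать икс - 1"))
```

The only part that is specific to the non-English script is the keyword table and the Unicode-aware lexer; the parser, AST, and VM are the same as they would be for an English-keyword language.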

Hassan Syed 2010-02-24 11:33:20

+2 A:

check out "Principles of Compiler Design"

stillstanding 2010-02-24 11:35:16

Already provided the link to that :D

Hassan Syed 2010-02-24 11:36:49

+1 A:

You use a character encoding capable of representing characters beyond ASCII, such as UTF-8. UTF-8 is variable-width, using one to four bytes per code point; UTF-16 uses one or two 16-bit units (two or four bytes), and UTF-32 always uses four bytes. The problem that arises with the multi-byte units of UTF-16 and UTF-32 is byte order (endianness): different machines may store the bytes of a unit in different orders. This is unrelated to bidi, the bidirectional algorithm, which governs the display order of mixed right-to-left and left-to-right text rather than how bytes are stored. One solution is to put a byte order mark (BOM, U+FEFF) at the start of the data so that the reader can detect the order. The other is to name the order in the encoding label itself: UTF-16BE, for big endian, stores the most significant byte of each unit first and needs no BOM. The opposite is UTF-16LE, or little endian.
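
As a rough illustration of the encodings named above, the sketch below serializes one Cyrillic letter under UTF-8, UTF-16BE, UTF-16LE, and UTF-32BE, and then shows a byte order mark letting a decoder pick the byte order. The helper names `encodings_of` and `decode_with_bom` are invented for this example; the codec names and `codecs.BOM_UTF16_*` constants are from the Python standard library.

```python
import codecs

def encodings_of(text):
    """Return the hex bytes of `text` under several Unicode encodings."""
    names = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be"]
    return {name: text.encode(name).hex() for name in names}

def decode_with_bom(data):
    """Decode UTF-16 data whose byte order is announced by a leading BOM."""
    if not data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        raise ValueError("no UTF-16 byte order mark found")
    # The generic "utf-16" codec reads the BOM, picks the order, and drops the BOM.
    return data.decode("utf-16")

sample = "я"  # U+044F, CYRILLIC SMALL LETTER YA
for name, hexbytes in encodings_of(sample).items():
    print(f"{name:10} {hexbytes}")

# The same text in both byte orders, each prefixed with the matching BOM.
big = codecs.BOM_UTF16_BE + sample.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")
print(decode_with_bom(big) == decode_with_bom(little) == sample)
```

The two UTF-16 byte strings differ only in the order of the bytes inside each 16-bit unit; the order of the characters in the text is not affected.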

There is also the UCS, the Universal Character Set defined by ISO/IEC 10646. The term is still used, but the older fixed-width UCS-2 encoding is obsolete because it cannot represent characters outside the Basic Multilingual Plane, which take more than one 16-bit unit in UTF-16. For information about the differences between UCS and Unicode please read this: http://en.wikipedia.org/wiki/Universal_Character_Set#Differences_between_ISO_10646_and_Unicode
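
To make the point about characters that need more than one unit concrete, here is a small sketch comparing a character inside the Basic Multilingual Plane with one outside it. The helpers `utf16_units` and `describe` are hypothetical names for this example; the Gothic letter U+10348 is just a convenient character above U+FFFF.

```python
def utf16_units(ch):
    """Return the 16-bit code units that UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def describe(ch):
    """Summarize how many bytes/units a single character occupies."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_units": [f"{unit:04X}" for unit in utf16_units(ch)],
    }

print(describe("я"))           # inside the BMP: one 16-bit unit
print(describe("\U00010348"))  # outside the BMP: a surrogate pair
```

A lexer that walks UTF-16 units (or raw bytes) instead of code points has to treat such a pair as one character, which is one reason to build on a library or language that handles this for you.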

Some examples of specifications that mandate a particular encoding:
IRI - RFC 3987 - http://www.ietf.org/rfc/rfc3987.txt - mandates UTF-8 encoding
Mail Markup Language - http://mailmarkup.org/ - mandates UTF-16BE encoding

2010-02-24 11:38:53
