Hi,

I need to detect the language of a Unicode WideString. I have tried the iMultiLang2 interface, and it works properly when the locale has a codepage. Some locales/languages have no codepage and are mapped to Unicode only. How can I get the LCID for those? Georgian, Hindi and many other languages have no codepage and are Unicode-only collations.

I am using Delphi7 Enterprise.

Would really appreciate any help

Regards

+1  A: 

I don't usually give this kind of answer, but: you don't! This is the kind of task you cannot really solve. There are too many cases where the language cannot be determined.

BTW, the only place where I have seen a feature like this is Google Translator, and it works only if the text is quite long, and even then there is no guarantee.

Sorin Sbarnea
Thanks and looking at the reply below, you are right. Btw, I think Bing also does that.
Mode
+1  A: 

The question is based on a misunderstanding of Unicode. Unicode is a way of representing writing systems, not languages. Imagine a Unicode string consisting of the three code points U+0073, U+0069, and U+006E, that is, "sin". Is it English? Is it the Spanish word for "without"? Is it the reflexive possessive ("his/her own") in any of several Scandinavian languages? Who knows.

You mention Georgian and Hindi. Georgian script (ქართული დამწერლობა) can be used to represent Georgian, of course, but also Mingrelian, Svan, and some other even rarer languages. There is no "Hindi" script, any more than there are "English" letters. As English is written in Latin letters that we inherited from our Latin-speaking forebears, Hindi is written in Devanāgarī (देवनागरी), a beautiful script that is also used for ancient Sanskrit and modern Marathi and Nepali and dozens of other languages. And don't get me started on Chinese.

If you are pressed and have to accept a hackish near-solution, you can make approximations: "since this character is from the Devanāgarī range (U+0900–U+097F) or the Georgian ranges (U+10A0–U+10FC and U+2D00–U+2D25), I'll assume it is probably Hindi or probably Georgian." Such a method would be error-prone and vague, but you could start with the range table here.
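To make the range idea concrete, here is a minimal sketch in Python (chosen only for brevity; the same table lookup ports directly to a Delphi WideString loop). The names `SCRIPT_RANGES`, `script_of`, and `guess_script` are made up for this illustration, and the table holds only the ranges quoted above. Note that it reports a script, not a language, so mapping the result to an LCID is still a guess.

```python
# Illustrative table only: (first code point, last code point, script name).
# The ranges are the ones quoted in this answer, not a complete Unicode table.
SCRIPT_RANGES = [
    (0x0900, 0x097F, "Devanagari"),
    (0x10A0, 0x10FC, "Georgian"),
    (0x2D00, 0x2D25, "Georgian"),
]


def script_of(ch):
    """Return the script name for one character, or None if it is not in the table."""
    cp = ord(ch)
    for first, last, name in SCRIPT_RANGES:
        if first <= cp <= last:
            return name
    return None


def guess_script(text):
    """Return the script that covers the most characters in text, or None."""
    counts = {}
    for ch in text:
        name = script_of(ch)
        if name is not None:
            counts[name] = counts.get(name, 0) + 1
    if not counts:
        return None
    # Pick the script with the highest character count.
    return max(counts, key=counts.get)


if __name__ == "__main__":
    for sample in ("ქართული", "देवनागरी", "sin"):
        print(repr(sample), "->", guess_script(sample))
```

A string like "sin" comes back as `None` because Latin is not in the table, and even adding a Latin range would not tell you whether it is English or Spanish. That is exactly the limitation described above.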

Malvolio
Thank you. That was what I had in mind as well. I was wondering if there was an alternative, since that method is very error-prone.
Mode