unicode

Data structure for string indices?

I'm looking for a data structure for string (UTF-8) indices that is highly optimized for range queries and space usage. Thanks! Elaboration: I have a list of arbitrary-length UTF-8 strings that I need to index. I will use only range queries. Example: I have the strings apple, ape, black, cool, dark. Query will be something like this -...
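
As a baseline for the kind of index described above, here is a minimal sketch of a sorted array with binary search. It is not necessarily the most space-efficient option, and because the query format in the question is cut off, the half-open range `[low, high)` and the names `build_index` and `range_query` are assumptions made for illustration. Sorting the UTF-8 bytes gives the same order as sorting by code point.

```python
import bisect

def build_index(strings):
    """Return the strings as UTF-8 byte strings in sorted order."""
    return sorted(s.encode("utf-8") for s in strings)

def range_query(index, low, high):
    """Return every indexed string s with low <= s < high (assumed query shape)."""
    start = bisect.bisect_left(index, low.encode("utf-8"))
    stop = bisect.bisect_left(index, high.encode("utf-8"))
    return [b.decode("utf-8") for b in index[start:stop]]

index = build_index(["apple", "ape", "black", "cool", "dark"])
print(range_query(index, "a", "c"))  # ['ape', 'apple', 'black']
```

Each query costs two binary searches plus the size of the result; a trie or a compressed sorted structure may use less space when the strings share long prefixes.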

How do I filter chat messages by normalizing letter forms?

I'm filtering chat messages on a chat system where constraining strings to Latin-1 English is desirable. Users tend to use creative typing, e.g. ßòógīě§ instead of Boogies. In Java, there are Unicode normalization methods which can remove diacritic marks, but I'm more interested in methods of normalizing the shapes of the letters t...
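
A rough sketch of the two-step idea the question hints at: Unicode normalization removes the diacritics, and a separate table maps lookalike shapes to plain letters. The `LOOKALIKES` table and the function name `fold_letter_forms` are hypothetical; the two table entries cover only the example from the question, and a real filter would need a much larger mapping.

```python
import unicodedata

# Hypothetical lookalike table: these two entries are illustrative only.
LOOKALIKES = str.maketrans({"ß": "B", "§": "s"})

def fold_letter_forms(text: str) -> str:
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Map the remaining shape lookalikes through the (assumed) table.
    return stripped.translate(LOOKALIKES)

print(fold_letter_forms("ßòógīě§"))  # Boogies
```

Normalization alone leaves ß and § untouched, which is why the shape mapping has to be a separate step.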

Unicode File Writing and Reading in C++?

Can anyone provide a simple example of reading and writing a Unicode character in a Unicode file? ...

UTF-8 & Unicode, what's with 0xC0 and 0x80?

I've been reading about Unicode and UTF-8 in the last couple of days and I often come across a bitwise comparison similar to this: int strlen_utf8(char *s) { int i = 0, j = 0; while (s[i]) { if ((s[i] & 0xc0) != 0x80) j++; i++; } return j; } Can someone clarify the comparison with 0xc0 and checking if it's the mos...
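
The comparison in the C snippet can be illustrated with a short Python sketch (the function name `utf8_length` is made up for this example). In UTF-8, every continuation byte has the bit pattern 10xxxxxx. Masking with 0xC0 (binary 11000000) keeps only the top two bits, and 0x80 is binary 10000000, so `(b & 0xC0) != 0x80` is true for every byte that starts a character and false for every continuation byte.

```python
def utf8_length(data: bytes) -> int:
    """Count code points by counting the bytes that are not continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

encoded = "héllo €".encode("utf-8")
print(len(encoded))          # 10 bytes: é takes 2 bytes, € takes 3
print(utf8_length(encoded))  # 7 code points
```

This assumes the input is valid UTF-8; it counts code points, not user-perceived characters.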

Handling UTF8 strings in C# web service

Hi, I created a simple web service client using the C# tool wsdl.exe. It works fine except for one thing. It seems that UTF8 strings returned in the response are converted to ASCII. Using SoapUI I can see normal UTF8 encoded strings being returned by the web service. But when I debug the response I received, the UTF8 content seems to have al...

ICU Probe All Currency Symbols

Is there a way to probe the ICU library for all UChars representing currency symbols supported by the library? My current solution is iterating through all locales and for each locale, doing something like this: const DecimalFormatSymbols *formatSymbols = formatter->getDecimalFormatSymbols(); UnicodeString currencySymbol = formatSymbo...

How to render a standalone Unicode character (Arabic) as it would look if it was being rendered within a word?

In written Arabic, characters look different depending on where they stand in a word. For example, the letter ta might look like this: ـثـ inside a word but look like this: ﺙ if it stands by itself. I have some Arabic text, for example: string word = والتفويض ; When I render word as a whole word it renders correctly. Now, I want to ...

How to prefix 'N' to the parameters of a stored procedure for Unicode strings

How can I prefix 'N' to the parameters of a stored procedure for Unicode strings in C#? I am also using the same procedure for non-Unicode strings, so I need to add the prefix only for the Unicode ones. Kindly help. ...

Python Unicode CSV export (using Django)

Hi All, I'm using a Django app to export a string to a CSV file. The string is a message that was submitted through a front-end form. However, I've been getting this error when a Unicode single quote is provided in the input. UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 200: ordinal not in rang...
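
A minimal sketch of one way to avoid this class of error, assuming Python 3 (the question's `u'\u2019'` notation suggests Python 2, where the details differ). The idea is to build the CSV as text and encode it explicitly as UTF-8 rather than relying on an implicit ASCII conversion. The helper name `rows_to_csv_bytes` is hypothetical and no Django API is used here.

```python
import csv
import io

def rows_to_csv_bytes(rows):
    """Serialize rows to CSV text, then encode the result explicitly as UTF-8."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")

data = rows_to_csv_bytes([["message"], ["It\u2019s working"]])
print(data)
```

The resulting bytes could then be written to a file opened in binary mode or handed to whatever response object the framework provides.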

OSX Application or Web App for converting text to plain text (unicode)

I am looking for ways to quickly convert blocks of text created in Word, etc. into plain text (i.e. turning right and left quotation marks into "plain text" quotation marks) for quickly transferring content to code with as few headaches as possible. I came across this: http://www.softpedia.com/get/Office-tools/Other-Office-Tools/Kei...

Best Practices for Python UnicodeDecodeError

I use the Pylons framework and Mako templates for a web-based application. I hadn't really looked deeply into the way Python handles Unicode strings. I had a tense moment when I saw my site crash while a page was being rendered, and later I learned that it was related to a Unicode decode error http://wiki.python.org/moin/UnicodeDecodeError ...

Unicode based programming language

This is a curiosity more than anything: Does there exist a programming language that allows variables, functions, and classes to be named using Unicode rather than ASCII (except, of course, for special characters such as '+')? Do any popular languages have support for this? Also, related to this, if any common language supports U...

PHP equivalent of Java's Character.getNumericValue(char c)?

Is there a PHP equivalent of Java's Character.getNumericValue(char c)? ...

UnicodeEncodeError: 'latin-1' codec can't encode character

What could be causing this error when I try to insert a foreign character into the database? >>UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in position 0: ordinal not in range(256) And how do I resolve it? Thanks! ...
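
The error itself can be reproduced in a few lines, shown here as a sketch in Python 3 syntax. U+201C (a left curly quotation mark) has no representation in Latin-1, which covers only code points 0 to 255, so encoding fails; UTF-8 can encode it. Where the Latin-1 encoding is being applied in the asker's setup (for example, a database connection setting) is not stated in the question and is not assumed here.

```python
text = "\u201cquoted\u201d"  # curly quotes, as in the error message

try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError as exc:
    latin1_ok = False
    print(exc)

utf8_bytes = text.encode("utf-8")  # UTF-8 can represent every code point
print(utf8_bytes)
```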

Unicode support in Android NDK

I have a large C/C++ library that I need to use as part of an Android NDK project. This library needs to be able to intelligently process UTF8 strings (for example, conversion to lowercase/uppercase). The library has conditional compilation to punt to an OS API to do the conversion, but there don't seem to be any Android APIs for UTF8....

Win32 Edit Control - GetText does not return final \n

I have a Win32 Edit window (i.e. CreateWindow with classname "EDIT"). Every time I add a line to the control I append '\r\n' (i.e. a new line). However, when I call WM_GETTEXT to get the text of the EDIT window, it is always missing the last '\n'. If I add 1 to the result of WM_GETTEXTLENGTH, it returns the correct character count, thus ...

Comment Illegal Unicode Sequences

I was once working on a Java application dealing with Unicode processing. As usual, I would write some code and test it, then comment out the working code and add some new lines, and this process went on until I found the solution. The exact issue I had was commenting out illegal Unicode strings. Some Unicode wasn't working ...

Characters with accents keep appearing as "�"

I'm using a simple PHP script to scour an RSS feed, store the scoured data to a temporary cache flat file, then display it along the side of my website. However, all the characters with accents appear as "�". What is causing this and how can I fix it? Thank you! ...
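
One common cause of this symptom is bytes in one encoding being interpreted as another. The sketch below uses Python rather than PHP and assumes, purely for illustration, that the feed is ISO-8859-1 while the page is treated as UTF-8; the question does not state either encoding.

```python
raw = "café".encode("latin-1")  # b'caf\xe9': é is the single byte 0xE9

# Read as UTF-8, the lone 0xE9 byte is invalid, so it becomes U+FFFD (the "�" mark).
misread = raw.decode("utf-8", errors="replace")
print(misread)

# Decoding with the encoding the bytes were actually written in recovers the text.
correct = raw.decode("latin-1")
print(correct)
```

The usual fix is to find out which encoding the source really uses and either convert it once or declare it consistently everywhere the text is read and displayed.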

How does VARCHAR/CHAR manage to store/render multinational symbols in SQL Server?

I have read that varchar (char) is used for storing ASCII characters with 1 byte per character, while nvarchar (nchar) uses Unicode with 2 bytes. But which ASCII? In SSMS 2008 R2 DECLARE @temp VARCHAR(3); --CHAR(3) SET @temp = 'ЮЯç'; --Cyrillic + Portuguese-specific letters select @temp,datalength(@temp) -- results in --...

Strings and character encoding in C++

I read a few posts about best practices for strings and character encoding in C++, but I am struggling a bit with finding a general purpose approach that seems to me reasonably simple and correct. Could I ask for comments on the following? I'm inclined to use UTF-8 and UTF-32, and to define something like: typedef std::string string8;...