unicode

php + vim - बंगलौर (Bangalore) has a break before the last character र

I used http://translate.google.com/#en|hi|Bangalore to get the Hindi for Bangalore, बंगलौर. But when I pasted it into vim, there was a break before the last character र. I am using preg_replace with the regex pattern /[^\p{L}\p{Nd}\p{Mn}_]/u to match words, but it treats the last character as a separate word. This is my input...
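One possible cause, offered as an assumption because the excerpt is cut off before the input: the vowel sign ौ (U+094C) in बंगलौर has general category Mc (spacing combining mark), which \p{Mn} does not cover, so the pattern would replace it and leave र on its own. A minimal Python sketch (Python rather than PHP, purely for illustration) that lists the category of each code point with the standard unicodedata module:

```python
import unicodedata

# बंगलौर spelled out as code points (assumed to match the pasted word).
word = "\u092c\u0902\u0917\u0932\u094c\u0930"

# General category of every code point in the word.
categories = [unicodedata.category(ch) for ch in word]

for ch, cat in zip(word, categories):
    print(f"U+{ord(ch):04X} {cat} {unicodedata.name(ch)}")
```

If that is indeed the cause, widening the class from \p{Mn} to \p{M} (all marks) would be one way to keep the word whole.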

php mb_strtolower giving invalid character

The following code is causing a problem. var_dump($name); $name = mb_strtolower($name); var_dump($name); Output is string(32) "brazil and technology, São Paulo" string(32) "brazil and technology, s�o paulo" Can someone please explain why I am getting an invalid character for ã? What am I doing wrong here? mb_detect_encoding($nam...

Qt - Converting QString to Unicode QByteArray

I have a client-server application where the client will be in Qt (Ubuntu) and the server will be in C#. The Qt client will send the strings in UTF-16 encoded format. I have used the QTextCodec class to convert to UTF-16, but whenever the conversion happens, the result is padded with some extra characters. For example "<bind endpoint='2_3'/>" will be chang...

Length of a unicode string

In my Rails (2.3, Ruby 1.8.7) application, I need to truncate a string to a certain length. The string is Unicode, and when running tests in the console, such as 'א'.length, I realized that double the length is returned. I would like an encoding-agnostic length, so that the same truncation would be done for a Unicode string or a latin1 encoded...

Python regex with unicode characters bug?

Long story short: >>> re.compile(r"\w*").match(u"Français") <_sre.SRE_Match object at 0x1004246b0> >>> re.compile(r"^\w*$").match(u"Français") >>> re.compile(r"^\w*$").match(u"Franais") <_sre.SRE_Match object at 0x100424780> >>> Why doesn't it match the string with unicode characters with ^ and $ in the regex? As far as I understand ...
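For comparison, a sketch in Python 3 (an assumption; the session above looks like Python 2, where \w matches only ASCII letters, digits and underscore unless a flag such as re.UNICODE is set). In Python 3, str patterns are Unicode-aware by default, and the re.ASCII flag reproduces the behaviour shown in the question:

```python
import re

text = "Fran\u00e7ais"  # "Français" with a precomposed ç

# Python 3 str patterns: \w matches Unicode letters by default.
unicode_match = re.match(r"^\w*$", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the anchored pattern fails at "ç".
ascii_match = re.match(r"^\w*$", text, re.ASCII)

# Without anchors the ASCII-only pattern still matches, but only a prefix.
prefix_match = re.match(r"\w*", text, re.ASCII)
```

This suggests why the unanchored pattern in the question appears to succeed: it can match just the leading ASCII letters and stop, whereas the anchored pattern has to consume the whole string.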

File.listFiles() mangles unicode names with JDK 6 (Unicode Normalization issues)

I'm struggling with a strange file name encoding issue when listing directory contents in Java 6 on both OS X and Linux: the File.listFiles() and related methods seem to return file names in a different encoding than the rest of the system. Note that it is not merely the display of these file names that is causing me problems. I'm mainl...
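The title points at Unicode normalization. As a sketch of that idea only (in Python rather than Java, and assuming the mismatch is between composed and decomposed forms, which the excerpt does not confirm), the same visible name can be two different code point sequences that compare unequal until both are normalized:

```python
import unicodedata

composed = "caf\u00e9"      # é as a single code point (NFC form)
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT (NFD form)

def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)           # raw comparison of code points
print(same_name(composed, decomposed))  # comparison after normalization
```

The names `composed`, `decomposed` and `same_name` are hypothetical and exist only for this illustration.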

How to use Unicode symbols on webpages?

I'm using some Unicode symbols on a webpage I'm making. For purposes of this example, let's say it's this guy: '☺'. As I understand it, under the correct implementation of CSS, you can set any font you want, and if it runs into a character that is not present in that font, it will start falling back through the font-family backup choice...

How can my program switch from ASCII to Unicode?

I want to write a program in C++ that should work on Unix and Windows. This program should be able to work in both Unicode and non-Unicode environments. Its behavior should depend only on the environment settings. One of the nice features that I want to have is to manipulate file names read from directories. These can be unicode... or...

How to handle unicode character sequences in C/C++ ?

What are the more portable and clean ways to handle unicode character sequences in C and C++ ? Moreover, how to: -Read unicode strings -Convert unicode strings to ASCII to save some bytes (if the user only inputs ASCII) -Print unicode strings Should I use the environment too ? I've read about LC_CTYPE for example, should I care abou...

Enabling Unicode Support in Solr

I want to enable Unicode in Solr. Updating the index does not give me an error, but as soon as I try to search for some Chinese text, I get an error. I have added the following line to my schema. <filter class="solr.CollationKeyFilterFactory" language="" strength="primary"/> and now I am getting the following exception. org.apache.so...

Is it possible to show unicode characters in a HTML input type=submit value?

A customer has a designer mockup of a form button that shows a right facing triangle character after the main text, but I can't seem to get this showing. The offending markup is; <input type="submit" value="Add to basket &#9654;" /> This should look like 'Add to basket ▶' (if it renders in your browser). Is this possible or am I doin...

Unicode error with Solr, any idea?

I have an index that stores textual descriptions. I build XML objects and pass them on to Solr, where the indexing is done. Now when I search for Chinese text, I get back question marks for the indexed text whose XML was fine. Any idea where the problem could be? Thanks ...

Complete understanding of encodings and character sets

Can anybody tell me where to find a clear introduction to character sets, encodings and everything related to these things? Thanks! ...

Reading unicode character in java

I'm a bit new to Java. When I assign a unicode string to String str = "\u0142o\u017Cy\u0142"; System.out.println(str); final StringBuilder stringBuilder = new StringBuilder(); InputStream inStream = new FileInputStream("C:/a.txt"); final InputStreamReader streamReader = new InputStreamReader(inStream, "UTF-8"); final Buffe...

Degrading Unicode characters for web browsers with missing fonts

I am using the Unicode 'CHECK MARK' (U+2713) in a html document. I find that it renders OK in most browsers, but occasionally I encounter someone with a missing font on their PC. Are there any HTML / JS tricks to specify an alternative display character (or an image) if the font is missing? ...

MySQL unicode literals

I want to insert a record into MySQL that has a non-ASCII Unicode character, but I'm on a terminal that doesn't let me easily type non-ASCII characters. How do I escape a Unicode literal in MySQL's SQL syntax? ...

Can I set CharSet for every page load? (Classic ASP)

I have made some changes to a Classic ASP application which breaks foreign letters unless "Response.Charset = "utf-8"" is set in every page... And it's a lot of pages... Could I force the Charset to utf-8 for every page without having to set it in each page? ...

How to know the preferred display width of Unicode characters?

In different encodings of Unicode, for example UTF-16LE or UTF-8, a character may occupy 2 or 3 bytes. Many Unicode applications don't take the display width of Unicode characters into account and treat them all like Latin letters. For example, an 80-column line of text should contain 40 Chinese characters or 80 Latin letters, but most ap...
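The distinction between byte count and column count can be sketched with the East Asian Width property from the standard unicodedata module. This is an approximation under stated assumptions, not a full terminal-width implementation: wide (W) and fullwidth (F) characters count as two columns, characters with a non-zero combining class count as zero, and everything else counts as one. The function name `display_width` is hypothetical.

```python
import unicodedata

def display_width(text: str) -> int:
    """Approximate column count of text in a monospaced 80-column layout."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn on the previous character
        # Wide and fullwidth characters take two columns, others one.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

print(display_width("abc"))            # three Latin letters
print(display_width("\u4e2d\u6587"))   # two CJK ideographs (中文)
```

Control characters, ambiguous-width characters and emoji sequences would need extra handling that this sketch leaves out.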

What is the point of unicode escape sequences in identifier names in JavaScript?

JavaScript allows for having unicode escape sequences in identifier names... for example: var \u0160imeVidas = "blah"; The above variable starts with the (Croatian) letter Š, so that the complete name of the variable is "ŠimeVidas". Now, this is neat, but what's the point? Is there any scenario where this feature may be of any use? -...

[Actionscript 3] Get the byte length of a string

Is there an easy way to get the byte length of a string in AS3? String.length works in many cases, but breaks if it encounters multibyte unicode characters. (In this particular case, I need to know this so I can preface messages sent across a TCP socket with the message length. This is in standard netstring format, e.g. "length:message,")...
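The underlying idea is to encode the string first and measure the resulting bytes, not the characters. A minimal sketch in Python (not ActionScript 3), assuming the message goes over the wire as UTF-8; the function name `netstring` is hypothetical:

```python
def netstring(message: str) -> bytes:
    """Frame message as 'length:payload,' where length counts UTF-8 bytes."""
    payload = message.encode("utf-8")
    # The length prefix is the byte count of the payload, written in ASCII digits.
    return str(len(payload)).encode("ascii") + b":" + payload + b","

print(netstring("hello"))
print(netstring("\u00e9"))  # one character, two bytes in UTF-8
```

Using the character count as the prefix would under-report the length for any non-ASCII message, which matches the breakage described above.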