unicode

Unicode URL decoding

The usual method of URL-encoding a unicode character is to split it into 2 %HH codes. (\u4161 => %41%61) But, how is unicode distinguished when decoding? How do you know that %41%61 is \u4161 vs. \x41\x61 ("Aa")? Are 8-bit characters, that require encoding, preceded by %00? Or, is the point that unicode characters are supposed to be l...
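As a hedged illustration of the decoding question: a widely used convention (the one Python's `urllib.parse` follows by default) is to encode the character as UTF-8 first and then percent-encode each resulting byte, rather than splitting the code point into two `%HH` codes. Under that assumption `%41%61` always means "Aa", and U+4161 is sent as three escaped bytes. A minimal sketch:

```python
from urllib.parse import quote, unquote

# Assumption: the URL uses UTF-8 percent-encoding (urllib.parse's default).
ch = "\u4161"

# The character becomes its UTF-8 bytes (E4 85 A1), each written as %HH.
encoded = quote(ch)
print(encoded)             # %E4%85%A1

# Decoding reverses the process: collect the bytes, then decode as UTF-8.
print(unquote(encoded) == ch)

# %41%61 is therefore unambiguous: two ASCII bytes, "A" and "a".
print(unquote("%41%61"))   # Aa
```

With this scheme no `%00` prefix is needed, because the UTF-8 byte values themselves mark multi-byte sequences.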

Open an ANSI file and Save as a Unicode file using Delphi

For some reason, the *.UDL files on many of my client systems have lately stopped working: they were originally saved as ANSI files, which are no longer compatible with the expected Unicode file format. The end result is an error dialog which states "the file is not a valid compound file". What is the easiest way to programmatically op...

Exporting MSAccess Tables as Unicode with Tilde delimiter

I want to export the contents of several tables from MSAccess2003. The tables contain unicode Japanese characters. I want to store them as tilde delimited text files. I can do this manually using File/Export and, in the 'Advanced' dialog, selecting tilde as the Field Delimiter and Unicode as the Code Page. I can store this as an Export...

Switching from std::string to std::wstring for embedded applications?

Up until now I have been using std::string in my C++ applications for embedded systems (routers, switches, telco gear, etc.). For the next project, I am considering switching from std::string to std::wstring for Unicode support. This would, for example, allow end-users to use Chinese characters in the command line interface (CLI). What ...

UTF-8 in Windows

How do I set the code page to UTF-8 in a C Windows program? I have a third party library that uses fopen to open files. I can use wcstombs to convert my Unicode filenames to the current code page; however, if the user has a filename with a character outside the code page then this breaks. Ideally I would just call _setmbcp(65001...

Chinese Characters displaying in IE7+

The Problem: Chinese characters aren't displaying correctly in IE7+. They are displaying in Firefox 3, Chrome, Opera 9.5, and IE6. Example: Transportation Scroll down to the footer on the page, click on "Translate This page" and the second option in the select box should be the Chinese characters. ...

Avoiding code change with Microsoft SQLServer and Unicode

How can you get MSSQL server to accept Unicode data by default into a VARCHAR or NVARCHAR column? I know that you can do it by placing an N in front of the string to be placed in the field, but to be quite honest this seems a bit archaic in 2008, particularly when using SQL Server 2005. ...

How to generate pdf files _with_ utf-8 multibyte characters using Zend Framework

Hello, I've got a "little" problem with the Zend Framework Zend_Pdf class. Multibyte characters are stripped from generated pdf files. E.g. when I write aąbcčdeę it becomes abcd with Lithuanian letters stripped. I'm not sure if it's specifically a Zend_Pdf problem or a PHP problem in general. Source text is encoded in utf-8, as well as the php source...

How do I convert a file's format from Unicode to ASCII using Python?

I use a 3rd party tool that outputs a file in Unicode format. However, I prefer it to be in ASCII. The tool does not have settings to change the file format. What is the best way to convert the entire file format using Python? ...
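A minimal sketch of one way to do such a conversion in Python. The excerpt does not say which Unicode encoding the tool writes, so `utf-16` is an assumption here (it is what Windows tools often mean by "Unicode"); `convert_to_ascii` and its parameters are hypothetical names chosen for illustration. Characters with no ASCII equivalent are replaced with `?`, so the conversion is lossy.

```python
def convert_to_ascii(src_path, dst_path, src_encoding="utf-16", errors="replace"):
    """Re-encode a text file as ASCII.

    src_encoding is an assumption; change it to match the tool's real output.
    errors="replace" writes "?" for characters that ASCII cannot represent.
    """
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="ascii", errors=errors, newline="") as dst:
        for line in src:
            dst.write(line)
```

Passing `errors="strict"` instead makes the function raise `UnicodeEncodeError` on the first non-ASCII character, which may be preferable when silent data loss is not acceptable.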

How do I convert Word smart quotes and em dashes in a string?

I have a form with a textarea. Users enter a block of text which is stored in a database. Occasionally a user will paste text from Word containing smart quotes or emdashes. Those characters appear in the database as: –, ’, “ ,†What function should I call on the input string to convert smart quotes to regular quotes and emdashes...
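The sequences quoted above are what UTF-8 encoded punctuation looks like when its bytes are read as Windows-1252, so the underlying cause is likely an encoding mismatch between the form and the database. Independently of that, replacing the typographic characters themselves is a simple mapping. The excerpt does not name a language, so the sketch below uses Python purely for illustration; `normalize_punctuation` and the choice of replacements (for example `--` for an em dash) are assumptions.

```python
# Code points of common "smart" punctuation mapped to ASCII stand-ins.
# The replacement choices are illustrative, not a standard.
SMART_PUNCTUATION = {
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
    0x2013: "-",   # en dash
    0x2014: "--",  # em dash
}

def normalize_punctuation(text):
    """Return text with smart quotes and dashes replaced by ASCII."""
    return text.translate(SMART_PUNCTUATION)

print(normalize_punctuation("\u201cHi\u201d \u2013 it\u2019s"))
```

This only works on correctly decoded text; if the input already contains the mis-decoded sequences, the encoding mismatch has to be fixed first.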

[C++] UTF-8 to ASCII using ICU Library

I have a std::string with UTF-8 characters in it. I want to convert the string to its closest equivalent with ASCII characters. For example: Łódź => Lodz, Assunção => Assuncao, Schloß => Schloss. Unfortunately, the ICU library is really unintuitive and I haven't found good documentation on its usage, so it would take me too much time to l...
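The question asks about ICU in C++; the sketch below does not use ICU. It swaps in a different, simpler technique in Python to show the idea: decompose each character with Unicode normalization (NFKD), drop the combining marks, and use a small hand-written table for letters that have no decomposition (such as Ł and ß). `to_ascii` and `FALLBACK` are hypothetical names, and the table covers only the characters from the examples.

```python
import unicodedata

# Letters that NFKD does not decompose; extend as needed (illustrative only).
FALLBACK = {"Ł": "L", "ł": "l", "ß": "ss"}

def to_ascii(text):
    """Approximate text with ASCII by stripping accents."""
    # Apply the manual substitutions first.
    replaced = "".join(FALLBACK.get(c, c) for c in text)
    # Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything still outside ASCII is silently dropped.
    return stripped.encode("ascii", "ignore").decode("ascii")

for word in ("Łódź", "Assunção", "Schloß"):
    print(word, "=>", to_ascii(word))
```

A full transliteration engine handles far more scripts than this table-plus-normalization approach, which is only adequate for Latin-based text.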

The encoding 'UTF-8' is not supported by the Java runtime.

Whenever I start our Apache Felix (OSGi) based application under Sun Java (build 1.6.0_10-rc2-b32 and other 1.6.x builds) I see the following message output on the console (usually under Ubuntu 8.4): Warning: The encoding 'UTF-8' is not supported by the Java runtime. I've seen this message display occasionally when running both T...

Case-insensitive UTF-8 string collation for SQLite (C/C++)

I am looking for a method to compare and sort UTF-8 strings in C++ in a case-insensitive manner to use it in a custom collation function in SQLite. The method should ideally be locale-independent. However, I won't be holding my breath: as far as I know, collation is very language-dependent, so anything that works on languages other than...
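To illustrate the shape of a custom collation, here is a hedged sketch using Python's `sqlite3` module rather than the C API the question targets. It compares strings by their Unicode case-folded form, which is locale-independent but is not a linguistically correct sort order for any particular language. The collation name `UNICODE_NOCASE` and the helper names are made up for this example.

```python
import sqlite3

def casefold_collation(a, b):
    """Three-way comparison on case-folded strings (locale-independent)."""
    ka, kb = a.casefold(), b.casefold()
    return (ka > kb) - (ka < kb)

def sorted_names(names):
    """Sort names inside SQLite using the custom collation."""
    conn = sqlite3.connect(":memory:")
    # Register the Python callable under a hypothetical collation name.
    conn.create_collation("UNICODE_NOCASE", casefold_collation)
    conn.execute("CREATE TABLE t (name TEXT)")
    conn.executemany("INSERT INTO t VALUES (?)", [(n,) for n in names])
    rows = conn.execute(
        "SELECT name FROM t ORDER BY name COLLATE UNICODE_NOCASE"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

print(sorted_names(["cherry", "Banana", "apple"]))
```

In C or C++ the same structure applies: a comparison callback is registered with SQLite and returns a negative, zero, or positive value.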

Finding the end of a substring match in .NET

I am trying to find the index of a substring in a string that matches another string under a specific culture (provided from a System.CultureInfo). For example the string "ass" matches the substring "aß" in "straße" under a German culture. I can find the index of the start of the match using culture.CompareInfo.IndexOf(value, substr...

MySQL - Illegal mix of collations (utf8_general_ci,COERCIBLE) and (latin1_swedish_ci,IMPLICIT) for operation 'UNION'

How do I fix that error once and for all? I just want to be able to do unions in MySQL. (I'm looking for a shortcut, like an option to make MySQL ignore that issue or take its best guess, not looking to change collations on 100s of tables ... at least not today) ...

Can anyone recommend a good, free javascript for punycode to Unicode conversion?

I found this the other day: http://0xcc.net/jsescape/ but the punycode conversion doesn't work if there's a dash in the middle. For instance - I need to convert the punycode NIATO-OTABD to nñiñatoñ. Any help much appreciated ...

How do I convert unicode escape sequences to unicode characters in a .NET string?

Say you've loaded a text file into a string and you'd like to convert all unicode escapes into actual unicode characters inside of the string. Example: "The following is the top half of an integral character in unicode '\u2320', and this is the lower half '\U2321'." I found an answer that works for me and it follows. ...
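The author's own answer is not included in this excerpt. As an independent, hedged sketch of the general idea (in Python, not .NET), a regular expression can find each escape and replace it with the character for that code point. Treating `\U` followed by four hex digits the same as `\u` is an assumption based on the example string; `decode_unicode_escapes` is a hypothetical name.

```python
import re

# Matches \u or \U followed by exactly four hex digits (assumption: the
# example's '\U2321' is meant to be read like '\u2321').
_ESCAPE = re.compile(r"\\[uU]([0-9a-fA-F]{4})")

def decode_unicode_escapes(text):
    """Replace \\uXXXX-style escapes with the characters they name."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

sample = r"top half '\u2320', lower half '\U2321'"
print(decode_unicode_escapes(sample))
```

This sketch converts each escape on its own, so surrogate pairs written as two consecutive escapes are not combined into a single character.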

What do these Unicode characters (codepoints) mean in this regex?

I have the following regular expression: ValidationExpression="^[\u0020\u0027\u002C\u002D\u0030-\u0039\u0041-\u005A\u005F\u0061-\u007A\u00C0-\u00FF°./]{1,256}$" I figured out most of it, as follows: u0020 : SPACE, u0027 : APOSTROPHE, u002C : COMMA, u002D : HYPHEN / MINUS, u0030-\u0039 : 0-9, u0041-\u005A : A - Z, u005F : UN...
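One way to check a list like this is to ask the Unicode character database for the official name of each code point. The sketch below uses Python's `unicodedata` module; the `RANGES` list is copied from the `\uXXXX` escapes in the expression above, and `describe_ranges` is a hypothetical helper.

```python
import unicodedata

# (start, end) code points taken from the \uXXXX escapes in the regex.
RANGES = [
    (0x0020, 0x0020), (0x0027, 0x0027), (0x002C, 0x002C), (0x002D, 0x002D),
    (0x0030, 0x0039), (0x0041, 0x005A), (0x005F, 0x005F), (0x0061, 0x007A),
    (0x00C0, 0x00FF),
]

def describe_ranges(ranges):
    """Return (start label, start name, end label, end name) per range."""
    return [
        (f"U+{start:04X}", unicodedata.name(chr(start)),
         f"U+{end:04X}", unicodedata.name(chr(end)))
        for start, end in ranges
    ]

for start, start_name, end, end_name in describe_ranges(RANGES):
    if start == end:
        print(f"{start}: {start_name}")
    else:
        print(f"{start}-{end}: {start_name} .. {end_name}")
```

The names printed are the formal Unicode names, which can differ from everyday names (for instance, the underscore is listed as LOW LINE).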

MS Office hyperlinks change code page ?!?

When you paste the following URL into IE: http://technet.microsoft.com/en-us/sysinternals/bb897434.aspx, the link on the right of the page cleanly says "Download Zoomit (77 KB)". If you paste the link into an Office document (Word, Excel, PowerPoint -- tested using Office 2003), and activate the link from the document, that same text ha...

Java, Alfresco Web Service API and Unicode NamedValues

I'm using Java for accessing Alfresco content server via its web service API for importing some content into it. Content should have some NamedValue properties set to UTF-8 (Cyrillic) strings. I keep getting a SAX parser exception: org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1b) was found in the element content ...