utf-16

Should UTF-16 be considered harmful?

I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?" Why do I ask this question? How many programmers are aware of the fact that UTF-16 is actually a variable length encoding? By this I mean that there are code points that, represented as surrogate ...
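To make the variable-length point concrete, here is a minimal Python sketch that counts how many 16-bit code units a string needs in UTF-16. The helper name `utf16_code_units` is made up for illustration and is not part of the question.

```python
def utf16_code_units(text: str) -> int:
    """Return how many 16-bit code units `text` occupies when encoded as UTF-16."""
    # Encode without a BOM so that every 2 bytes correspond to one code unit.
    return len(text.encode("utf-16-le")) // 2


bmp_char = "A"               # U+0041, inside the Basic Multilingual Plane
astral_char = "\U00010346"   # U+10346 GOTHIC LETTER FAIHU, outside the BMP

print(utf16_code_units(bmp_char))     # one code unit
print(utf16_code_units(astral_char))  # two code units (a surrogate pair)
```

A single code point can therefore take either 2 or 4 bytes, which is why indexing UTF-16 data by code unit is not the same as indexing by character.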

Bug with Python UTF-16 output and Windows line endings?

With this code: test.py import sys import codecs sys.stdout = codecs.getwriter('utf-16')(sys.stdout) print "test1" print "test2" Then I run it as: test.py > test.txt In Python 2.6 on Windows 2000, I'm finding that the newline characters are being output as the byte sequence \x0D\x0A\x00 which of course is wrong for UTF-16. Am I...

PHP UTF-16 to ASCII conversion

Consider the following string. It's encoded in UTF-16-LE and saved into a PHP variable. I failed to get either mbstring or iconv to replace the ' with a single quote. What would be a good way to sanitize it? String : Carl Sagan's Cosmic Connection ...

How can I check for the existence of UTF-16 filenames in Perl?

I have a textfile encoded in UTF-16. Each line contains a number of columns separated by tabs. For those who care, the file is a playlist TXT export from iTunes. Column #27 contains a filename. I am reading it using Perl 5.8.8 in Linux using code similar to: binmode STDIN, ":encoding(UTF-16)"; while(<>) { chomp; my @cols = s...

What is the easiest way to search and replace in text files encoded UTF-16?

I'm trying to update a series of xml files by changing names that they reference. I have a table of names that have changed, column for the current name and a column for the name to replace with. I looked for ways to script search and replace and found sed. It seemed like a good choice until I ran my first attempt. On inspecting the fi...

How can I identify different encodings without the use of a BOM?

I have a file watcher that is grabbing content from a growing file encoded with utf-16LE. The first bit of data written to it has the BOM available -- I was using this to identify the encoding against UTF-8 (which MOST of my files coming in are encoded in). I catch the BOM and re-encode to UTF-8 so my parser doesn't freak out. The proble...

Printing Astral Plane Unicode code point to console using int

Please see here for a related question. However, char goes to 0xffff (or 65535). I need to write 0xd800df46 (or 66374), Gothic letter Faihu, so casting that int to char will not work. I do the conversion ok, that is, I get the correct integer, meaning I calculate the surrogate pairs ok, but I don't know how to "render" it, convert it t...
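The surrogate pair arithmetic mentioned above can be sketched in Python as follows. The function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical helpers, and the code only shows the standard UTF-16 calculation, not the asker's own implementation.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(66374)  # U+10346 GOTHIC LETTER FAIHU
print(hex(high), hex(low))
print(from_surrogate_pair(high, low))
```

For 66374 (U+10346) this yields the pair 0xD800 and 0xDF46, which matches the 0xd800df46 value quoted in the question.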

Dummy's guide to Unicode

Could anyone give me concise definitions of Unicode, UTF7, UTF8, UTF16, UTF32 and Codepages, and how they differ from Ascii/Ansi/Windows 1252? I'm not after Wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer. ...

UTF-16 to ASCII conversion in Java

Hi, Having ignored it all this time, I am currently forcing myself to learn more about unicode in Java. There is an exercise I need to do about converting a UTF-16 string to 8-bit ASCII. Can someone please enlighten me how to do this in Java? I understand that you can't represent all possible unicode values in ASCII, so in this case...

Why does ContentResult controller in ASP.NET MVC return UTF-16 when UTF-8 specified?

I have an ActionResult that returns XML for an embedded device. The relevant code is: return Content(someString, "text/xml", Encoding.UTF8); Even though UTF-8 is specified, the resulting XML is: <?xml version="1.0" encoding="utf-16"?> The ASP.NET MVC is compiled as AnyCPU and runs on a Windows 2008 server. Why is it not returni...

Tcl for getting ASCII code for every character in a string

I need to get the ASCII character for every character in a string. Actually it's every character in a (small) file. The following first 3 lines successfully pull all a file's contents into a string (per this recipe): set fp [open "store_order_create_ddl.sql" r] set data [read $fp] close $fp I believe I am correctly discerning the ASC...

detect UTF-16 file content

Is it possible to know if a file has Unicode (16 bits per char) or 8-bit ASCII content? ...

Is it correct to write to a database which has 'NLS_CHARACTERSET' and 'NLS_NCHAR_CHARACTERSET' parameter values AL32UTF8 and UTF-8 with UTF-16 code page values?

The value of parameters 'NLS_CHARACTERSET' and 'NLS_NCHAR_CHARACTERSET' is UTF-8 for the source database from which I am reading data, and AL32UTF8 and UTF-8 for the target database to which I am writing data. I am reading data from a text file which has English, European and Asian characters, I am using UTF-16 code page to read from source flat fi...

MSXMLWriter60 doesn't output byteOrderMark for UTF-16 encoding

I'm using a variant on code seen in "How to make XMLDOMDocument include the XML Declaration?" (which can also be seen at MSDN). If I change the encoding to "UTF-16" one would think it would output as UTF-16... and it "does"... by looking at the output in a text editor; but checking it in a hex editor, the byte-order mark is missing (despi...

Using JNA to get/set application identifier

Following up on my previous question concerning the Windows 7 taskbar, I would like to diagnose why Windows isn't acknowledging that my application is independent of javaw.exe. I presently have the following JNA code to obtain the AppUserModelID: public class AppIdTest { public static void main(String[] args) { NativeLibrar...

Valid Locale Names

How do you find valid locale names? I am currently using Mac OS X. But information about other platforms would also be useful. #include <fstream> #include <iostream> int main(int argc,char* argv[]) { try { std::wifstream data; data.imbue(std::locale("en_US.UTF-16")); data.open("Plop"); } catch...

UTF-16 codecvt facet

Extending from this question about locales, and described in this question: What I really wanted to do was install a codecvt facet into the locale that understands UTF-16 files. I could write my own. But I am not a UTF expert and as such I am sure I would get it nearly correct; but it would break at the most inconvenient time. So I was ...

Java implicit conversion of int to byte

I am about to start working on something that requires reading bytes and creating strings. The bytes being read represent UTF-16 strings. So just to test things out I wanted to convert a simple byte array in UTF-16 encoding to a string. The first 2 bytes in the array must represent the endianness and so must be either 0xff 0xfe or 0xfe...
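As a rough illustration of the byte-order-mark check described above, here is a short sketch in Python rather than the question's Java. The function name `decode_utf16_with_bom` is hypothetical, and raising an error when the BOM is missing is an assumption about the desired behavior.

```python
import codecs


def decode_utf16_with_bom(data: bytes) -> str:
    """Decode UTF-16 bytes, choosing the byte order from the leading BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes 0xFF 0xFE: little-endian
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes 0xFE 0xFF: big-endian
        return data[2:].decode("utf-16-be")
    # Assumption: without a BOM the byte order is unknown, so refuse to guess.
    raise ValueError("missing UTF-16 byte order mark")


little_endian = bytes([0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00])
big_endian = bytes([0xFE, 0xFF, 0x00, 0x68, 0x00, 0x69])
print(decode_utf16_with_bom(little_endian))
print(decode_utf16_with_bom(big_endian))
```

Both sample arrays decode to the same two-letter string; only the order of the bytes within each 16-bit unit differs.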

Search or compare within a Grapheme Cluster in Korean

In my current implementation of a UISearchBarController I'm using [NSString compare:] inside the filterContentForSearchText:scope: delegate method to return relevant objects based on their name property to the results UITableView as you start typing. So far this works great in English and Korean, but what I'd like to be able to do is se...

Extracting UTF-16 encoded file from ZIP archive in Java

In the last section of the code I print what the Reader gives me. But it's just bogus; where did I go wrong? public static void read_impl(File file, String targetFile) { // Create zipfile input stream FileInputStream stream = new FileInputStream(file); ZipInputStream zipFile = new ZipInputStream(new BufferedInputStream(stream...