unicode

How would you get an array of Unicode code points from a .NET String?

BACKSTORY: I have a list of character range restrictions that I need to check a string against, but the char type in .NET is UTF-16, so some characters become wacky (surrogate) pairs instead. Thus, when enumerating all the chars in a string, I don't get the 32-bit Unicode code points, and some comparisons with high values fail....
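The surrogate-pair arithmetic behind this question can be sketched as follows. This is a Python 3 illustration of the general UTF-16 rule, not the .NET API; the helper names `utf16_units` and `code_points` are hypothetical.

```python
def utf16_units(text):
    """Return the 16-bit UTF-16 code units of text (what a .NET char sequence holds)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_points(units):
    """Combine surrogate pairs in a list of UTF-16 code units into code points."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        # A high surrogate followed by a low surrogate encodes one code point.
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            low = units[i + 1]
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP character (or an unpaired surrogate, passed through unchanged).
            result.append(unit)
            i += 1
    return result


units = utf16_units("a\U0001D11E")   # 'a' followed by U+1D11E
points = code_points(units)
```

With the code points in hand, range checks against values above U+FFFF compare as expected.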

How to guess the encoding of a file with no BOM in .NET?

I'm using the StreamReader class in .NET like this: using (StreamReader reader = new StreamReader(@"c:\somefile.html", true)) { string filetext = reader.ReadToEnd(); } This works fine when the file has a BOM. I ran into trouble with a file with no BOM: basically, I got gibberish. When I specified Encoding.Unicode it worked fine...
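One common heuristic for BOM-less files is sketched below in Python 3. It is only a guess-and-fall-back approach under stated assumptions (mostly-ASCII text, a `cp1252` fallback), not what StreamReader does internally; the function name `guess_encoding` is hypothetical.

```python
def guess_encoding(data, fallback="cp1252"):
    """Guess the encoding of BOM-less bytes; a heuristic, not a guarantee."""
    sample = data[:4096]
    pairs = len(sample) // 2
    if pairs:
        # Assumption: mostly-ASCII text stored as UTF-16 has a zero byte
        # in every other position (odd offsets for LE, even offsets for BE).
        even_zeros = sample[0::2].count(0)
        odd_zeros = sample[1::2].count(0)
        if odd_zeros > pairs / 2 and even_zeros < pairs / 10:
            return "utf-16-le"
        if even_zeros > pairs / 2 and odd_zeros < pairs / 10:
            return "utf-16-be"
    try:
        # Strict UTF-8 decoding rejects most text written in legacy code pages.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback
```

The UTF-16 check runs first because zero bytes are also valid UTF-8, so a strict UTF-8 decode alone would not catch that case.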

Java FileReader encoding issue

I tried to use java.io.FileReader to read some text files and convert them into a string, but the result is wrongly encoded and not readable at all. Here's my environment: Windows 2003 (OS encoding: CP1252), Java 5.0. My files are UTF-8 encoded or CP1252 encoded, and some of them (UTF-8 encoded files) may contain Chinese (non-La...
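The symptom described here is typical of bytes being decoded with a platform default charset instead of the file's real one. The sketch below reproduces the effect in Python 3 rather than Java; the helper `read_text` and the sample string are illustrative assumptions.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a whole text file, decoding it with an explicitly chosen encoding."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


original = "\u4e2d\u6587 text"   # two Chinese characters followed by ASCII

# Write the sample as UTF-8 bytes to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(original.encode("utf-8"))
    path = tmp.name

try:
    right = read_text(path, "utf-8")    # matches the file's real encoding
    wrong = read_text(path, "cp1252")   # simulates a CP1252 platform default
finally:
    os.remove(path)
```

Here `right` equals the original text, while `wrong` comes back as unreadable Latin characters. Passing the encoding explicitly avoids depending on the operating system's default.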

C# and UTF-16 characters

Is it possible in C# to use UTF-32 characters not in Plane 0 as a char? string s = ""; // valid char c = ''; // generates a compiler error ("Too many characters in character literal") And in s it is represented by two characters, not one. Edit: I mean, is there a character and string type with full Unicode support, UTF-32 or UTF-8 per...

Unicode, UTF, ASCII, ANSI format differences

What is the difference between the Unicode, UTF-8, UTF-7, UTF-16, UTF-32, ASCII, and ANSI encoding formats in ASP.NET? In what ways are these helpful for programmers? ...

Difference between big-endian and little-endian byte order

What is the difference between big-endian byte order and little-endian byte order? Both are related to Unicode and UTF-16; where do we use them? ...
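As a small illustration of how byte order affects UTF-16, the Python 3 sketch below serializes one 16-bit code unit both ways. The helper name `utf16_bytes` is hypothetical, and it assumes a character in the Basic Multilingual Plane.

```python
def utf16_bytes(ch, byteorder):
    """Serialize one BMP character as a 16-bit code unit in the given byte order."""
    code_unit = ord(ch)   # assumes ch is in the BMP (no surrogate pair needed)
    return code_unit.to_bytes(2, byteorder)


euro = "\u20ac"                        # EURO SIGN, code unit 0x20AC
big = utf16_bytes(euro, "big")         # most significant byte first
little = utf16_bytes(euro, "little")   # least significant byte first

# Reading bytes with the wrong byte order yields a different character.
misread = big.decode("utf-16-le")
```

The same two bytes decode to U+20AC in one order and to U+AC20 in the other, which is why UTF-16 data needs either a BOM or an agreed byte order.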

How to make Django slugify work properly with Unicode strings?

What can I do to prevent slugify filter from stripping out non-ASCII alphanumeric characters? (I'm using Django 1.0.2) cnprog.com has Chinese characters in question URLs, so I looked in their code. They are not using slugify in templates, instead they're calling this method in Question model to get permalinks def get_absolute_url(self)...
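A Unicode-preserving slug function can be sketched in a few lines of Python 3. This is not Django's own slugify and not the cnprog.com code; the name `unicode_slugify` and the exact normalization choices are assumptions for illustration.

```python
import re
import unicodedata


def unicode_slugify(value):
    """Build a URL slug that keeps non-ASCII letters and digits."""
    # NFKC folds compatibility forms (e.g. full-width letters) to canonical ones.
    value = unicodedata.normalize("NFKC", value).strip().lower()
    # In Python 3, \w matches Unicode word characters, so CJK text survives.
    value = re.sub(r"[^\w\s-]", "", value)
    # Collapse runs of whitespace and hyphens into single hyphens.
    return re.sub(r"[-\s]+", "-", value).strip("-")
```

For example, `unicode_slugify("Hello, World!")` gives `hello-world`, and Chinese text separated by a space keeps its characters with a hyphen in between.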

Need a list of languages that are supported completely by ASCII encoding.

I am writing an article on Unicode and discussing the advantages of this encoding scheme over outdated methods like ASCII. As part of my research I am looking for a reference that lists the languages that can be fully represented using only the characters supported by ASCII. Haven't had much luck tracking it down with Google and I t...

SQL Server - Grid Result Save As .CSV - How to output Text instead of UTF-16 (Unicode)

Can SQL Server Grid "Save As" be changed to write out an encoding that is Text instead of UTF-16? When I right click a Result Grid in SQL Server it allows for a Save As .CSV. Currently it saves the .CSV file encoded as UTF-16 (Unicode) but Excel does not open this format automatically (Excel prompts for a delimiter). To get around the p...

Can a PHP file name (or a dir in its full path) have UTF-8 characters?

I would like to access a PHP file whose name has UTF-8 characters in it. The file does not have a BOM in it. It just contains an echo statement that displays a few Unicode characters. Accessing the PHP page from the browser (Firefox 3.0.8, IE7) results in HTTP error 500. There are two entries in the Apache log (file is /க.php; the let...

Does Java have methods to get the various byte order marks?

I am looking for a utility method or constant in Java that will return me the bytes that correspond to the appropriate byte order mark for an encoding, but I can't seem to find one. Is there one? I really would like to do something like: byte[] bom = Charset.forName( CharEncoding.UTF8 ).getByteOrderMark(); Where CharEncoding comes fr...
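For comparison, the byte order marks themselves are fixed byte sequences. The Python 3 sketch below lists them using the constants from Python's `codecs` module; it does not answer whether Java exposes an equivalent, and the names `BOMS` and `detect_bom` are hypothetical.

```python
import codecs

# Byte order marks keyed by encoding name, taken from the codecs module.
BOMS = {
    "utf-8": codecs.BOM_UTF8,          # EF BB BF
    "utf-16-le": codecs.BOM_UTF16_LE,  # FF FE
    "utf-16-be": codecs.BOM_UTF16_BE,  # FE FF
    "utf-32-le": codecs.BOM_UTF32_LE,  # FF FE 00 00
    "utf-32-be": codecs.BOM_UTF32_BE,  # 00 00 FE FF
}


def detect_bom(data):
    """Return the encoding whose BOM starts data, or None if there is no BOM."""
    # Check longer marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
    for name, bom in sorted(BOMS.items(), key=lambda item: len(item[1]), reverse=True):
        if data.startswith(bom):
            return name
    return None
```

Checking the four-byte marks before the two-byte ones avoids misreading UTF-32 little-endian data as UTF-16.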

What is a minimal set of Unicode characters for reasonable Japanese support?

I have a mobile application that needs to be ported for a Japanese audience. Part of the application is a custom font file that needs to be extended from only containing Latin-1 characters to also containing Japanese characters. I realise that this will make it rather large, but that is not today's problem. Note that I have no control ov...

Unicode First, Previous, Next, and Last

Unicode has snowmen and chess pieces. Does it have the first (<< or |<), previous (<), next (>) and last (>> or >|) symbols? Those would be quite useful for site navigation between articles and the like. ...

How do you handle different character encodings?

I'm trying to understand the basics of practical programming around character encodings. A few things to consider: I know how to read a file whose encoding is different, and convert it to the console's encoding. But when I try to convert literal strings that appear in source code, for some reason, it doesn't always work: In IntelliJ'...

How to display Japanese characters in JTextArea

JTextArea behaves strangely when displaying Japanese characters: I get the well-known blank rectangles instead of kanji. The strangest thing is that JTextField displays them perfectly (in both cases I use the "Tahoma" font family). Also, if I put this code: Font f = new Font("123", Font.PLAIN, 12); // This font doesn't ex...

Unexpected output of std::wcout << L"élève"; in Windows Shell

While testing some functions to convert strings between wchar_t and UTF-8, I got the following weird result with Visual C++ Express 2008: std::wcout << L"élève" << std::endl; prints out "ÚlÞve:", which is obviously not what is expected. This is obviously a bug. How can that be? How am I supposed to deal with such a "feature"? ...

How to check if a Java character is a currency symbol

I have to perform a check on a character variable to see whether or not it is a currency symbol. I have discovered the Character.UnicodeBlock.CURRENCY_SYMBOLS constant; however, I am unsure of how to use this to determine whether or not the character is in that block. If anyone has done this before, help would be much appreciated. Thanks ...
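Two different checks are easy to confuse here: membership in the Currency Symbols block (U+20A0 to U+20CF) and the Unicode general category Sc. The Python 3 sketch below shows both; it is an illustration rather than Java code, and the function names are hypothetical. In Java the comparable checks are usually written with Character.UnicodeBlock.of and Character.getType.

```python
import unicodedata


def is_currency_symbol(ch):
    """True if ch has the Unicode general category Sc (Symbol, currency)."""
    return unicodedata.category(ch) == "Sc"


def in_currency_symbols_block(ch):
    """True if ch lies in the Currency Symbols block, U+20A0 to U+20CF."""
    return 0x20A0 <= ord(ch) <= 0x20CF


dollar = "$"        # U+0024, in the Basic Latin block
euro = "\u20ac"     # U+20AC, in the Currency Symbols block
```

The dollar sign is a currency symbol by category but sits outside the Currency Symbols block, so a block test alone misses it.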

How to handle unicode of an unknown encoding in Django?

I want to save some text to the database using the Django ORM wrappers. The problem is, this text is generated by scraping external websites and many times it seems they are listed with the wrong encoding. I would like to store the raw bytes so I can improve my encoding detection as time goes on without redoing the scrapes. But Django se...

How do you convert a Visual Studio project from using wide strings to ordinary strings?

When I created my Visual Studio project, it defaulted to forcing me to use wide strings for all the functions which take character strings. MessageBox(), for example, takes an LPCWSTR rather than a const char*. While I understand that it's great for multi-lingual and portable applications, it is completely unnecessary for my simple little a...

How can I do Unicode uppercase?

I have this: >>> print 'example' example >>> print 'exámple' exámple >>> print 'exámple'.upper() EXáMPLE What do I need to do to print EXÁMPLE? (Where the 'a' keeps its acute accent, but in uppercase.) I'm using Python 2.6. ...
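The behaviour in the question comes from uppercasing a byte string, where only ASCII letters change. The sketch below shows the contrast in Python 3 syntax, which is an assumption here since the question uses Python 2.6; there, the usual equivalent is to decode to a unicode string before calling upper().

```python
text = "ex\u00e1mple"            # text string containing U+00E1 (a with acute)
upper_text = text.upper()        # case mapping applies to every character

raw = text.encode("utf-8")       # the same text as UTF-8 bytes
upper_raw = raw.upper()          # only ASCII letters are uppercased in bytes
```

`upper_text` contains the uppercase accented letter, while `upper_raw` still carries the lowercase one, matching the output shown in the question.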