unicode

Unicode "end of story"

I'm looking for a good character that means "end-of-story" in Unicode. I remember seeing one once that looked like a fractal and was really cool. Does anyone know where I can find this character? More importantly, where can I go to find a Unicode character with a special meaning when I don't know its name? Google wasn't very helpful. ...

How expensive is Java's string encoding conversion?

I was wondering how expensive Java's string encoding conversion algorithms are, say, for a piece of text in EBCDIC that needs to be converted to UTF-16, or for a similar conversion of a large file. Are there any benchmarks on the cost of this conversion? Benchmarks for multiple encodings would be better. ...

Can I make git recognize a UTF-16 file as text?

I'm tracking a Virtual PC virtual machine file (*.vmc) in git, and after making a change git identified the file as binary and wouldn't diff it for me. I discovered that the file was encoded in UTF-16. Can git be taught to recognize that this file is text and handle it appropriately? I'm using git under Cygwin, with core.autocrlf set ...

How can I change a file's encoding with vim?

I'm used to using vim to modify a file's line endings: $ file file file: ASCII text, with CRLF line terminators $ vim file :set ff=mac :wq $ file file file: ASCII text, with CR line terminators Is it possible to use a similar process to change a file's unicode encoding? I'm trying the following, which doesn't work: $ file file.xml f...

Problem opening a text document - Unicode error

Hello, I have what is probably a rather simple question. However, I am just starting to use Python and it is driving me crazy. I am following the instructions in a book and would like to open a simple text file. The code I am using: import sys try: d = open("p0901aus.txt" , "W") except: print("Unsucessfull") sys.exit(0) I am either getting...

How do I diff UTF-16 files with GNU diff?

GNU diff doesn't seem to be smart enough to detect and handle UTF-16 files, which surprises me. Am I missing an obvious command-line option? Is there a good alternative? ...

Displaying Unicode characters above U+FFFF on Windows

Hi, the application I'm developing with EVC++ 4 runs on Windows CE 5 and should support Unicode (AFAIK wchar_t uses UTF-16 on Windows, so I'm using that), so I want to be able to test it with "more exotic" characters, especially characters that take 4 bytes in UTF-16 rather than just 2. Therefore I'm trying to display such characters in ...

Are digits represented in sequence in all text encodings?

This question is language-agnostic but is inspired by these C/C++ questions: "How to convert a single char into an int" and "Char to int conversion in C". Is it safe to assume that the characters for digits (0123456789) appear contiguously in all text encodings? i.e. is it safe to assume that '9'-'8' = 1 '9'-'7' = 2 ... '9'-'0' = 9 in all...
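
The digit-contiguity assumption in the question above can be spot-checked empirically. This is only a sketch: the helper name `digits_are_contiguous` and the list `SAMPLE_ENCODINGS` are hypothetical, and the check covers a handful of codecs that ship with Python, so it is evidence for those codecs, not a proof for every encoding.

```python
# Hypothetical helper (not from the original question): checks whether the
# encoded values of '0'..'9' are consecutive integers in a given codec.
def digits_are_contiguous(encoding: str) -> bool:
    # Interpret each encoded digit as a big-endian integer so that
    # multi-byte encodings such as UTF-16-BE can be compared too.
    values = [int.from_bytes(ch.encode(encoding), "big") for ch in "0123456789"]
    return all(b - a == 1 for a, b in zip(values, values[1:]))


# Assumed sample of codecs; cp037 and cp500 are EBCDIC variants.
SAMPLE_ENCODINGS = ["ascii", "latin-1", "utf-8", "utf-16-be", "cp037", "cp500", "shift_jis"]

if __name__ == "__main__":
    for name in SAMPLE_ENCODINGS:
        print(name, digits_are_contiguous(name))
```

A big-endian codec is used for UTF-16 on purpose: with a little-endian byte order the same integer comparison would see a step of 256 between digits, even though the code units themselves are still consecutive.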

Storing Currency Symbols in a Database Table

We are using Firebird as our database. How do we go about storing currency symbols in the database? Which character set should we use, or what is generally best practice? For example, storing "$" or "¥" appears straightforward, but more complex symbols do not appear correctly in the database table, i.e. "₡" will not store in the database....

Testing for Japanese/Chinese Characters in a string

I have a program that reads a bunch of text and analyzes it. The text may be in any language, but I need to test for Japanese and Chinese specifically to analyze them a different way. I have read that I can test each character by its Unicode code point to find out if it is in the range of CJK characters. This is helpful, however I would l...
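
The range test described in the question above might look roughly like the following. This is a minimal sketch, not a complete detector: the function names are hypothetical, and only three Unicode blocks are checked (Hiragana U+3040–U+309F, Katakana U+30A0–U+30FF, CJK Unified Ideographs U+4E00–U+9FFF). Han ideographs are shared between Chinese and Japanese, so the sketch treats the presence of kana as a hint for Japanese and nothing more.

```python
# Hypothetical helpers illustrating a code-point range test.
def classify_char(ch: str) -> str:
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"  # shared by Chinese and Japanese text
    return "other"


def contains_han(text: str) -> bool:
    return any(classify_char(c) == "han" for c in text)


def looks_japanese(text: str) -> bool:
    # Assumption: kana only appears in Japanese text.
    return any(classify_char(c) in ("hiragana", "katakana") for c in text)


if __name__ == "__main__":
    for sample in ("日本語のテキスト", "中文文本", "plain ASCII"):
        print(sample, contains_han(sample), looks_japanese(sample))
```

Text written only in Han characters stays ambiguous under this test; the extension blocks and compatibility ideographs are also left out.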

unicode hello world for C?

I am trying to output things like 안, 蠀, ☃ from C: #include <wchar.h> int main() { fwprintf(stdout, L"안, 蠀, ☃\n"); return 0; } The output is ?, ?, ? How do I print those characters? Edit: #include <wchar.h> #include <locale.h> int main() { setlocale(LC_CTYPE, ""); fwprintf(stdout, L"안, 蠀, ☃\n"); return 0; } this ...

Python interface to PayPal - urllib.urlencode non-ASCII characters failing

I am trying to implement PayPal IPN functionality. The basic protocol is as follows: The client is redirected from my site to PayPal's site to complete payment. He logs into his account and authorizes payment. PayPal calls a page on my server, passing in details as a POST. Details include a person's name, address, payment info, etc. I need t...

Can't decode UTF-8 string in Python on OS X Terminal.app

I have Terminal.app set to accept UTF-8, and in bash I can type Unicode characters and copy and paste them, but if I start the Python shell I can't, and if I try to decode Unicode I get errors: >>> wtf = u'\xe4\xf6\xfc'.decode() Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'ascii' codec can't e...

Unicode version supported by Java 6

Anyone know the answer? According to http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp, it's 4.0 for 5. Has it been upgraded in 6? Link to reference would be much appreciated as well. ...

Best way to convert a Unicode URL to ASCII (UTF-8 percent-escaped) in Python?

I'm wondering what's the best way -- or if there's a simple way with the standard library -- to convert a URL with Unicode chars in the domain name and path to the equivalent ASCII URL, encoded with domain as IDNA and the path %-encoded, as per RFC 3986. I get from the user a URL in UTF-8. So if they've typed in http://➡.ws/♥ I get 'htt...
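
One possible shape for the conversion asked about above, sketched in Python 3 with only the standard library. The function name `to_ascii_url` is hypothetical. The sketch assumes the URL has a host and no userinfo part, and it relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules; treat it as an illustration of the two steps (IDNA for the host, UTF-8 percent-encoding for the rest), not as a complete RFC 3986 implementation.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url: str) -> str:
    """Hypothetical helper: IDNA-encode the host, percent-encode path and query."""
    parts = urlsplit(url)
    # Assumption: the URL has a hostname and no user:password@ prefix.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # '%' is kept safe so already-escaped sequences are not escaped twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(to_ascii_url("http://➡.ws/♥"))
```

`quote` encodes non-ASCII characters as UTF-8 before escaping them, which is why ♥ (U+2665) becomes `%E2%99%A5` in the path.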

Any gotchas using unicode_literals in Python 2.6?

We've already gotten our code base running under Python 2.6. In order to prepare for Python 3.0, we've started adding: from __future__ import unicode_literals into our .py files (as we modify them). I'm wondering if anyone else has been doing this and has run into any non-obvious gotchas (perhaps after spending a lot of time debugg...

How do I match a Russian word in Unicode text using Perl?

I have a website I want to regexp on, say http://www.ru.wikipedia.org/wiki/perl . The site is in Russian and I want to pull out all the Russian words. Matching with \w+ doesn't work and matching with \p{L}+ retrieves everything. How do I do it? ...
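
The question above is about Perl, where Unicode script properties such as \p{Cyrillic} are available in regular expressions. To illustrate the idea of matching on script rather than on "any letter", here is a rough equivalent in Python, whose standard `re` module has no \p{...} support, so an explicit range is used. The names are hypothetical, and the sketch assumes the basic Cyrillic block (U+0400–U+04FF) is enough for the text at hand.

```python
import re

# Assumption: words of interest use only the basic Cyrillic block.
CYRILLIC_WORD = re.compile(r"[\u0400-\u04FF]+")


def cyrillic_words(text: str) -> list:
    """Hypothetical helper: return runs of Cyrillic letters found in text."""
    return CYRILLIC_WORD.findall(text)


if __name__ == "__main__":
    print(cyrillic_words("Perl это язык, Larry Wall"))
```

The input must already be decoded to a Unicode string; matching against raw UTF-8 bytes would not work with this pattern.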

How to correctly display Japanese RTF Fonts

I am working on an application in Delphi 2009 which makes heavy use of RTF, edited using TRichEdit and TLMDRichEdit. Users who entered Japanese text in these RTF controls have been submitting intermittent reports about the Japanese text being displayed as gibberish when reloading the content, both on Win XP and Vista, with Eastern Langua...

What UTF format should boost wdirectory_iterator return?

If a file contains a £ (pound) sign then directory_iterator correctly returns the UTF-8 character sequence \xC2\xA3. wdirectory_iterator uses wide chars, but still returns the UTF-8 sequence. Is this the correct behaviour for wdirectory_iterator, or am I using it incorrectly? AddFile(testpath, "pound£sign"); wdirectory_iterator iter(test...

String encodings in Python

Hello. In Python, strings may be Unicode (both UTF-16 and UTF-8) or single-byte with different encodings (cp1251, cp1252, etc.). Is it possible to check which encoding a string uses? For example, time.strftime( "%b" ) will return a string with the text name of a month. Under Mac OS the returned string will be UTF-16, under Windows with English ...