unicode

How do I write data to disk in UTF-8 encoding in Python?

The following Python code ... html_data = urllib2.urlopen(some_url).read() f = codecs.open(filename, 'w', encoding='utf-8') f.write(html_data) f.close() ... sometimes fails with UnicodeDecodeError ... File "/.../lib/python2.6/codecs.py", line 686, in write return self.writer.write(data) File "/.../lib/python2.6/codecs.py", line 351...
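
One plausible reading of this failure, assuming Python 2 semantics: read() returns a byte string, and the codecs writer implicitly decodes byte strings as ASCII before encoding them, which fails on any non-ASCII byte. A minimal Python 3 sketch of the explicit route follows. The helper name save_as_utf8 and the latin-1 default are illustrative assumptions; the real source encoding should come from the HTTP headers or the page itself.

```python
import os
import tempfile


def save_as_utf8(raw: bytes, path: str, source_encoding: str = "latin-1") -> None:
    """Decode raw bytes explicitly, then write the text to disk as UTF-8.

    source_encoding is an assumption for this sketch; use the encoding
    the server actually declares.
    """
    # Decoding explicitly avoids any implicit ASCII conversion.
    text = raw.decode(source_encoding)
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


# Usage: the Latin-1 byte 0xE9 ends up as the two-byte UTF-8 sequence for U+00E9.
demo_path = os.path.join(tempfile.mkdtemp(), "page.html")
save_as_utf8(b"caf\xe9", demo_path)
with open(demo_path, "rb") as f:
    print(f.read())  # b'caf\xc3\xa9'
```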

Regex to match all unicode quotation marks

Is there a simple regular expression to match all unicode quotes? Or does one have to hand-code it like this: quotes = ur"[\"'\u2018\u2019\u201c\u201d]" Thank you for reading. Brian ...
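
As a rough sketch of one alternative to hand-coding the list, the character class can be generated from Unicode general categories in Python 3. This is an approximation, not a complete answer: it collects the Pi (initial quote) and Pf (final quote) categories plus the two ASCII quotes, so low quotes such as U+201E, which sit in a different category, are not included. The function name build_quote_pattern is hypothetical.

```python
import re
import sys
import unicodedata


def build_quote_pattern():
    """Build a character class from ASCII quotes plus categories Pi and Pf."""
    chars = {'"', "'"}
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        if unicodedata.category(ch) in ("Pi", "Pf"):
            chars.add(ch)
    # re.escape keeps the class safe even if a member were a regex metacharacter.
    return re.compile("[" + "".join(re.escape(c) for c in sorted(chars)) + "]")


QUOTES = build_quote_pattern()

# Usage: normalise curly double quotes to straight ones.
print(QUOTES.sub('"', "\u201chi\u201d"))  # "hi"
```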

C++: wide characters outputting incorrectly?

My code is basically this: wstring japan = L"日本"; wstring message = L"Welcome! Japan is "; message += japan; wprintf(message.c_str()); I want to use wide strings, but I do not know how they are output, so I used wprintf. When I run something such as: ./widestr | hexdump The hexadecimal code points come out as: 65 57 63 6c 6...

How to Output Unicode Strings on the Windows Console

Hello, there are already a few questions relating to this problem. I think my question is a bit different because I don't have an actual problem, I'm only asking out of academic interest. I know that Windows's implementation of UTF-16 is sometimes contradictory to the Unicode standard (e.g. collation) or closer to the old UCS-2 than to ...

Python 2.6.5 supports Unicode? How come listdir() doesn't but Python 3.1.2 does show Unicode?

Python 2.6.5 is said to support Unicode? How come listdir() doesn't in IDLE, but Python 3.1.2 does show Unicode in IDLE? (this is tested on Windows 7) The following code is the same behavior: for dirname, dirnames, filenames in os.walk('c:\path\somewhere'): for subdirname in dirnames: print (os.path.join(dirname, subdirna...

Query MS SQL for empty spaces (&nbsp; or \xa0)

When exporting some data from MS SQL Server using Python, I found that some of my data looked like computer \xa0systems, which is causing encoding errors. In SQL Management Studio the row simply appears to be double spaced: computer systems. It seems that this is the code for &nbsp;. How can I query MS SQL Server within managemen...
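
On the Python side of such an export, the character in question is U+00A0 (NO-BREAK SPACE). The sketch below only shows how exported values might be checked and cleaned; it does not answer the SQL part of the question, and the helper names are hypothetical.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE, shown as \xa0 in Python reprs


def find_nbsp_rows(values):
    """Return the indexes of values that contain a no-break space."""
    return [i for i, value in enumerate(values) if NBSP in value]


def normalize_spaces(value):
    """Replace every no-break space with an ordinary space."""
    return value.replace(NBSP, " ")


rows = ["computer \xa0systems", "plain text"]
print(find_nbsp_rows(rows))  # [0]
print(repr(normalize_spaces(rows[0])))  # 'computer  systems'
```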

MySQL's utf_general_ci in C#

Is there an easy way to replicate the behavior of MySQL's utf_general_ci collation in C#? In particular, given a Unicode string, I want to generate a(n ASCII?) string that can then be trivially sorted or compared, as utf_general_ci would. I found this question, which shows how to strip accents from strings, which looks like a similar b...

Can I Eliminate Extra Unicode String Calls (Delphi)

I'm using Delphi 2009. In my program, I have been working very hard to optimize all my Delphi code for speed and memory use, especially my Unicode string handling. I have the following statement: Result := Result + GetFirstLastName(IndiID, 1); When I debug that line, upon return from the GetFirstLastName function, it traces into ...

How Can I Get Around this EOutOfMemory Exception When Encoding a Very Large File?

I am using Delphi 2009 with Unicode strings. I'm trying to Encode a very large file to convert it to Unicode: var Buffer: TBytes; Value: string; Value := Encoding.GetString(Buffer); This works fine for a Buffer of 40 MB that gets doubled in size and returns Value as an 80 MB Unicode string. When I try this with a 300 MB Buffer...

Indexing and searching French text with diacritics in Lucene

I am using Lucene Search. I have uploaded a French file with the following content. french.txt multimédia francophone pour l'enseignement du français langue étrangère If I search for francophone, the file appears in the search results. But when I search for multimédia, français or étrangère, no results are shown. I have tried to ...

unicode preg_replace problem in php

I've got the string $result = "bei einer Temperatur, die etwa 20 bis 60°C unterhalb des Schmelzpunktes der kristallinen Modifikation" which comes straight from a MySQL table. The table and the PHP headers are both set to UTF-8. I want to strip the 'degree' symbol: http://en.wikipedia.org/wiki/Degree_symbol and replace it with the wor...

How to implement shifted variable weighting for default Unicode collation?

The default Unicode collation element table defines four-level weight elements for Unicode characters, where the first three levels define the essential part of the sort order and the fourth level is essentially the character code, which is used for tie-breaking. The section on variable weighting defines the "shifted" option (the defaul...

Python regex \w doesn't match combining diacritics?

I have a UTF8 string with combining diacritics. I want to match it with the \w regex sequence. It matches characters that have accents, but not if there is a latin character with combining diacritics. >>> re.match("a\w\w\wz", u"aoooz", re.UNICODE) <_sre.SRE_Match object at 0xb7788f38> >>> print u"ao\u00F3oz" aoóoz >>> re.match("a\w\w\wz...
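
A small Python 3 sketch of the behaviour, assuming the failing input is a base letter followed by a separate combining mark (the excerpt is cut off before that case): \w does not match U+0301 on its own, but NFC normalisation composes the pair into a single letter that \w does match. This only helps where a precomposed form exists.

```python
import re
import unicodedata

PATTERN = re.compile(r"a\w\w\wz")

# 'o' followed by U+0301 COMBINING ACUTE ACCENT: six code points in total.
decomposed = "aoo\u0301oz"

# The combining mark is not a word character, so the match fails.
print(PATTERN.match(decomposed))  # None

# NFC merges 'o' + U+0301 into U+00F3, leaving five code points.
composed = unicodedata.normalize("NFC", decomposed)
print(PATTERN.match(composed) is not None)  # True
```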

How to represent Unicode Chr Code in VB.Net String literal?

I know you can put unicode character codes in a VB.Net string like this: str = Chr(&H0030) & "More text" I would like to know how I can put the char code right into the string literal so I can use unicode symbols from the designer view. Is this even possible? ...

Python and Unicode Blocks for regex

Coming from the land of Perl, I can do something like the following to test the membership of a string in a particular unicode block: # test if string has any katakana script characters my $japanese = "カタカナ"; if ($japanese =~ /\p{InKatakana}/) { print "string has katakana" } I've read that Python does not support unicode blocks (tr...
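
With the standard re module, one workable substitute is to spell the block out as a code point range. The sketch below assumes Python 3 and uses the Katakana block, U+30A0 to U+30FF; the function name has_katakana is hypothetical.

```python
import re

# The Katakana block spans U+30A0 through U+30FF.
KATAKANA = re.compile("[\u30a0-\u30ff]")


def has_katakana(text):
    """Return True if text contains at least one code point from the Katakana block."""
    return KATAKANA.search(text) is not None


print(has_katakana("カタカナ"))  # True
print(has_katakana("hello"))  # False
```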

How to get a Ruby substring of a Unicode string?

I have a field in my Rails model that has max length 255. I'm importing data into it, and sometimes the imported data has a length > 255. I'm willing to simply chop it off so that I end up with the largest possible valid string that fits. I originally tried to do field[0,255] in order to get this, but this will actually chop trailing ...
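
The same idea, sketched in Python 3 rather than Ruby and assuming the 255 limit is counted in UTF-8 bytes (the excerpt does not say): cut the encoded bytes, then decode while dropping any incomplete trailing sequence. The helper name truncate_utf8 is hypothetical.

```python
def truncate_utf8(text, max_bytes):
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The input is valid UTF-8, so only the cut-off tail can be incomplete;
    # errors="ignore" drops that partial character.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# Each "é" takes two bytes, so 255 bytes hold 127 whole characters.
shortened = truncate_utf8("é" * 200, 255)
print(len(shortened), len(shortened.encode("utf-8")))  # 127 254
```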

Regex won't find '\u2028' unicode characters.

We're having a lot of trouble tracking down the source of \u2028 (Line Separator) in user submitted data which causes the 'unterminated string literal' error in Firefox. As a result, we're looking at filtering it out before submitting it to the server (and then the database). After extensive googling and reading of other people's probl...
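
If the filtering ends up on the server instead, a Python 3 sketch could look like this. It maps U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) to a plain newline; replacing them with a newline, as opposed to deleting them, is an assumption of this example, and the function name is hypothetical.

```python
# Translation table: both Unicode separators become an ordinary newline.
_SEPARATORS = {0x2028: "\n", 0x2029: "\n"}


def replace_line_separators(text):
    """Replace U+2028 and U+2029 with a newline character."""
    return text.translate(_SEPARATORS)


print(repr(replace_line_separators("a\u2028b\u2029c")))  # 'a\nb\nc'
```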

Adding unicode support to a library for Windows

I would like to add Unicode support to a C library I am maintaining. Currently it expects all strings to be passed in UTF-8 encoded. Based on feedback, it seems Windows usually provides three versions of a function: fooA() for ANSI-encoded strings, fooW() for Unicode-encoded strings, and foo(), whose string encoding depends on the UNICODE define. Is there an easy w...

UnicodeEncodeError Google App Engine

I am getting the very familiar: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 24: ordinal not in range(128) I have checked out multiple posts on SO and they recommend - variable.encode('ascii', 'ignore') however, this is not working. Even after this I am getting the same error ... The stack trace: '...

Replace a string that contains #0?

I use this function to read a file into a string: function LoadFile(const FileName: TFileName): string; begin with TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite) do begin try SetLength(Result, Size); Read(Pointer(Result)^, Size); except Result := ''; Free; raise; end; Free; ...