utf-8

Are 6 octet UTF-8 sequences valid?

Can UTF-8 encode 5 or 6 byte sequences, allowing all Unicode characters to be encoded? I'm getting conflicting standards. I need to be able to support every Unicode character, not just those in the U+0000..U+10FFFF range. (All quotes are from RFC 3629) Section 3: In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 ...
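
A minimal sketch of the limit the question is asking about, assuming Python 3. RFC 3629 restricts UTF-8 to the U+0000..U+10FFFF range, which is also the whole Unicode code space, so a code point never needs more than four octets. The helper name `utf8_length` and the sample code points are illustrative only.

```python
def utf8_length(code_point):
    """Return how many octets UTF-8 uses for one code point."""
    return len(chr(code_point).encode("utf-8"))

# Largest code point that fits in 1, 2, 3 and 4 octets respectively.
samples = [0x7F, 0x7FF, 0xFFFF, 0x10FFFF]
lengths = [utf8_length(cp) for cp in samples]
print(lengths)  # [1, 2, 3, 4]

# Anything above U+10FFFF is not a code point at all.
try:
    chr(0x110000)
    beyond_range_rejected = False
except ValueError:
    beyond_range_rejected = True
```

The 5- and 6-octet forms come from the older, wider definition of UTF-8; under RFC 3629 a decoder is expected to reject them.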

Django utf-8 and django-mailer strangeness

Latest django-mailer from trunk: http://github.com/jtauber/django-mailer/tree/master/docs/ Tested with PostgreSQL 8.4 and sqlite3. Template: {{ title }} forms.py: #-*- coding: utf-8 -*- if "mailer" in settings.INSTALLED_APPS: from mailer import send_mail else: from django.core.mail import send_mail ... body_txt = render...

How can I convert HTML character references (e.g. &#1507; for ף) to regular UTF-8?

Hello. I have some Hebrew websites that contain character references like &#1504;&#1493;&#1507; (נוף). I can only view these letters if I save the file as .html and view it in UTF-8 encoding. If I try to open it as a regular text file, UTF-8 encoding does not show the proper output. I noticed that if I open a text editor and write Hebrew...
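
One possible approach, sketched in Python 3 with the standard `html` module: decode the references to text first, then encode that text as UTF-8. The sample string is an assumption (the decimal references for the three Hebrew letters in the excerpt); the real pages may use hexadecimal or named references, which `html.unescape` also accepts.

```python
import html

# Assumed sample: decimal character references for nun, vav, final pe.
raw = "&#1504;&#1493;&#1507;"

# Step 1: turn the references into real characters.
text = html.unescape(raw)

# Step 2: encode the characters as UTF-8 bytes for saving to a file.
utf8_bytes = text.encode("utf-8")

print(text)
print(utf8_bytes)
```

Writing `utf8_bytes` to a file opened in binary mode gives a plain text file that a UTF-8 aware editor should display as Hebrew.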

AS2 Flash Input Text Problem

Hi, I have a big problem and no idea how to sort it out. I've made a form in Flash using input text fields. I'm Polish, so our customers expect to be able to enter Polish characters in the input text fields (e.g. źćż). The problem occurs in WebKit-engine browsers (Safari, Chrome), which just put normal characters ( l ...

Python regex against Latin-1 character encoding?

I have a file which is (I believe) Latin-1 encoded. However, I cannot match regexes against this file. If I cat the file, it looks fine. However, I cannot find the string: In [12]: txt = open("b").read() In [13]: print txt <Vw_IncidentPipeline_Report> In [14]: txt Out[14]: '\x00 \x00 \x00<\x00V\x00w\x00_\x00I\x00n\x0...
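
The `\x00` bytes interleaved with the letters in `Out[14]` suggest the data may be UTF-16 (big-endian, judging by the byte order shown) rather than Latin-1. The following Python 3 sketch assumes that reading and uses a made-up in-memory sample in place of the file; it shows why a byte-level search misses and a search on the decoded text does not.

```python
import re

# Assumed stand-in for the file contents: the tag from the excerpt,
# encoded as UTF-16 big-endian (no BOM).
raw = "  <Vw_IncidentPipeline_Report>".encode("utf-16-be")

# Searching the raw bytes fails: every letter is separated by a NUL byte.
found_in_bytes = b"<Vw_IncidentPipeline_Report>" in raw

# Decoding first gives ordinary text that a regex can match.
text = raw.decode("utf-16-be")
match = re.search(r"<Vw_\w+_Report>", text)

print(found_in_bytes)
print(match.group(0) if match else None)
```

If the real file starts with a byte order mark, decoding with `"utf-16"` instead would pick the byte order automatically.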

Why do I have to use set_charset("utf8") even though everything is UTF-8 encoded? (MySQLi-PHP)

My table's collation is utf8_general_ci. My pages are encoded with UTF-8 (without BOM). Within my pages, my http-equiv meta tag sets the character set to utf8. My data has Turkish characters in it. When I output them, they are not shown as they should be, but when I call $db->set_charset("utf8");, it works. Why do I have to use $db->set_charset...

Remove BOM from page output via web.config

Currently our pages are being output with the Unicode BOM. I have found one way of removing this by adding the following to my master page's OnInit: Response.ContentEncoding = new System.Text.UTF8Encoding(false); where the false passed to the UTF8Encoding constructor disables the BOM. This works fine, but I'd prefer to set this in...

META value charset=UTF-8 prevents UTF-8 characters showing.

I've made a test program that is basically just a textarea that I can enter characters into; when I click submit, the characters are written to a MySQL test table (using PHP). The test table's collation is UTF-8. The script works fine: if I write an é or ú to the database, it is stored correctly. But then if I add the following meta s...

Regex validation on UTF-8 / multi-byte 'language' characters (incl. Chinese etc.) but not special characters such as {/*

Hi, we use PHP / MySQL, all encoded as UTF, and have recently had to start capturing non-Latin characters such as Chinese. We have PHP validation that checks string length and alphanumeric content, such as if (!ereg("[[:alnum:]]{2,}",$_POST['company_name'])) { //error code here } This is not working on multi-byte chars. I understand ...

php: using DomDocument whenever I try to write UTF-8 it writes the hexadecimal notation of it.

Hello. When I try to write UTF-8 strings into an XML file using DomDocument, it writes the hexadecimal notation of the string instead of the string itself, for example: &#x5D9;&#x5E8;&#x5D5;&#x5E9;&#x5DC;&#x5D9;&#x5DD; instead of: ירושלים. Any ideas how to resolve the issue? ...

Substitute for u'string'

I saved my script in UTF-8 encoding. I changed my code page on Windows to 65001. I'm on Python 2.6. Script #1: # -*- coding: utf-8 -*- print u'Español' x = raw_input() Script #2: # -*- coding: utf-8 -*- a = 'Español' a.encode('utf8') print a x = raw_input() Script #1 prints the word fine with no errors; Script #2 raises an error: Un...

How to print Japanese UTF-8 on the console in Windows?

#coding=<utf8> import os os.popen('chcp 65001') a = 'こんにちは世界' print a.decode('utf8') x = raw_input() Python 2.6 on Windows 7. It runs in IDLE with no errors. However, when run from the console, it raises an error and the window flashes by so quickly that I can't read the error message. How can it be done in the Windows console? By the way, doing this wit...

How to display UTF-8 in the Windows console

I'm using Python 2.6 on Windows 7. I borrowed some code from here: http://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console My goal is to be able to display UTF-8 strings in the Windows console. Apparently, in Python 2.6, sys.setdefaultencoding() is no longer supported. However, I wrote reload(sys) ...

Flush the console screen with special Unicode class workaround for the Windows console

I'm trying to make a simple text progress bar in the Windows console and also display UTF-8 characters. The problem isn't that the Unicode characters won't display; they will. It's that, in order to make the Unicode characters display, I used a class to tell sys.stdout what to do. This has interfered with the normal flush() function. How can ...

Accent-insensitive substring matching

I have a search functionality that obtains data from an InnoDB table (utf8_spanish_ci collation) and displays it in an HTML document (UTF-8 charset). The user types a substring and obtains a list of matches where the first substring occurrence is highlighted, e.g.: Matches for "AL": Álava <strong>Al</strong>bacete <strong>Al</strong>me...
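
A rough sketch of one way to highlight accent-insensitively on the application side, written in Python 3 since the excerpt does not name the language in use. The helper names `fold_char` and `highlight` are hypothetical. It assumes the input is in composed (NFC) form and folds one character at a time so that positions in the folded string line up with positions in the original; it does not HTML-escape its input.

```python
import unicodedata

def fold_char(ch):
    """Strip accents and case from one character, keeping length 1."""
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    stripped = stripped.lower()
    # Fall back to the original character if folding changed the length.
    return stripped if len(stripped) == 1 else ch

def highlight(text, needle):
    """Wrap the first accent-insensitive occurrence of needle in <strong>."""
    folded_text = "".join(fold_char(c) for c in text)
    folded_needle = "".join(fold_char(c) for c in needle)
    pos = folded_text.find(folded_needle)
    if pos < 0 or not folded_needle:
        return text
    end = pos + len(folded_needle)
    return text[:pos] + "<strong>" + text[pos:end] + "</strong>" + text[end:]

print(highlight("Álava", "AL"))     # <strong>Ál</strong>ava
print(highlight("Albacete", "AL"))  # <strong>Al</strong>bacete
```

Folding per character is a deliberate simplification: it keeps the index arithmetic trivial, at the cost of ignoring characters whose folded form is longer or shorter than one character.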

Optimized regex for N words around a given word (UTF-8)

I'm trying to find an optimized regex to return the N words (if available) around another one to build a summary. The string is in UTF-8, so the definition of "words" is larger than just [a-z]. The string that serves as the reference word could be in the middle of a word or not directly surrounded by spaces. I've already got the followi...

PHP and character encoding problem with  character

I'm having a problem where PHP (5.2) cannot find the character 'Â' in a string, though it is clearly there. I realize the underlying problem has to do with character encoding, but unfortunately I have no control over the source content. I receive it as UTF-8, with those characters already in the string. I would simply like to remove it...

Russian language in e-TextEditor or Cygwin

I'm using e-TextEditor for some tasks and can't figure out why, when I use some Russian text and process it in a bundle script, I always get something like http://gyazo.com/f38c69babe1f95ff786711fe684aee77.png . I think this is a Cygwin bug, because WebKit should render it correctly in UTF-8 encoding. I've tested some guides that describe ...

perl2php, +mysql string encoding (+CodeIgniter)

Seriously, I'm lost in the UTF-8 world. Here is my situation (everything is happening on a Mac): I get a web service response with Perl + LWP and store it in a MySQL database; the response is encoded in UTF-8 and I use the DBI module to store data in a MySQL UTF-8 table (utf8_general_ci collation); when I get strings from the database with a CI model; outp...

How to make Python 3 print() UTF-8

How can I make Python 3 (3.1) print("Some text") to stdout in UTF-8, or output raw bytes? Test.py: TestText = "Test - āĀēĒčČ..šŠūŪžŽ" # this is UTF-8 TestText2 = b"Test2 - \xc4\x81\xc4\x80\xc4\x93\xc4\x92\xc4\x8d\xc4\x8c..\xc5\xa1\xc5\xa0\xc5\xab\xc5\xaa\xc5\xbe\xc5\xbd" # just bytes print(sys.getdefaultencoding()) prin...
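
A minimal sketch of the raw-bytes route, assuming Python 3: encode the text explicitly and write the bytes to a binary stream. The helper name `write_utf8` is hypothetical, and the demo writes to an in-memory `io.BytesIO` so the result can be inspected; the sample text is shortened from the excerpt.

```python
import io

def write_utf8(binary_stream, text):
    """Encode text as UTF-8 and write the raw bytes to a binary stream."""
    binary_stream.write(text.encode("utf-8"))
    binary_stream.flush()

test_text = "Test - āĀ"

# In-memory stand-in for stdout's binary layer.
demo = io.BytesIO()
write_utf8(demo, test_text + "\n")
captured = demo.getvalue()
print(captured)  # b'Test - \xc4\x81\xc4\x80\n'
```

Passing `sys.stdout.buffer` (the binary layer underneath `print()`) in place of `demo` sends the same bytes to stdout regardless of the console's default encoding; whether the console then renders them correctly is a separate matter.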