utf-8

Am I correctly supporting UTF-8 in my PHP apps?

I would like to make sure that everything I know about UTF-8 is correct. I have been trying to use UTF-8 for a while now but I keep stumbling across more and more bugs and other weird things that make it seem almost impossible to have a 100% UTF-8 site. There is always a gotcha somewhere that I seem to miss. Perhaps someone here can corr...

Really Good, Bad UTF-8 example test data

So we have the XSS cheat sheet to test our XSS filtering - but other than an example benign page I can't find any evil or malformed test data to make sure that my UTF-8 code can handle misbehaving data. Where can I find some good, uh... bad data to test with? Or what is a tricky sequence of chars? ...
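As a rough illustration of the kind of "bad" data the question is after, here is a minimal Python sketch that collects a few widely known classes of malformed UTF-8 and checks that a strict decoder rejects each one. The names MALFORMED_SAMPLES and is_valid_utf8 are made up for this example, and the list is illustrative rather than exhaustive.

```python
# A few well-known classes of malformed UTF-8 (illustrative, not exhaustive).
# MALFORMED_SAMPLES and is_valid_utf8 are hypothetical names for this sketch.
MALFORMED_SAMPLES = {
    "overlong encoding of '/'": b"\xc0\xaf",
    "lone continuation byte": b"\x80",
    "truncated 3-byte sequence": b"\xe2\x82",
    "encoded UTF-16 surrogate half": b"\xed\xa0\x80",
    "code point above U+10FFFF": b"\xf4\x90\x80\x80",
}


def is_valid_utf8(data: bytes) -> bool:
    """Return True only if the bytes decode under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Map each sample name to whether the strict decoder rejected it.
rejected = {name: not is_valid_utf8(raw) for name, raw in MALFORMED_SAMPLES.items()}

for name, was_rejected in rejected.items():
    print(f"{name}: {'rejected' if was_rejected else 'ACCEPTED'}")
```

The same byte strings can be fed to whatever code is under test; a decoder that accepts any of them is probably more lenient than a strict UTF-8 implementation should be.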

Unicode in Jar resources

I have a Unicode (UTF-8 without BOM) text file within a jar, that's loaded as a resource. URL resource = MyClass.class.getResource("datafile.csv"); InputStream stream = resource.openStream(); BufferedReader reader = new BufferedReader( new InputStreamReader(stream, Charset.forName("UTF-8"))); This works fine on Windows, but on Lin...

urlopen, BeautifulSoup and UTF-8 Issue

I am just trying to retrieve a web page, but somehow a foreign character is embedded in the HTML file. This character is not visible when I use "View Source." isbn = 9780141187983 url = "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=%s" % isbn opener = urllib2.build_opener() url_opener = opener.open(url) page = url_ope...

How do I find the length of a Unicode string in Perl?

The perldoc page for length() tells me that I should use bytes::length(EXPR) to find the length of a Unicode string in bytes, and the bytes page echoes this. use bytes; $ascii = 'Lorem ipsum dolor sit amet'; $unicode = 'Lørëm ípsüm dölör sît åmét'; print "ASCII: " . length($ascii) . "\n"; print "ASCII bytes: " . bytes::length($ascii) . "\n"; prin...
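The distinction the question turns on, length in characters versus length in encoded bytes, can be sketched in Python rather than Perl (this shows the concept only, not the Perl answer). The sample string is the one from the excerpt. Normalizing it to NFC is an assumption added here so that each accented letter counts as a single code point.

```python
import unicodedata

# Sample string from the question, normalized so each accented letter
# is one precomposed code point (an assumption made for this sketch).
text = unicodedata.normalize("NFC", "Lørëm ípsüm dölör sît åmét")

char_count = len(text)                  # number of Unicode code points
byte_count = len(text.encode("utf-8"))  # number of bytes once encoded as UTF-8

print("characters:", char_count)
print("UTF-8 bytes:", byte_count)
```

Each of the nine non-ASCII letters takes two bytes in UTF-8, which is why the byte count is larger than the character count.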

Encoding in UTF-8 from PHP

I am not that good with encoding, and I am falling over even the basics here. I am trying to create a file that is recognised as UTF-8: header("Content-Type: text/plain; charset=utf-8"); header("Content-disposition: attachment; filename=test.txt"); echo "test"; exit(); I also tried: header("Content-Type: text/plain; charset=utf-8");...

PHP 5: how to write utf-8 binary data - image - to output?

Hi, I have an Ubuntu server and PHP5, and the PHP script files and all output are in UTF-8. I'm trying to send an image to the output stream, but only garbled Chinese characters show up in the output: $im = imagecreatetruecolor(120, 20); $text_color = imagecolorallocate($im, 233, 14, 91); imagestring($im, 1, 5, 5, 'A Simple Text Stri...

How can I identify different encodings without the use of a BOM?

I have a file watcher that is grabbing content from a growing file encoded with utf-16LE. The first bit of data written to it has the BOM available -- I was using this to identify the encoding against UTF-8 (which MOST of my files coming in are encoded in). I catch the BOM and re-encode to UTF-8 so my parser doesn't freak out. The proble...
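One common heuristic for telling UTF-16LE from UTF-8 when no BOM is present can be sketched as below. This is only a sketch under stated assumptions: the function name guess_encoding is made up, the NUL-byte test assumes mostly ASCII-range text, and a chunk cut in the middle of a multi-byte UTF-8 sequence will be reported as "unknown".

```python
def guess_encoding(chunk: bytes) -> str:
    """Heuristically classify a chunk as 'utf-8', 'utf-16-le' or 'unknown'.

    Hypothetical helper for illustration; not a general-purpose detector.
    """
    # A BOM, when present, is still the most reliable signal.
    if chunk.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if chunk.startswith(b"\xff\xfe"):
        return "utf-16-le"

    # In UTF-16LE, ASCII-range characters have a zero high byte, and the
    # high bytes sit at the odd offsets. Many NULs there suggest UTF-16LE.
    high_bytes = chunk[1::2]
    if high_bytes and high_bytes.count(0) > len(high_bytes) // 2:
        return "utf-16-le"

    # Otherwise accept the chunk as UTF-8 only if it decodes strictly.
    try:
        chunk.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"


print(guess_encoding("hello world".encode("utf-16-le")))
print(guess_encoding("héllo wörld".encode("utf-8")))
```

For a growing file, it may be simpler to remember the encoding detected from the first read and reuse it for later chunks instead of guessing each time.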

I need help fixing Broken UTF8 encoding

I am in the process of fixing some bad UTF8 encoding. I am currently using PHP 5 and MySQL. In my database I have a few instances of bad encodings that print like: î. The database collation is utf8_general_ci; PHP is using a proper UTF8 header; Notepad++ is set to use UTF8 without BOM; database management is handled in phpMyAdmin; not al...

How to transport data between different database encodings?

We have an Oracle database that contains "Traditional Chinese" characters and English text, and the environment is:
PARAMETER                 VALUE
NLS_LANGUAGE              AMERICAN
NLS_TERRITORY             AMERICA
NLS_CURRENCY              $
NLS_ISO_CURRENCY          AMERICA
NLS_NUMERIC_CHARACTERS    .,
NLS_CHARACTERSET          WE8PC850
NLS_CALENDAR              GREGORIAN
NLS_DATE_FORMAT           DD-MON-RR ...

C++ ctype facet for UTF-8 in mingw

In a project all internal strings are kept in utf-8 encoding. The project is ported to Linux and Windows. There is a need for a to_lower functionality now. On POSIX OS I could use std::ctype_byname("ru_RU.UTF-8"). But with g++ (Debian 4.3.4-1), ctype::tolower() doesn't recognize Russian UTF-8 characters (Latin text is lowercased fine). O...

Latin letters with acute: DjangoUnicodeDecodeError

Hi, I have a problem reading a txt file to insert into a MySQL db table. The file contains in its first line: "aclaración". The snippet of the code: archivo = open('file.txt',"r") for line in archivo.readlines(): ....body = body + line model = MyModel(body=body) model.save() I get a DjangoUnicodeDecodeError: 'utf8' codec can't...

How to change the internal processing encoding to UTF8 in PHP?

Currently in my application the utf8-encoded data is corrupted by PHP's internal encoding. How can I make it consistent with utf8? EDIT: To show examples, please tell me how to output the current internal encoding in PHP. In php.ini I found the following: default_charset = "iso-8859-1", which means Latin1. How to change it to utf8, say, what...

What's the code page of utf8?

My cmd prompt's default code page is 936. I need to change it to utf8. chcp 65001 The above doesn't work; what's the correct one? ...

String issue of utf-8 encoding with PHP and MySQL?

This is what I tried so far, by modifying php.ini: default_charset = "utf-8" This is how MySQL is configured: mysql> show variables like '%char%'; +--------------------------+-----------------------------------------------+ | Variable_name | Value | +--------------------------+-----...

How do I print UTF-8 from a C++ console application on Windows?

For a C++ console application compiled with Visual Studio 2008 on English Windows (XP, Vista or 7), is it possible to print out to the console and correctly display UTF-8 encoded Japanese using cout or wcout? ...

Character Encoding Issue - Strange Behaviour From Pound Signs (£) with UTF-8 IE6 / ASP / XML

Hi folks, I am having a very strange problem with pound signs displaying incorrectly (or not at all) on a web page. I am keying text in a textbox, which then gets (briefly) stored in XML before being displayed in a new IE(6) window. The worst part is that this is inconsistent. I have three different things happening: 1. Pound sign doe...

Handling special characters in C (UTF-8 encoding)

Hi! I'm writing a small application in C that reads a simple text file and then outputs the lines one by one. The problem is that the text file contains special characters like Æ, Ø and Å among others. When I run the program in a terminal, those characters are shown as "?". Is there an easy fix? ...

Converting Composite Bytes to Unicode in MySQL

I have a MySQL database that I recently migrated to another server. Unfortunately, MySQL dumps its data in Latin1 with any UTF-8 characters represented by composite bytes (ex. – instead of —). Is it possible to run a simple query or script that would convert these composite bytes to UTF-8 within my tables? It's impossible to do it row...

 character (UTF-8 BOM) in middle of ASP.NET response due to HttpResponse.TransmitFile()

I've seen this post:  characters appended to the beginning of each file. In that case, the author was manually reading the source file and writing the contents. In my case, I'm abstracting it away via HttpResponse.TransmitFile(): public void ProcessRequest(HttpContext context) { HttpRequest req = context.Request; HttpResponse ...