utf-8

Looking for samples to validate UTF-8

Hello everyone, suppose I have a byte stream (array) and I want to write code (in C# on .NET) to check whether it is a valid UTF-8 byte sequence. I want to write the code from scratch because I need to report the exact location of any invalid byte sequences, and perhaps remove the invalid bytes, rather than just get a yes or no...
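A minimal sketch of one way to get positions rather than a yes/no answer. It is written in Python, not the C# the question asks about, and it leans on the built-in decoder instead of a from-scratch state machine. The function name `find_invalid_utf8` is hypothetical.

```python
def find_invalid_utf8(data: bytes):
    """Return (start, end) byte offsets of each invalid UTF-8 sequence in data."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means nothing invalid is left.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            # exc.start / exc.end are relative to the slice, so shift by pos.
            bad.append((pos + exc.start, pos + exc.end))
            # Resume right after the offending bytes.
            pos += exc.end
    return bad


def strip_invalid_utf8(data: bytes) -> bytes:
    """Drop invalid sequences and return the remaining valid UTF-8 bytes."""
    return data.decode("utf-8", errors="ignore").encode("utf-8")
```

Re-slicing on every error keeps the sketch short but copies the remaining input each time, so it is a poor fit for heavily corrupted, very large buffers.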

International Fonts Display Issue with UTF-8

Hi, we have developed a PHP-MySQL application in two languages, English and Gujarati. Gujarati text needs Unicode (UTF-8) encoding to display properly. The application runs perfectly on my Windows-based localhost and on my Linux-based testing server on the web. But when I transfer the application to the clie...

Failsafe conversion between different character encodings

Hello! I need to convert strings from one encoding (UTF-8) to another. The problem is that the target encoding does not have all the characters of the source encoding, and the libc iconv(3) function fails in that situation. What I want is to perform the conversion but have these problematic characters in the output string replaced w...
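As an illustration of the behaviour being asked for (not of iconv itself), here is a small Python sketch that converts and substitutes a placeholder for anything the target encoding cannot represent. The function name `convert_lossy` and the ISO-8859-1 default target are assumptions for the example.

```python
def convert_lossy(data: bytes, source: str = "utf-8", target: str = "iso-8859-1") -> bytes:
    """Re-encode data, replacing characters missing from the target with '?'."""
    text = data.decode(source)
    # errors="replace" makes the encoder emit "?" instead of raising.
    return text.encode(target, errors="replace")


print(convert_lossy("naïve €5".encode("utf-8")))  # the euro sign is not in ISO-8859-1
```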

UTF-8 only in Grails 1.1 database tables

When using Grails 1.1 together with MySQL, the charsets of the auto-generated database tables seem to default to ISO-8859-1. I'd rather have everything stored as pure UTF-8. Is that possible? From the auto-generated database definitions: ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=latin1; Note the "latin1" part. A work-around th...

decode a file stream using UTF-8

Hello everyone, I have a very big input file (about 120M), and I do not want to load it into memory at once. My goal is to check whether this file is encoded as valid UTF-8. Any ideas for a quick check that does not read the whole file content into memory as a byte[]? Simple sample code appreciated. ...
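One possible shape for a chunked check, sketched in Python rather than .NET. It uses the standard library's incremental decoder so that a multi-byte character split across two chunks is not misreported. The function name `is_valid_utf8_stream` and the 64 KiB chunk size are assumptions.

```python
import codecs
import io


def is_valid_utf8_stream(stream, chunk_size: int = 64 * 1024) -> bool:
    """Return True if the binary stream holds only valid UTF-8, reading it in chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        chunk = stream.read(chunk_size)
        try:
            # The decoder buffers an incomplete trailing sequence between calls.
            # On the empty last read, final=True rejects a sequence cut off at EOF.
            decoder.decode(chunk, final=not chunk)
        except UnicodeDecodeError:
            return False
        if not chunk:
            return True


# Tiny chunks force multi-byte characters to straddle chunk boundaries.
print(is_valid_utf8_stream(io.BytesIO("héllo wörld".encode("utf-8")), chunk_size=2))
```

For a real file, the stream would be opened in binary mode, for example `open(path, "rb")` with a path of your own.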

Make Emacs use UTF-8 with Python Interactive Mode

When I start Python from Mac OS' Terminal.app, python recognises the encoding as UTF-8: $ python3.0 Python 3.0.1 (r301:69556, May 18 2009, 16:44:01) [GCC 4.0.1 (Apple Inc. build 5465)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import sys >>> sys.stdout.encoding 'UTF-8' This works the same fo...

Reading a UTF8 CSV file with Python

I am trying to read a CSV file with accented characters with Python (only French and/or Spanish characters). Based on the Python 2.5 documentation for the csvreader (http://docs.python.org/library/csv.html), I came up with the following code to read the CSV file since the csvreader supports only ASCII. def unicode_csv_reader(unicode_csv...

iconv gives "Illegal Character" with smart quotes -- how to get rid of them?

I have a MySQL table with 120,000 lines stored in UTF-8 format. There is one field, product name, that contains text with many accents. I need to fill a second field with this same name after converting it to a url-friendly form (ASCII). Since PHP doesn't directly handle UTF-8, I'm using: $value = iconv ('UTF-8', 'ISO-8859-1', $value)...

What is the proper way to URL encode Unicode characters?

I know of the non-standard %uxxxx scheme but that doesn't seem like a wise choice since the scheme has been rejected by the W3C. Some interesting examples: The heart character. If I type this into my browser: http://www.google.com/search?q=♥ Then copy and paste it, I see this URL http://www.google.com/search?q=%E2%99%A5 which mak...
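The `%E2%99%A5` form in the browser example is the character's UTF-8 bytes, each percent-encoded. A short Python sketch using the standard `urllib.parse` module reproduces it; the helper name `percent_encode` is hypothetical.

```python
from urllib.parse import quote, unquote


def percent_encode(text: str) -> str:
    """Percent-encode text using its UTF-8 bytes (quote's default encoding)."""
    return quote(text, safe="")


encoded = percent_encode("♥")
print(encoded)           # the three UTF-8 bytes E2 99 A5, each as %XX
print(unquote(encoded))  # round-trips back to the original character
```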

Classic ASP - How to convert a string from UTF-8 to UCS-2

I have a problem where I am storing a UTF-8 string in SQL Server as UCS-2. When I pull it out to display on a page with the content type set to UTF-8, it works fine. But I have a third-party JavaScript component which, when I pass it the string from the database, renders it as UCS-2, not UTF-8. Is there a way in ASP to convert this string to ...

How to make flex (lexical scanner) read UTF-8 character input?

It seems that flex doesn't support UTF-8 input. Whenever the scanner encounters a non-ASCII char, it stops scanning as if it had reached EOF. Is there a way to force flex to eat my UTF-8 chars? I don't want it to actually match UTF-8 chars, just eat them when using the '.' pattern. Any suggestions? EDIT The simplest solution would be: ...

Switching all aspx files from local encoding to utf-8

Hello, how can I save all files in a directory as UTF-8? We need to change the default file encoding in IIS so that all foreign characters display correctly. The problem is that all the old files are saved in different (effectively random) encodings. Is there a way to open each file in its current encoding and safely save it as UTF-8? ...

Encoding problems in JSP

I have an HTML form with several text fields. When I try to submit non-English characters (Russian in my case), the server receives an "unreadable" string (not question marks, "???", but some strange characters). I simplified my code to show it here: <%@ taglib uri="http://java.sun.com/jsp/jstl/core" prefix="c" %> <%@ page contentType="text/...

Write to utf-8 file in python

I'm really confused with the codecs.open function. When I do: file = codecs.open("temp", "w", "utf-8") file.write(codecs.BOM_UTF8) file.close() It gives me the error UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128) If I do: file = open("temp", "w") file.write(codecs.BOM_UTF8) file.cl...
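The error in the excerpt comes from handing already-encoded bytes to a writer that expects text. A hedged sketch for Python 3 (not the Python 2 setup the question appears to use) lets the `utf-8-sig` codec emit the BOM instead of writing `codecs.BOM_UTF8` by hand. The function name `write_with_bom` and the temporary file path are assumptions.

```python
import codecs
import os
import tempfile


def write_with_bom(path: str, text: str) -> None:
    """Write text as UTF-8 preceded by a byte order mark."""
    # The "utf-8-sig" codec writes the BOM once at the start of the output.
    with open(path, "w", encoding="utf-8-sig") as handle:
        handle.write(text)


demo_path = os.path.join(tempfile.mkdtemp(), "temp.txt")
write_with_bom(demo_path, "héllo")
with open(demo_path, "rb") as handle:
    raw = handle.read()
print(raw.startswith(codecs.BOM_UTF8))
```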

Can Bison parse UTF-8 characters?

I'm trying to make a Bison parser to handle UTF-8 characters. I don't want the parser to actually interpret the Unicode character values, but I want it to parse the UTF-8 string as a sequence of bytes. Right now, Bison generates the following code which is problematic: if (yychar <= YYEOF) { yychar = yytoken = YYEOF; ...

Why can't I show accents in Latex?

As soon as I use accents in my text, it won't work anymore. It reports the error: ! Undefined control sequence. <argument> R\UTF {00E9}seau Ethernet l.88 \section{R\UTF{00E9}seau Ethernet} ? To explain the output a bit, I am trying to compile \section{Réseau Ethernet} in that line. I think it has to do with the enc...

MySQL C# Text Encoding Problems

I have an old MySQL database with encoding set to UTF-8. I am using the ADO.NET Entity Framework to connect to it. The strings that I retrieve from it contain strange characters where characters like ë are expected. For example: "ë" is "Ã«". I thought I could get this right by converting from UTF8 to UTF16. return Encoding.Unicode.GetString(...
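Symptoms like this are commonly produced when UTF-8 bytes are decoded as a single-byte encoding such as Latin-1. Whether that is the cause in this particular setup is an assumption. The Python sketch below only demonstrates the mechanism and its reversal; it is not a fix for the ADO.NET connection, and both function names are hypothetical.

```python
def garble(text: str) -> str:
    """Simulate the bug: encode as UTF-8, then wrongly decode as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled: str) -> str:
    """Undo it: recover the original bytes, then decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


broken = garble("ë")   # two characters, one per UTF-8 byte (0xC3, 0xAB)
print(broken, repair(broken))
```

The repair only round-trips when every garbled character maps back to a single Latin-1 byte, so it is best treated as a diagnostic rather than a general cure.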

using pyodbc on linux to insert unicode or utf-8 chars in a nvarchar mssql field

I am using Ubuntu 9.04. I have installed the following package versions: unixodbc and unixodbc-dev: 2.2.11-16build3 tdsodbc: 0.82-4 libsybdb5: 0.82-4 freetds-common and freetds-dev: 0.82-4 I have configured /etc/unixodbc.ini like this: [FreeTDS] Description = TDS driver (Sybase/MS SQL) Driver = /usr/lib/odbc/libt...

Collecting every word in document DOM tree with javascript

Suppose you have a large document with about 7000 words. I need to send all the data to the server. I cannot use jQuery, Prototype, etc.; it should be clean OO JavaScript. Sample page would be json russian page. I will exclude all tags and HTML markup from the words. My question is: 1. How can I collect/harvest all (UTF-8) words from do...

How can I convert character references to UTF-8 strings in Ruby?

I have some content from feeds. In these feeds, UTF-8 characters are often encoded as character references, i.e. "å" is "&#xE5;". To avoid double-encoding these in my views (i.e. "&amp;#xE5;"), I want to convert them back to normal UTF-8 characters. How can I do this in Ruby? I want: "&#xE5;".convert_to_utf8 => "å" ...
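To illustrate the conversion being asked for, here is a sketch in Python rather than Ruby, using the standard library's `html.unescape`. The helper name `refs_to_text` is hypothetical.

```python
import html


def refs_to_text(markup: str) -> str:
    """Replace numeric and named character references with the characters they denote."""
    return html.unescape(markup)


print(refs_to_text("&#xE5;"))      # the single character U+00E5
print(refs_to_text("&amp;#xE5;"))  # one pass peels off only one level of escaping
```

Note that this also decodes references such as `&lt;` and `&amp;`, so the result needs escaping again before it is placed back into HTML.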