questions about utf-8

How to convert a file to utf-8 in Python?

I need to convert a bunch of files to utf-8 in Python, and I have trouble with the "converting the file" part. I'd like to do the equivalent of:

    iconv -t utf-8 $file > converted/$file # this is shell code

Thanks! ...

python

encoding

files

utf-8
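A minimal sketch of one way to do the re-encoding step in Python. The function name `convert_to_utf8` and the `source_encoding` default are assumptions for illustration; the question does not say which encoding the files start in, and `iconv` without `-f` falls back to the locale's encoding.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="latin-1"):
    """Re-encode the text file `src` as UTF-8 and write the result to `dst`.

    `source_encoding` is an assumption: it must match how `src` was
    actually written, otherwise the output will be garbled.
    """
    src, dst = Path(src), Path(dst)
    # Create the output directory (e.g. converted/) if it does not exist yet.
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Work on raw bytes so line endings are passed through untouched.
    text = src.read_bytes().decode(source_encoding)
    dst.write_bytes(text.encode("utf-8"))


# Hypothetical usage, mirroring the shell line in the question:
# convert_to_utf8("notes.txt", "converted/notes.txt", source_encoding="latin-1")
```

Decoding raises `UnicodeDecodeError` if the bytes are not valid in the declared source encoding, which is usually preferable to silently writing bad output.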

Unicode issues with acts_as_taggable_on_steroids

I'm implementing a blog with tags with some French characters. My question has to do with how to deal with spaces and Unicode (UTF-8) characters in the URL. Let's say I have a tag called ohlàlà! and I have the following code in my tag cloud: <%= link_to h(tag.name.capitalize), { :controller => :blog, :action => :tag, :id => h(tag.name...

ruby-on-rails

unicode

utf-8

How do I reverse a UTF-8 string in place?

Recently, someone asked about an algorithm for reversing a string in place in C. Most of the proposed solutions had troubles when dealing with non single-byte strings. So, I was wondering what could be a good algorithm for dealing specifically with utf-8 strings. I came up with some code, which I'm posting as an answer, but I'd be glad ...
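One common approach, sketched here in Python on a `bytearray` rather than in C: reverse all the bytes, then walk the buffer and flip each multi-byte sequence back so its lead byte comes first again. The function name is hypothetical, the input is assumed to be valid UTF-8, and this is not the code the asker refers to.

```python
def reverse_utf8_in_place(buf: bytearray) -> None:
    """Reverse the code points of valid UTF-8 held in `buf`, modifying `buf`."""
    # Step 1: reverse every byte. Code points are now in the right order,
    # but each multi-byte sequence has its own bytes backwards.
    buf.reverse()

    # Step 2: restore each sequence. After step 1, the continuation bytes
    # (bit pattern 10xxxxxx) of a sequence come before its lead byte.
    i, n = 0, len(buf)
    while i < n:
        j = i
        while j < n and (buf[j] & 0xC0) == 0x80:
            j += 1
        # buf[i..j] is one reversed sequence ending at its lead byte.
        # The slice assignment keeps the length unchanged.
        buf[i:j + 1] = buf[i:j + 1][::-1]
        i = j + 1


data = bytearray("ohlàlà!".encode("utf-8"))
reverse_utf8_in_place(data)
print(data.decode("utf-8"))
```

Note that this reverses code points, not user-perceived characters: a base letter followed by a combining mark will come out with the mark in front of it.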

How do I set the byte order marker for Unicode files?

I know this is not a "real" programming question. But, it relates to programming so I am going to ask it anyway. I have a program that I need to test that reads the Byte Order Marker of the file to see if it is utf-8 or utf-16. My problem is I cannot find a program/text editor that will allow me to set the byte order marker. Can anyb...
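If no editor is at hand, test files can also be generated by a short script. A minimal sketch, assuming Python is available; `write_with_bom` is a hypothetical helper name, and the BOM constants come from the standard `codecs` module.

```python
import codecs

# Map each encoding name to the byte order mark that identifies it.
BOMS = {
    "utf-8": codecs.BOM_UTF8,          # EF BB BF
    "utf-16-le": codecs.BOM_UTF16_LE,  # FF FE
    "utf-16-be": codecs.BOM_UTF16_BE,  # FE FF
}


def write_with_bom(path, text, encoding="utf-8"):
    """Write `text` to `path` in `encoding`, preceded by the matching BOM."""
    # These three codecs do not emit a BOM themselves, so it is added by hand.
    with open(path, "wb") as f:
        f.write(BOMS[encoding] + text.encode(encoding))


# Hypothetical usage:
# write_with_bom("sample-utf16le.txt", "hello", "utf-16-le")
```

Writing the file in binary mode keeps full control over the first bytes, which is what a BOM-sniffing program will inspect.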

How to make MySQL handle UTF-8 properly

One of the responses to a question I asked yesterday suggested that I should make sure my database can handle UTF-8 characters correctly. Anyone know how I can do this with MySQL? Thanks! Ben ...

mysql

utf-8

Malformed UTF characters

I want to detect malformed UTF-8 characters and replace them with a blank space using a Perl script while loading data with SQL*Loader. How can I do this? ...

utf-8

character-encoding
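The question asks for Perl; the same idea is sketched below in Python to show the technique only. The decoder is given a custom error handler that substitutes a space for each malformed run it reports. The handler name `malformed_to_space` and the function name are hypothetical.

```python
import codecs


def _space_for_malformed(error):
    """Error handler: replace the bytes the decoder rejected with one space."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    # Return the replacement text and the position where decoding resumes.
    return (" ", error.end)


# Register the handler under a hypothetical name so decode() can refer to it.
codecs.register_error("malformed_to_space", _space_for_malformed)


def replace_malformed_utf8(data: bytes) -> str:
    """Decode `data` as UTF-8, turning malformed sequences into spaces."""
    return data.decode("utf-8", errors="malformed_to_space")


print(replace_malformed_utf8(b"abc\xffdef"))
```

A custom handler is used instead of `errors="replace"` followed by a string replace, because the latter would also blank out any genuine U+FFFD characters already present in the data.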

How do I input 4-byte UTF-8 characters?

I am writing a small app which I need to test with utf-8 characters of different byte lengths. I can input unicode characters to test that are encoded in utf-8 with 1, 2 and 3 bytes just fine by doing, for example: string in = "pi = \u3a0"; But how do I get a unicode character that is encoded with 4-bytes? I have tried: str...

c++

unicode

utf-8
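For reference, code points above U+FFFF are the ones that need four bytes in UTF-8, and they are written with the eight-digit `\U` escape rather than the four-digit `\u` one. The sketch below uses Python to illustrate the byte lengths; the sample characters are chosen for illustration and are not from the question.

```python
# One sample character for each UTF-8 encoded length.
samples = {
    "Latin A": "\u0041",            # U+0041, 1 byte
    "Greek Pi": "\u03a0",           # U+03A0, 2 bytes
    "Euro sign": "\u20ac",          # U+20AC, 3 bytes
    "Gothic hwair": "\U00010348",   # U+10348, 4 bytes (needs the 8-digit escape)
}

for name, ch in samples.items():
    encoded = ch.encode("utf-8")
    print(f"{name}: U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```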

Elegant way to search for UTF-8 files with BOM?

For debugging purposes, I need to recursively search a directory for all files which start with a UTF-8 byte order mark (BOM). My current solution is a simple shell script:

    find -type f | while read file
    do
        if [ "`head -c 3 -- "$file"`" == $'\xef\xbb\xbf' ]
        then
            echo "found BOM in: $file"
        fi
    done

Or, if you prefer s...
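As an alternative to the shell loop, the same check can be sketched in Python: walk the tree and compare the first three bytes of each file against the UTF-8 BOM. The function name `files_with_utf8_bom` is hypothetical.

```python
import codecs
import os


def files_with_utf8_bom(root):
    """Yield the paths of files under `root` that start with the UTF-8 BOM."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    # codecs.BOM_UTF8 is the three bytes EF BB BF.
                    if f.read(3) == codecs.BOM_UTF8:
                        yield path
            except OSError:
                # Skip files that cannot be opened (permissions, broken links).
                continue


# Hypothetical usage:
# for path in files_with_utf8_bom("."):
#     print("found BOM in:", path)
```

Because filenames are never passed through a shell, names containing spaces or newlines need no special handling.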

looking for a UTF-8 text editor

I am looking for a (simple) text editor that can handle text in different encodings in the same document. I need to develop some sites with mixed Japanese and English text and the editors I have now (on an English Windows system) are unable to display the Japanese text. Jedit files don't display the Japanese text I have inputted but whe...

How do I root out a mystery character encoding problem in a Wordpress blog?

I am attempting to start a new Wordpress blog. I am seeing funny characters in some browsers but not others instead of single quotes, double quotes and ellipses. Things I already thought of:

- The HTML template page for output itself is set to UTF-8
- The admin page is UTF-8
- The MySQL database tables where the data is stored are UTF-8 en...

wordpress

encoding

utf-8

How do I replace accented Latin characters in Ruby?

I have an ActiveRecord model, Foo, which has a name field. I'd like users to be able to search by name, but I'd like the search to ignore case and any accents. Thus, I'm also storing a canonical_name field against which to search:

    class Foo
      validates_presence_of :name
      before_validate :set_canonical_name
      private
      def set_cano...

ruby

unicode

utf-8
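The question is about Ruby; the usual technique is language-independent and is sketched here in Python: decompose each character (Unicode NFKD normalization), drop the combining marks, then fold case. The function name `canonical_name` echoes the field in the question but is otherwise hypothetical.

```python
import unicodedata


def canonical_name(name: str) -> str:
    """Return `name` with accents removed and case folded, for searching."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # combining() is non-zero for combining marks, so those are dropped.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for caseless matching.
    return stripped.casefold()


print(canonical_name("Ohlàlà"))
print(canonical_name("Rüstü Recber"))
```

Letters that have no decomposition, such as "ø", pass through unchanged, so a small explicit mapping table may still be needed for those.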

Importing extended ASCII into Oracle

I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one. The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values me...

oracle

conversion

utf-8

UTF8 LAMP Resources

At work, I'm beginning to have some issues with character encoding. I'd like to make our web app use UTF-8 all the way around. After a few hours of googling, I've only found a few sites with information on a UTF-8 LAMP setup. Does anyone know of any good resources online about UTF-8, Linux, Apache, MySql and PHP? I'll post what I've foun...

utf-8

lamp

online-resources

OS X file duplication converts text encoding by default

All the PHP files in my workspace are encoded in Unicode (UTF-8, no BOM). I often duplicate an existing source file to use as a base for a new script. Invariably (with Path Finder or the original Finder), OS X will convert the encoding of the duplicate file to Western (Mac OS Roman). Is there any way to make OS X behave and not convert ...

Simplest way to convert unicode codepoint into UTF-8

What's the simplest way to convert a Unicode codepoint into a UTF-8 byte sequence in C? The only way that springs to mind is using iconv to map from the UTF-32LE codepage to UTF-8, but that seems like overkill. ...

c

unicode

utf-8
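The encoding rule itself is only a few shifts and masks, so a hand-written version is short. The sketch below is in Python for brevity rather than C; the bit operations carry over directly. The function name `codepoint_to_utf8` is hypothetical.

```python
def codepoint_to_utf8(cp: int) -> bytes:
    """Encode a single Unicode code point as a UTF-8 byte sequence."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Out of range, or a surrogate, which UTF-8 does not encode.
        raise ValueError(f"not a valid Unicode scalar value: {cp!r}")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(codepoint_to_utf8(0x20AC).hex())
```

Each branch corresponds to one row of the UTF-8 bit-layout table, with the thresholds 0x80, 0x800 and 0x10000 marking where one more byte becomes necessary.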

Can I recover international characters mistakenly stored in a varchar field?

My client has an old MS SQL 2000 database that uses varchar(50) fields to store names. He tried to use this database to capture some data (via a web form). Some of the form-fillers are from other countries, and the varchar fields went nutty when some of these folks entered their names. Is it possible to recover the data somehow? Maybe by...

VS2008 Express: How to save as UTF-8 all files by default?

Hi, Is there any way to make Visual Studio 2008 Express store all the files as UTF-8 by default? Thanks for your time. Best regards. ...

visual-studio-2008

utf-8

visual-studio-express

Outlook autocleaning my line breaks and screwing up my email format

I'm sending an email using the dotnet framework. Here is the template that I'm using to create the message:

    Date of Hire: %HireDate%
    Annual Salary: %AnnualIncome%
    Reason for Request: %ReasonForRequest%
    Name of Voluntary Employee: %FirstName% %LastName%
    Total Coverage Applied For: %EECoverageAmount%
    Guaranteed Coverage Portion: %GICove...

HtmlEncode UTF-8

I'm using Server.HtmlEncode on a utf-8 string in asp-classic, which works fine until there are some accents in the string e.g. Rüstü Recber, which appears as RÃ¼stÃ¼ Recber (RÃ¼stÃ¼ Recber in the source). I've tried setting the Response.Charset property to utf-8 but this doesn't make any difference. ...
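The garbled output has the typical shape of UTF-8 bytes being read as a single-byte Western encoding. The sketch below reproduces the symptom in Python; that Windows-1252 (`cp1252`) is the encoding involved is an assumption, and it illustrates the symptom rather than diagnosing the ASP setup.

```python
original = "Rüstü Recber"

# Encode correctly as UTF-8: "ü" becomes the two bytes C3 BC.
utf8_bytes = original.encode("utf-8")

# Misinterpret those bytes as Windows-1252: C3 shows as "Ã", BC as "¼".
garbled = utf8_bytes.decode("cp1252")
print(garbled)

# The damage is reversible as long as no bytes were lost along the way.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)
```

If this matches what the page shows, some step between the string and the browser is treating the UTF-8 data as a single-byte code page.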

How to use regex for utf8 in ruby

In RoR, how do I validate a Chinese or Japanese word in a posting form that uses UTF-8 encoding? With GBK encoding, [\u4e00-\u9fa5]+ is used to validate Chinese words. In PHP, /^[\x{4e00}-\x{9fa5}]+$/u is used for UTF-8 pages. ...