utf-8

How to convert a file to utf-8 in Python?

I need to convert a bunch of files to utf-8 in Python, and I have trouble with the "converting the file" part. I'd like to do the equivalent of: iconv -t utf-8 $file > converted/$file # this is shell code Thanks! ...
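
A minimal Python sketch of the iconv-style conversion asked about above. The function name convert_to_utf8 and the default source encoding (latin-1) are assumptions for illustration; the question does not say what encoding the input files use, so that value has to be supplied or detected separately.

```python
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str = "latin-1") -> None:
    """Re-encode the file at src as UTF-8 and write the result to dst."""
    # source_encoding is an assumption: the question does not state it.
    raw = src.read_bytes()
    text = raw.decode(source_encoding)
    # Create the output directory (e.g. "converted/") if it does not exist yet.
    dst.parent.mkdir(parents=True, exist_ok=True)
    # Working on bytes avoids any newline translation by text-mode I/O.
    dst.write_bytes(text.encode("utf-8"))
```

A call such as convert_to_utf8(Path(name), Path("converted") / name) would roughly mirror the shell redirect in the question.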

Unicode issues with acts_as_taggable_on_steroids

I'm implementing a blog with tags with some French characters. My question has to do with how to deal with spaces and Unicode (UTF-8) characters in the URL. Let's say I have a tag called: ohlàlà! and I have the following code in my tag cloud: <%= link_to h(tag.name.capitalize), { :controller => :blog, :action => :tag, :id => h(tag.name...

How do I reverse a UTF-8 string in place?

Recently, someone asked about an algorithm for reversing a string in place in C. Most of the proposed solutions had troubles when dealing with non single-byte strings. So, I was wondering what could be a good algorithm for dealing specifically with utf-8 strings. I came up with some code, which I'm posting as an answer, but I'd be glad ...
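
One commonly used approach, sketched here in Python on a bytearray rather than in C: reverse every byte first, then walk the buffer and flip each multi-byte sequence back into lead-byte-first order. This is not the code from the question's answer; the function name is hypothetical and the sketch assumes the input is already valid UTF-8. It reverses code points, so combining marks will end up attached to a different base character.

```python
def reverse_utf8_in_place(buf: bytearray) -> None:
    """Reverse the code points of valid UTF-8 stored in buf, in place."""
    # Pass 1: reverse all bytes. Each multi-byte sequence is now backwards:
    # its continuation bytes (0b10xxxxxx) come first, its lead byte last.
    buf.reverse()
    # Pass 2: restore lead-first order inside each sequence.
    n = len(buf)
    start = 0
    while start < n:
        end = start
        # Advance over continuation bytes until the lead (or ASCII) byte.
        while end < n - 1 and (buf[end] & 0xC0) == 0x80:
            end += 1
        # Swap the bytes of buf[start..end] pairwise.
        i, j = start, end
        while i < j:
            buf[i], buf[j] = buf[j], buf[i]
            i += 1
            j -= 1
        start = end + 1
```

The same two-pass structure carries over to a char array in C, since it only needs byte swaps and a mask test.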

How do I set the byte order marker for Unicode files?

I know this is not a "real" programming question. But, it relates to programming so I am going to ask it anyway. I have a program that I need to test that reads the Byte Order Marker of the file to see if it is utf-8 or utf-16. My problem is I cannot find a program/text editor that will allow me to set the byte order marker. Can anyb...
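
If no editor is at hand, test files with a chosen byte order mark can be produced with a few lines of script. A hedged Python sketch: the BOM constants come from the standard codecs module, while the function name write_with_bom and the BOMS table are made up for this example.

```python
import codecs

# Byte order marks from the standard library, keyed by codec name.
BOMS = {
    "utf-8": codecs.BOM_UTF8,          # EF BB BF
    "utf-16-le": codecs.BOM_UTF16_LE,  # FF FE
    "utf-16-be": codecs.BOM_UTF16_BE,  # FE FF
}


def write_with_bom(path: str, text: str, encoding: str) -> None:
    """Write text to path in the given encoding, preceded by its BOM."""
    with open(path, "wb") as f:
        f.write(BOMS[encoding])
        # The explicit -le/-be codecs do not add a BOM themselves.
        f.write(text.encode(encoding))
```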

How to make MySQL handle UTF-8 properly

One of the responses to a question I asked yesterday suggested that I should make sure my database can handle UTF-8 characters correctly. Anyone know how I can do this with MySQL? Thanks! Ben ...

Malformed UTF characters

I want to detect malformed UTF-8 characters and replace them with a blank space, using a Perl script, while loading the data with SQL*Loader. How can I do this? ...
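
The question asks for Perl; the sketch below shows the same idea in Python instead, as a pre-processing step that would run before the load. It registers a custom decode error handler that substitutes one space for each malformed byte run the decoder reports. The handler name "space" and the function names are assumptions for illustration.

```python
import codecs


def _replace_with_space(exc):
    """Decode error handler: emit one space and resume after the bad bytes."""
    if isinstance(exc, UnicodeDecodeError):
        return (" ", exc.end)
    raise exc


# "space" is an arbitrary handler name chosen for this sketch.
codecs.register_error("space", _replace_with_space)


def blank_out_malformed(raw: bytes) -> str:
    """Decode raw as UTF-8, turning each malformed sequence into a space."""
    return raw.decode("utf-8", errors="space")
```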

How do I input 4-byte UTF-8 characters?

I am writing a small app which I need to test with UTF-8 characters of different byte lengths. I can input Unicode characters to test that are encoded in UTF-8 with 1, 2 and 3 bytes just fine by doing, for example: string in = "pi = \u03a0"; But how do I get a Unicode character that is encoded with 4 bytes? I have tried: str...
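
For comparison, a short Python sketch (the question's own language looks different, so this is only an illustration). Code points above U+FFFF need four bytes in UTF-8, and Python writes them with the eight-digit \U escape. The sample characters and the SAMPLES name are chosen for this example.

```python
# One sample character per UTF-8 length; the choice of characters is arbitrary.
SAMPLES = {
    1: "A",           # U+0041
    2: "\u03a0",      # U+03A0 GREEK CAPITAL LETTER PI
    3: "\u20ac",      # U+20AC EURO SIGN
    4: "\U0001D11E",  # U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP)
}

for expected_length, char in SAMPLES.items():
    encoded = char.encode("utf-8")
    print(expected_length, len(encoded), encoded.hex())
```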

Elegant way to search for UTF-8 files with BOM?

For debugging purposes, I need to recursively search a directory for all files which start with a UTF-8 byte order mark (BOM). My current solution is a simple shell script: find -type f | while read file do if [ "`head -c 3 -- "$file"`" == $'\xef\xbb\xbf' ] then echo "found BOM in: $file" fi done Or, if you prefer s...
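
An alternative to the shell loop, sketched in Python under the assumption that reading the first three bytes of every regular file is acceptable. The function name files_with_utf8_bom is hypothetical.

```python
import codecs
import os
from typing import Iterator


def files_with_utf8_bom(root: str) -> Iterator[str]:
    """Yield paths under root whose first three bytes are the UTF-8 BOM."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    if f.read(3) == codecs.BOM_UTF8:
                        yield path
            except OSError:
                # Unreadable files (permissions, broken links) are skipped.
                continue
```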

looking for a UTF-8 text editor

I am looking for a (simple) text editor that can handle text in different encodings in the same document. I need to develop some sites with mixed Japanese and English text and the editors I have now (on an English Windows system) are unable to display the Japanese text. Jedit files don't display the Japanese text I have inputted but whe...

How do I root out a mystery character encoding problem in a WordPress blog?

I am attempting to start a new WordPress blog. I am seeing funny characters in some browsers but not others instead of single quotes, double quotes and ellipses. Things I already thought of: the HTML template page for output itself is set to UTF-8; the admin page is UTF-8; the MySQL database tables where the data is stored are UTF-8 en...

How do I replace accented Latin characters in Ruby?

I have an ActiveRecord model, Foo, which has a name field. I'd like users to be able to search by name, but I'd like the search to ignore case and any accents. Thus, I'm also storing a canonical_name field against which to search: class Foo validates_presence_of :name before_validate :set_canonical_name private def set_cano...
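
The question is about Ruby; the underlying technique is shown here in Python instead. Decompose the string with Unicode normalization, drop the combining marks, then fold case. The function name canonical_name borrows the field name from the question and is otherwise an assumption. Letters that have no decomposition, such as ø, pass through unchanged, so this is a sketch rather than a complete transliteration.

```python
import unicodedata


def canonical_name(name: str) -> str:
    """Return name with accents stripped and case folded, for searching."""
    # NFKD splits "à" into "a" followed by a combining grave accent.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()
```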

Importing extended ASCII into Oracle

I have a procedure that imports a binary file containing some strings. The strings can contain extended ASCII, e.g. CHR(224), 'à'. The procedure is taking a RAW and converting the BCD bytes into characters in a string one by one. The problem is that the extended ASCII characters are getting lost. I suspect this is due to their values me...

UTF8 LAMP Resources

At work, I'm beginning to have some issues with character encoding. I'd like to make our web app use UTF-8 all the way around. After a few hours of googling, I've only found a few sites with information on a UTF-8 LAMP setup. Does anyone know of any good resources online about UTF-8, Linux, Apache, MySQL and PHP? I'll post what I've foun...

OS X file duplication converts text encoding by default

All the PHP files in my workspace are encoded in Unicode (UTF-8, no BOM). I often duplicate an existing source file to use as a base for a new script. Invariably (with Path Finder or the original Finder), OS X will convert the encoding of the duplicate file to Western (Mac OS Roman). Is there any way to make OS X behave and not convert ...

Simplest way to convert unicode codepoint into UTF-8

What's the simplest way to convert a Unicode codepoint into a UTF-8 byte sequence in C? The only way that springs to mind is using iconv to map from the UTF-32LE codepage to UTF-8, but that seems like overkill. ...
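
The encoding can be done by hand with a few shifts and masks. The sketch below is in Python rather than C, but uses only integer operations, so it should carry over to C nearly line for line. The function name codepoint_to_utf8 is hypothetical.

```python
def codepoint_to_utf8(cp: int) -> bytes:
    """Encode a single Unicode code point as a UTF-8 byte sequence."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:  # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:  # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```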

Can I recover international characters mistakenly stored in a varchar field?

My client has an old MS SQL 2000 database that uses varchar(50) fields to store names. He tried to use this database to capture some data (via a web form). Some of the form-fillers are from other countries, and the varchar fields went nutty when some of these folks entered their names. Is it possible to recover the data somehow? Maybe by...

VS2008 Express: How to save as UTF-8 all files by default?

Hi, Is there any way to make Visual Studio 2008 Express store all the files as UTF-8 by default? Thanks for your time. Best regards. ...

Outlook autocleaning my line breaks and screwing up my email format

I'm sending an email using the dotnet framework. Here is the template that I'm using to create the message: Date of Hire: %HireDate% Annual Salary: %AnnualIncome% Reason for Request: %ReasonForRequest% Name of Voluntary Employee: %FirstName% %LastName% Total Coverage Applied For: %EECoverageAmount% Guaranteed Coverage Portion: %GICove...

HtmlEncode UTF-8

I'm using Server.HtmlEncode on a utf-8 string in asp-classic, which works fine until there are some accents in the string e.g. Rüstü Recber, which appears as RÃ¼stÃ¼ Recber (R&#195;&#188;st&#195;&#188; Recber in the source). I've tried setting the Response.Charset property to utf-8 but this doesn't make any difference. ...
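
The entities quoted in the question, 195 and 188, are the two UTF-8 bytes of ü (0xC3 0xBC), which suggests each byte was treated as a separate single-byte character. The Python sketch below only reproduces that byte-level effect; it does not describe what Server.HtmlEncode does internally, and both function names are made up for the example.

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode each byte as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal HTML entity."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)


garbled = misread_utf8_as_latin1("R\u00fcst\u00fc Recber")
print(to_numeric_entities(garbled))
```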

How to use regex for utf8 in ruby

In RoR, how can I validate a Chinese or Japanese word in a posting form that uses UTF-8? With GBK, [\u4e00-\u9fa5]+ is used to validate Chinese words. In PHP, /^[\x{4e00}-\x{9fa5}]+$/u is used for UTF-8 pages. ...
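
For reference, the same character range expressed in Python rather than Ruby. The range U+4E00 to U+9FA5 is taken from the question; it covers common CJK ideographs but not Japanese kana, so a Japanese form would need additional ranges. The names CJK_RANGE and is_all_chinese are assumptions for this sketch.

```python
import re

# Range copied from the question; it does not include hiragana or katakana.
CJK_RANGE = re.compile(r"[\u4e00-\u9fa5]+")


def is_all_chinese(word: str) -> bool:
    """Return True if word is non-empty and every character is in the range."""
    return CJK_RANGE.fullmatch(word) is not None
```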