I am working on a project that will require internationalisation support down the track. I want to get started on the right foot with UTF support, and I was wondering what the best practice for handling UTF in Erlang is?

From my research so far, it seems there are a couple of issues with Erlang's built-in string handling for some use cases (JSON parsing being a good example).

I have been looking at Starling and read (somewhere) recently that it is possibly going to be rolled into the standard Erlang release as the UTF 'standard'. Is this true? Are there other libraries or approaches I should be looking at?

From the comments:

EEP (Erlang Enhancement Proposal) 10 details Representing Unicode characters in Erlang

A:

This page:

http://erlang.org/doc/highlights.html

...lists highlights of release 5.7/OTP R13A. Note this passage:

1.2 Unicode support

Support for Unicode is implemented as described in EEP10. Formatting and reading of unicode data both from terminals and files is supported by the io and io_lib modules. Files can be opened in modes with automatic translation to and from different unicode formats. The module 'unicode' contains functions for conversion between external and internal unicode formats and the re module has support for unicode data. There is also language syntax for specifying string and character data beyond the ISO-latin-1 range.
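As a rough illustration of the `unicode` module and the `io` support mentioned in that passage, here is a minimal sketch. The module name `unicode_demo` and its function names are made up for this example; only `unicode:characters_to_binary/3`, `unicode:characters_to_list/2` and the `~ts` format modifier come from OTP (R13A or later). Note that the conversion functions return an `{error, ...}` or `{incomplete, ...}` tuple instead of raising on bad input, which this sketch does not handle.

```erlang
-module(unicode_demo).
-export([to_utf8/1, from_utf8/1, print/1]).

%% Encode a list of Unicode code points (or other chardata) as a UTF-8 binary.
to_utf8(Chars) ->
    unicode:characters_to_binary(Chars, unicode, utf8).

%% Decode a UTF-8 binary back into a list of Unicode code points.
from_utf8(Bin) when is_binary(Bin) ->
    unicode:characters_to_list(Bin, utf8).

%% Print chardata using the ~ts modifier, which treats the argument as
%% Unicode instead of ISO-latin-1.
print(Chars) ->
    io:format("~ts~n", [Chars]).
```

For example, `unicode_demo:to_utf8([$s, $m, 16#F6, $r])` yields the five bytes `<<115,109,195,182,114>>`, since U+00F6 takes two bytes in UTF-8, and `unicode_demo:from_utf8/1` turns that binary back into the original four code points.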

I don't like to make pronouncements on what best practices would be, but I often find it helpful to have a minimal, complete example to generalize from. Here's one way of getting UTF-8 data into an Erlang application and sending it out again to a different context. Assuming you have a MySQL database with a table whose rows contain a field with UTF-8 characters, here's one way to get it out and send it to a web browser as JSON:

hg clone http://bitbucket.org/justin/webmachine/ webmachine-read-only
cd webmachine-read-only
make
./scripts/new_webmachine.erl mywebdemo /tmp
svn checkout http://erlang-mysql-driver.googlecode.com/svn/trunk/ erlang-mysql-driver-read-only
cd erlang-mysql-driver-read-only/src
cp * /tmp/mywebdemo/src
svn checkout http://mochiweb.googlecode.com/svn/trunk/ mochiweb-read-only
cp mochiweb-read-only/src/mochijson2.erl /tmp/mywebdemo/src
cd /tmp/mywebdemo

Edit src/mywebdemo_resource.erl so it looks like this:

-module(mywebdemo_resource).
-export([init/1, to_html/2]). 

-include_lib("webmachine/include/webmachine.hrl").

init([]) -> {ok, undefined}.

to_html(ReqData, State) ->
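    %% Add your connection info. The 7th argument is the driver's arity-4 logging
    %% callback (a no-op here); the last argument sets the connection encoding.
    mysql:start_link(pool_id, "database.host.com", 3306, "db_user", "db_password", "db_name", fun(_, _, _, _) -> ok end, utf8),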
    {data, Res} = mysql:fetch(pool_id, "select * from table where IdWhatever = 13"),
    [[_, Utf8Str, _]] = mysql:get_result_rows(Res), %% pattern will need to be altered to match your table structure
    {mochijson2:encode({struct, [{Utf8Str, 100}]}), ReqData, State}.

Build everything and start the url dispatcher:

make
./start.sh

Then execute the following in a web page (or something more convenient, like MozRepl):

var req = new XMLHttpRequest();
req.open('GET', "http://localhost:8000", false); // synchronous request, for demonstration only
req.send(null);
JSON.parse(req.responseText); // parse the response as JSON instead of eval-ing it
mwt
A: 

As the previous poster mentioned, the latest release of Erlang supports Unicode natively. If you can't use the latest release, though, one thing I usually do is use binaries for string data. It keeps Erlang from mangling the bytes, as can happen with a list. It has the side effect of making lists of strings easier to handle as well.
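To make the binaries approach a little more concrete, here is a small sketch. The module `binary_strings` and its functions are hypothetical names invented for this example; it assumes the incoming data is already UTF-8 encoded and only relies on the bit syntax and standard BIFs.

```erlang
-module(binary_strings).
-export([greeting/1, all_strings/1]).

%% Name is assumed to be a binary holding already-encoded UTF-8 bytes.
%% The bytes are copied verbatim into the result; nothing is re-encoded.
greeting(Name) when is_binary(Name) ->
    <<"Hello, ", Name/binary, "!">>.

%% With binaries, a list of strings is easy to tell apart from a single
%% string: every element is a binary, not a nested list of integers.
all_strings(List) when is_list(List) ->
    lists:all(fun erlang:is_binary/1, List).
```

The trade-off is that `byte_size/1` on such a binary counts bytes, not characters, so anything that needs to reason about individual characters still has to decode the data first.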

Jeremy Wall