I just want to read (and maybe write) UTF-8 data. haskell.org still advertises System.Streams, which does not compile with recent GHC:

% runhaskell Setup.lhs configure
Configuring Streams-0.2.1...
runhaskell Setup.lhs build
Preprocessing library Streams-0.2.1...
Building Streams-0.2.1...
[10 of 45] Compiling System.FD        ( System/FD.hs, dist/build/System/FD.o )

System/FD.hs:138:22:
    Couldn't match expected type `GHC.IOBase.FD'
           against inferred type `FD'
    In the first argument of `fdType', namely `fd'
    In a 'do' expression: fd_type <- fdType fd
    In the expression:
        let
          oflags1 = case mode of
                      ReadMode -> ...
                      WriteMode -> ...
                      ReadWriteMode -> ...
                      AppendMode -> ...
          binary_flags | binary = o_BINARY
                       | otherwise = 0
          oflags = oflags1 .|. binary_flags
        in
          do fd <- fdOpen filepath oflags 438
             fd_type <- fdType fd
               when (mode == WriteMode && fd_type == RegularFile)
             $ do fdSetFileSize fd 0
             ....

Similar problem with Streams 0.1. I cannot get more recent versions since the official site is down:

% wget http://files.pupeno.com/software/streams/Streams-0.1.7.tar.bz2
--2009-07-30 15:36:14--  http://files.pupeno.com/software/streams/Streams-0.1.7.tar.bz2
Resolving files.pupeno.com... failed: Name or service not known.
wget: unable to resolve host address `files.pupeno.com'

Is there a better solution? Or darcs source code somewhere?

A: 

UTF-8 strings are just byte sequences, so it should be possible to read and write them as is. The first 128 code points (0 to 127), including whitespace, are encoded exactly as in ASCII. Of course, you would need your own functions for manipulating the strings, since every other character is a multi-byte sequence.
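To make the byte-versus-character distinction concrete, here is a small sketch that assumes the text and bytestring packages are available. The helper name utf8Bytes is made up for this illustration; it is not part of any library.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- Hypothetical helper: the UTF-8 bytes that encode a Haskell String.
-- ASCII characters come out as one byte each; other characters take
-- two to four bytes.
utf8Bytes :: String -> [Word8]
utf8Bytes = B.unpack . TE.encodeUtf8 . T.pack
```

For example, utf8Bytes "A" is a single byte, while utf8Bytes "\233" (the letter e with an acute accent) is two bytes. That is why reading the file as raw bytes is not the same as reading it as characters.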

Juan
I want to interpret them, of course, not just read them as binary. Haskell strings are made of characters, not bytes.
bortzmeyer
You didn't say you wanted to interpret them. There are applications where you don't need to interpret the actual string data.
Juan
+1  A: 

Edit:

L. Kolmodin is right: utf8-string or text is the right answer. I'll leave my original answer below for reference. Google seems to have steered me wrong in choosing IConv. (The equivalent of my IConv wrapper function is already in utf8-string as Codec.Binary.UTF8.String.encodeString.)


Here is what I've been using--I may not remember the complete solution, so let me know if you still run into problems:

From Hackage, install IConv. Unfortunately, Codec.Text.IConv.convert operates on bytestrings, not strings. I guess you could read files directly as bytestrings, but I wrote a converter since HaXml uses normal strings:

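import qualified Data.ByteString.Lazy.Char8 as B
import Codec.Text.IConv (convert)
-- Treats each Char as a Latin-1 byte; the result holds one Char per UTF-8 byte.
utf8FromLatin1 = B.unpack . convert "LATIN1" "UTF-8" . B.pack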

Now, on Mac OS, you have to compile with

$ ghc -O2 --make -L/usr/lib -L/opt/local/lib Whatever.hs

Because there was some library conflict, I think with MacPorts, I have to point explicitly to the built-in iconv libraries. There is probably a way to always pass those -L flags to ghc, but I haven't looked it up yet.

Nathan Sanders
+4  A: 

Use the utf8-string or the more recent text package.

View the list of packages on hackage.
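As a rough sketch of what the text route might look like, assuming the text and bytestring packages are installed: read the file as raw bytes, then decode those bytes as UTF-8. The names readUtf8File and writeUtf8File are invented for this example and are not library functions.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper: read a file as bytes and decode it as UTF-8.
-- decodeUtf8 throws an exception on malformed input; decodeUtf8'
-- returns an Either instead, if that is preferable.
readUtf8File :: FilePath -> IO T.Text
readUtf8File path = TE.decodeUtf8 <$> B.readFile path

-- Hypothetical helper: encode Text as UTF-8 and write the bytes out.
writeUtf8File :: FilePath -> T.Text -> IO ()
writeUtf8File path = B.writeFile path . TE.encodeUtf8
```

Going through ByteString keeps the result independent of the system locale. With utf8-string, the functions toString and fromString in Data.ByteString.UTF8 play a similar role for plain String values.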

L. Kolmodin
100% agreement with Kolmodin. Use utf8-string or Data.Text
Don Stewart
Data.Text seems very recent, it is not packaged for Debian (either "stable" or "testing"). I could install it from Hackage but I prefer to keep my systems "clean".
bortzmeyer
utf8-string is available with ghc, is simple to use and filled my needs. Thanks, accepted.
bortzmeyer