Efficient String Implementation in Haskell

views:

999

answers:

+8 Q:

Efficient String Implementation in Haskell

I'm currently teaching myself Haskell, and I'm wondering what the best practices are when working with strings in Haskell.

The default string implementation in Haskell is a list of Char. This is inefficient for file input-output, according to Real World Haskell, since each character is separately allocated (I assume that this means that a String is basically a linked list in Haskell, but I'm not sure.)

But if the default string implementation is inefficient for file i/o, is it also inefficient for working with Strings in memory? Why or why not? C uses an array of char to represent a String, and I assumed that this would be the default way of doing things in most languages.

As I see it, the list implementation of String will take up more memory, since each character will require overhead, and also more time to iterate over, because a pointer dereferencing will be required to get to the next char. But I've liked playing with Haskell so far, so I want to believe that the default implementation is efficient.

+18 A:

Best practices for working with strings performantly in Haskell are basically: Use Data.ByteString/Data.ByteString.Lazy.

http://hackage.haskell.org/packages/archive/bytestring/latest/doc/html/

As far as the efficiency of the default string implementation goes in Haskell, it's not. Each Char represents a Unicode codepoint which means it needs at least 21bits per Char.

Since a String is just [Char], that is a linked list of Char, it means Strings have poor locality of reference, and again means that Strings are fairly large in memory, at a minimum it's N * (21bits + Mbits) where N is the length of the string and M is the size of a pointer (32, 64, what have you) and unlike many other places where Haskell uses lists where other languages might use different structures (I'm thinking specifically of control flow here), Strings are much less likely to be able to be optimized to loops, etc. by the compiler.

And while a Char corresponds to a codepoint, the Haskell 98 report doesn't specify anything about the encoding used when doing file IO, not even a default much less a way to change it. In practice GHC provides an extensions to do e.g. binary IO, but you're going off the reservation at that point anyway.

Even with operations like prepending to front of the string it's unlikely that a String will beat a ByteString in practice.

Logan Capaldo 2009-02-23 00:54:58

+1 exactly the package I was going to answer. ByteString stores strings as offsets into byte arrays. Data.ByteString.Char8 lets you use Chars directly in the ByteStrings by assuming that only the bottom 8 bits are important (i.e. ASCII). ByteString also provides its own efficient IO functions.

Chris Smith 2009-02-23 01:00:25

+5 A:

The answer is a bit more complex than just "use lazy bytestrings".

Byte strings only store 8 bits per value, whereas String holds real Unicode characters. So if you want to work with Unicode then you have to convert to and from UTF-8 or UTF-16 all the time, which is more expensive than just using strings. Don't make the mistake of assuming that your program will only need ASCII. Unless its just throwaway code then one day someone will need to put in a Euro symbol (U+20AC) or accented characters, and your nice fast bytestring implementation will be irretrievably broken.
Byte strings make some things, like prepending to the start of a string, more expensive.

That said, if you need performance and you can represent your data purely in bytestrings, then do so.

Paul Johnson 2009-02-24 08:17:29

+10 A:

Apart from String/ByteString there is now the Text library which combines the best of both worlds—it works with Unicode while being ByteString-based internally, so you get fast, correct strings.

Porges 2009-03-09 14:31:59

Nice; +1, thank you Porges.

Rob Lachlan 2009-03-09 15:52:48

+3 A:

The basic answer given, use ByteString, is correct. That said, all of the three answers before mine have inaccuracies.

Regarding UTF-8: whether this will be an issue or not depends entirely on what sort of processing you do with your strings. If you're simply treating them as single chunks of data (which includes operations such as concatenation, though not splitting), or doing certain limited byte-based operations (e.g., finding the length of the string in bytes, rather than the length in characters), you won't have any issues. If you are using I18N, there are enough other issues that simply using String rather than ByteString will start to fix only a very few of the problems you'll encounter.

Prepending single bytes to the front of a ByteString is probably more expensive than doing the same for a String. However, if you're doing a lot of this, it's probably possible to find ways of dealing with your particular problem that are cheaper.

But the end result would be, for the poster of the original question: yes, Strings are inefficient in Haskell, though rather handy. If you're worried about efficiency, use ByteStrings, and view them as either arrays of Char8 or Word8, depending on your purpose (ASCII/ISO-8859-1 vs Unicode of some sort, or just arbitrary binary data). Generally, use Lazy ByteStrings (where prepending to the start of a string is actually a very fast operation) unless you know why you want non-lazy ones (which is usually wrapped up in an appreciation of the performance aspects of lazy evaluation).

For what it's worth, I am building an automated trading system entirely in Haskell, and one of the things we need to do is very quickly parse a market data feed we receive over a network connection. I can handle reading and parsing 300 messages per second with a negligable amount of CPU; as far as handling this data goes, GHC-compiled Haskell performs close enough to C that it's nowhere near entering my list of notable issues.

Curt Sampson 2009-05-14 20:35:20

ansaurus

tags:

views:

answers:

Efficient String Implementation in Haskell

related questions