views:

901

answers:

4

Hi,

I am new to Haskell and I am trying to implement a few known algorithms in it.

I have implemented merge sort on strings. I am a bit disappointed with the performance of my Haskell implementation compared to C and Java implementations. On my machine (Ubuntu Linux, 1.8 GHz), C (gcc 4.3.3) sorts 1 000 000 strings in 1.85 s, Java (Java SE 1.6.0_14) in 3.68 s, Haskell (GHC 6.8.2) in 25.89 s. With larger input (10 000 000 strings), C takes 21.81 s, Java takes 59.68 s, Haskell starts swapping and I preferred to stop the program after several minutes.

Since I am new to Haskell, I would be interested to know if my implementation can be made more time / space efficient.

Thank you in advance for any hint Giorgio

My implementation:

merge :: [String] -> [String] -> [String]
merge [] ys = ys
merge xs [] = xs
merge (x:xs) (y:ys) = if x < y
                        then x : (merge xs (y:ys))
                        else y : (merge (x:xs) ys)

mergeSort :: [String] -> [String]
mergeSort xs = if (l < 2)
                 then xs
                 else merge h t
               where l = length xs
                     n = l `div` 2
                     s = splitAt n xs
                     h = mergeSort (fst s)
                     t = mergeSort (snd s)
+3  A: 

I am not sure if this is the cause of your problem, but remember that lists are a sequential data structure. In particular, both length xs and splitAt n xs will take an amount of time proportional to the length of the list (O(n)).

In C and Java, you are most probably using arrays, which take constant time for both operations (O(1)).

Edit: answering your question on how to make it more efficient, you can use arrays in Haskell too.

CesarB
But you probably don't want to for representing strings. Norman Ramsey is right: that's what `Data.ByteString` is for.
Alexey Romanov
@alexey_r: I meant arrays for replacing [String], not for replacing String aka [Char] itself. Replacing String is a separate optimization.
CesarB
+9  A: 

In Haskell, a string is a lazy list of characters and has the same overhead as any other list. If I remember right from a talk I heard Simon Peyton Jones give in 2004, the space cost in GHC is 40 bytes per character. For an apples-to-apples comparation you probably should be sorting Data.ByteString, which is designed to give performance comparable to other languages.

Norman Ramsey
Thanks for the hint. I am not sure whether ByteString is the same as String. As far as I know String :: [Char]where Char is a unicode charactoer.On the other hand, BytyString contains a string of Word8, i.e. ofbytes. Then I should make sure that my input is in an one-byte-percharacter encoding, e.g. Latin1.Otherwise, how does a ByteString handle multibyte characters whenevaluating the lexicographic ordering?
http://hackage.haskell.org/package/utf8-string-0.3.5: "The utf8-string package provides operations for encoding UTF8 strings to Word8 lists and back, and for reading and writing UTF8 without truncation."
Alexey Romanov
+7  A: 

Better way to split the list to avoid the issue CesarB points out:

split []             = ([], [])
split [x]            = ([x], [])
split (x : y : rest) = (x : xs, y : ys)
                       where (xs, ys) = split rest

mergeSort []  = []
mergeSort [x] = [x]
mergeSort xs  = merge (mergesort ys) (mergesort zs)
                where (ys, zs) = split xs

EDIT: Fixed.

Alexey Romanov
@alexey_r: you have two errors in your code. one is that the "x : y : xs" pattern must come inside parentheses. the other is that the precedences of ":" and "$" make it that you give "split xs" to the function "x : fst"
yairchu
@alexey_r: I think your code currently calculates "split xs" twice. better use (ys, zs) = split xs. or splitxs = split xs. so there's only one invocation of split
yairchu
You are right, of course. Fixed.
Alexey Romanov
as a socialist I am bothered by the redundant dollar in "(mergesort $ ys)" :)
yairchu
Changed that as well (despite not being a socialist :))
Alexey Romanov
+5  A: 

Try this version:

mergesort :: [String] -> [String]
mergesort = mergesort' . map wrap

mergesort' :: [[String]] -> [String]
mergesort' [] = []
mergesort' [xs] = xs
mergesort' xss = mergesort' (merge_pairs xss)

merge_pairs :: [[String]] -> [[String]]
merge_pairs [] = []
merge_pairs [xs] = [xs]
merge_pairs (xs:ys:xss) = merge xs ys : merge_pairs xss

merge :: [String] -> [String] -> [String]
merge [] ys = ys
merge xs [] = xs
merge (x:xs) (y:ys)
 = if x > y
        then y : merge (x:xs)  ys
        else x : merge  xs    (y:ys)

wrap :: String -> [String]
wrap x = [x]
  1. Bad idea is splitting list first. Instead of it just make list of one member lists. Haslell is lazy, it will be done in right time.
  2. Then merge pairs of lists until you have only one list.

Edit: Someone who down-vote this answer: above merge sort implementation is same algorithm as used in ghc Data.List.sort except with cmp function removed. Well ghc authors are may be wrong :-/

Hynek -Pichi- Vychodil