First, if you're doing performance comparisons for numeric code, lists aren't the best choice. Try a package like the vector package for fast arrays.
Note also that you can do even better in Haskell, thanks to loop fusion. If the create function is written as an enumeration, the compiler can combine the create step and the fold loop into a single loop that allocates no intermediate data structures. The ability to do general fusion like this is unique to GHC Haskell.
I'll use the vector library (stream-based loop fusion):
import qualified Data.Vector as V
test = V.foldl (\ a b -> a + b * sqrt b) 0
create n = (V.enumFromTo 1 n)
main = print (test (create 1000000))
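As a variation on the program above, here is a minimal sketch with explicit type signatures, a strict fold, and unboxed vectors. Swapping in Data.Vector.Unboxed and foldl' is my assumption about what you would typically reach for with Doubles; it is not the code that was measured in this answer, and whether it compiles to the same loop depends on your GHC and vector versions.

```haskell
module Main (main) where

import qualified Data.Vector.Unboxed as U

-- Enumerate 1..n as an unboxed vector of Doubles.
create :: Double -> U.Vector Double
create n = U.enumFromTo 1 n

-- Strict left fold: sum of b * sqrt b over the vector.
test :: U.Vector Double -> Double
test = U.foldl' (\a b -> a + b * sqrt b) 0

main :: IO ()
main = print (test (create 1000000))
```

The signatures pin everything to Double, so nothing relies on type defaulting.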
With your original code, by contrast, the compiler is unable to remove all the lists, and we end up with an inner loop like:
$wlgo :: Double# -> [Double] -> Double#
$wlgo =
\ (ww_sww :: Double#) (w_swy :: [Double]) ->
case w_swy of _ {
[] -> ww_sww;
: x_aoY xs_aoZ ->
case x_aoY of _ { D# x1_aql ->
$wlgo
(+##
ww_sww (*## x1_aql (sqrtDouble# x1_aql)))
xs_aoZ
}
}
$wcreate :: Double# -> [Double]
$wcreate =
\ (ww_swp :: Double#) ->
case ==## ww_swp 0.0 of _ {
False ->
:
@ Double
(D# ww_swp)
($wcreate (-## ww_swp 1.0));
True -> [] @ Double
}
Note how there are two loops: create generating a (lazy) list, and the fold consuming it. Thanks to laziness, that list is cheap, so it runs in a respectable:
$ time ./C
4.000004999999896e14
./C 0.06s user 0.00s system 98% cpu 0.058 total
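For reference, reading that Core back into source gives something like the following. This is a reconstruction from the Core dump, not your actual code, so the details (in particular the use of foldl' and the exact shape of create) are assumptions.

```haskell
module Main (main) where

import Data.List (foldl')

-- Counts down from n to 1, consing each value onto a lazy list
-- (matches the ==## 0.0 test and the -## 1.0 step in $wcreate).
create :: Double -> [Double]
create 0 = []
create n = n : create (n - 1)

-- The consuming loop: corresponds to $wlgo in the Core.
test :: [Double] -> Double
test = foldl' (\a b -> a + b * sqrt b) 0

main :: IO ()
main = print (test (create 1000000))
```

The two top-level functions map directly onto the two loops in the Core: one builds list cells, the other pattern-matches them away again.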
Under fusion, however, we instead get a single loop:
main_$s$wfoldlM_loop :: Double# -> Double# -> Double#
main_$s$wfoldlM_loop =
\ (sc_sYc :: Double#) (sc1_sYd :: Double#) ->
case <=## sc_sYc 1000000.5 of _ {
False -> sc1_sYd;
True ->
main_$s$wfoldlM_loop
(+## sc_sYc 1.0)
(+##
sc1_sYd (*## sc_sYc (sqrtDouble# sc_sYc)))
}
GHC reduced our create and test steps into a single loop with no lists used. Just 2 doubles in registers.
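To make the shape of that loop concrete, here is roughly what you would have to write by hand to get the same thing: a counter and an accumulator, and nothing else. It is only a sketch; sumLoop and go are names I made up, and the termination test is simplified (the Core compares against 1000000.5, which is equivalent for whole-number steps).

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

-- Hand-written analogue of the fused loop. `sumLoop` is a
-- hypothetical name, not something GHC generates.
sumLoop :: Double -> Double
sumLoop n = go 1 0
  where
    -- i is the counter (sc_sYc), acc the running sum (sc1_sYd).
    go :: Double -> Double -> Double
    go !i !acc
      | i > n     = acc
      | otherwise = go (i + 1) (acc + i * sqrt i)

main :: IO ()
main = print (sumLoop 1000000)
```

The point of fusion is that you don't have to write it this way: the create/test version stays compositional and the compiler derives the loop for you.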
And with half as many loops, the fused version runs about 1.5x as fast:
$ ghc D.hs -Odph -fvia-C -optc-O3 -optc-march=native -fexcess-precision --make
$ time ./D
4.000005000001039e14
./D 0.04s user 0.00s system 95% cpu 0.038 total
This is a nice example of the power that guarantees of purity provide -- the compiler can be very aggressive at reordering your code.