How can you sort a long list of data (strings, floats etc) that is read from a large file (say a few million lines) using a Data.Vector.Generic.Mutable object and a sort algorithm from Data.Vector.Algorithms?
Here's how to do it in the general case.
First, you need a mutable vector. You can build this incrementally as you scan the file; allocate a vector that's about as big as you need, and increase the size and copy when you run out of space. Or you can read the entire file, count the record separators, and allocate the right amount of space all at once. This is easier but probably not acceptable In Real Life. (The expand-as-needed strategy is pretty common; if you ever use a language like Perl and push lines you read from a file onto the end of an array, this is what's happening. Perl allocates some space for an array, when you fill it, it increases the amount of space, allocates new space, and copies.)
Anyway, I'm too lazy to do this, so I am just going to create a vector with some random numbers in it.
We need a bunch of libraries for this:
import Control.Monad
import System.Random
import qualified Data.Vector as IV
import qualified Data.Vector.Mutable as MV
import qualified Data.Vector.Generic as V
import qualified Data.Vector.Algorithms.Intro as VA
We don't need this all at once, but we'll need it eventually, so I thought I'd get it out of the way.
Anyway, our mutable vector is going to be a "normal" mutable vector,
here MV.MVector
.
The idea of a mutable vector is that you create it and modify it in a
number of steps. In Haskell, there are a few ways to make this look
pure to the calling code; one is to do it all inside the ST
monad.
Your ST action is creating the vector, modifying it, and "freezing" it
into an immutable vector. Internally, you are using fast
modify-this-memory-location-a-bunch-of-times operations, but
externally, you have something that is pure. (Read the paper about
ST
if you want an argument as to why this is safe.)
Another way to deal with mutable data is to just do it inside the
Everything, erm, IO
, monad. That's what we're going to do here, as
it's most convenient.
(Data.Vector.Mutable
has two predefined vector types for you,
IOVector
and STVector
. We're using IOVector
, which puts all
the vector operations into IO
.)
So like 8 paragraphs ago, we were going to create a mutable vector to sort. And here we are:
randVector :: IO (MV.IOVector Int) randVector = do v <- MV.new 10 forM [0..9] $ \x -> do r <- randomIO :: IO Int MV.write v x r return v
This is an IO action that returns a new mutable vector with 10 random numbers inside it. (Random number generation can also be conveniently lifted into the IO monad, so we did that too, for convenience! It's like we're writing C, but with nicer syntax and more type safety.)
That's actually the hard part. To do the sorting, I imported
Data.Vector.Algorithms.Intro
which is basically an in-place
quicksort. A function called sort
does the actual sorting (in
whichever monad the mutable vector is in).
An action that creates the random mutable vector and sorts it in place looks like:
sort = VA.sort =<< randVector
Now, to print that out, all we need to do is "freeze" the vector into
an immutable vector, and print out the .toList
. Or you can just
iterate over every element and print it.
Here's what I came up with as an example:
main = do
v <- randVector
VA.sort v
iv <- V.unsafeFreeze v :: IO (IV.Vector Int)
print . V.toList $ iv
V.unsafeFreeze
is from Data.Vector.Generic
(how you interact with
all vector types with the same API), as is V.toList
.
Anyway, it's worth noting that IO is purely for convenience here.
Since you're building a vector from file data, it's appropriate. 99%
of the time, though, you should use ST
. In your ST action, create
the vector, sort it, freeze it, and return the frozen version.
(I have to go, so someone else feel free to add a ST example. If nobody does, I will do it later tonight.)