answers:

3

I realize this may be a rather heretical question, but can I mmap a file of data via System.IO.Posix.MMap and then cast the resulting ByteString into a strict array of some other type? E.g., if I know that the file contains doubles, can I somehow get this mmapped data into a UArr Double so I can run sumU etc. on it and have the virtual memory system take care of the IO for me? This is essentially how I deal with multi-GB data sets in my C++ code. Alternative, more idiomatic ways to do this are also appreciated, thanks!

Supreme extra points for ways I can also do multicore processing on the data :-) Not that I'm demanding or anything.

+1  A: 

I'm afraid I don't know how to cast a ByteString to a UArr T, but I'd like to claim some "extra points" by suggesting you take a look at Data Parallel Haskell; from the problem you've described it could be right up your street.
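For the multicore part, here is a minimal sketch that does not use Data Parallel Haskell at all; it swaps in plain forkIO threads and MVars from base. It assumes the ByteString holds raw doubles in native byte order, and the names sumRange and parSumDoubles are made up for illustration. Each thread sums one index range by reading through the ByteString's pointer, so nothing is copied onto the Haskell heap.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable (peekElemOff, sizeOf)

-- Sum the doubles at indices [lo, hi) with a strict accumulator.
sumRange :: Ptr Double -> Int -> Int -> IO Double
sumRange q lo hi = go 0 lo
  where
    go acc i
      | i >= hi = return acc
      | otherwise = do
          x <- peekElemOff q i
          let acc' = acc + x
          acc' `seq` go acc' (i + 1)

-- Hypothetical helper: split the buffer into about k ranges and sum
-- each range in its own thread. Assumes native-endian doubles.
parSumDoubles :: Int -> B.ByteString -> IO Double
parSumDoubles k bs = unsafeUseAsCStringLen bs $ \(p, len) -> do
  let n      = len `div` sizeOf (undefined :: Double)
      q      = castPtr p :: Ptr Double
      k'     = max 1 k
      chunk  = max 1 ((n + k' - 1) `div` k')
      ranges = [ (lo, min n (lo + chunk)) | lo <- [0, chunk .. n - 1] ]
  vars <- mapM (\(lo, hi) -> do
                  v <- newEmptyMVar
                  _ <- forkIO (sumRange q lo hi >>= putMVar v)
                  return v)
               ranges
  -- Wait for every worker before leaving the callback, so the
  -- pointer stays valid for as long as any thread is reading it.
  fmap sum (mapM takeMVar vars)
```

The threads only run on several cores if the program is built with -threaded and started with +RTS -N. The sketch does no error handling: if a worker throws, the corresponding takeMVar never gets its value.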

Dave Tapley
Yes, DPH and other shiny Haskell toys are really rather appealing. Once I get more of a grip on the language I want to try it out on some of my larger problems (data sets of at least tens of GB).
billt
+3  A: 

I don't think it is safe to do this. A UArr lives in unpinned memory allocated on the Haskell heap, so the GC is free to move it. ByteStrings (including mmapped ones) are ForeignPtrs to pinned memory. The two are different kinds of object in the runtime system.

For this to be safe you will need to copy, since you're changing the underlying representation from a ForeignPtr to a Haskell value 'a'.
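A minimal sketch of what such a copy could look like, under a few assumptions: it targets UArray from the array package rather than UArr (the uvector API is not shown here), it assumes the bytes are doubles in native byte order, and copyDoubles and sumCopied are hypothetical names. The ByteString could come from an mmap call or from anywhere else.

```haskell
import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import Data.Array.IO (IOUArray, newArray_, writeArray)
import Data.Array.Unboxed (UArray, elems)
import Data.Array.Unsafe (unsafeFreeze)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable (peekElemOff, sizeOf)

-- Copy the doubles out of the (pinned, foreign) ByteString buffer
-- into a fresh unboxed array on the Haskell heap.
copyDoubles :: B.ByteString -> IO (UArray Int Double)
copyDoubles bs = unsafeUseAsCStringLen bs $ \(p, len) -> do
  let n = len `div` sizeOf (undefined :: Double)  -- trailing bytes are ignored
      q = castPtr p :: Ptr Double
  arr <- newArray_ (0, n - 1) :: IO (IOUArray Int Double)
  mapM_ (\i -> peekElemOff q i >>= writeArray arr i) [0 .. n - 1]
  -- Safe here because the mutable array is never touched again.
  unsafeFreeze arr

-- Example use: sum the copied elements.
sumCopied :: B.ByteString -> IO Double
sumCopied = fmap (sum . elems) . copyDoubles
```

The copy needs as much heap as the file is large, so for multi-GB inputs it would probably have to be done one slice at a time.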

Don Stewart
Thanks; I feared this would be the case. I've never had much luck manipulating large data sets once they need to be loaded into the GC'd space of any language. My current just mmap 'em approach usually works out ok. Will give the copying a go on some reduced data sets and see how things work out.
billt
A: 

You probably want Foreign.Marshal here, and especially Foreign.Marshal.Array. It was designed to do exactly this.
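As a rough sketch of that route, assuming the ByteString holds doubles in native byte order: the buffer's pointer can be cast to Ptr Double and read element by element. The example below uses peekElemOff from Foreign.Storable instead of peekArray from Foreign.Marshal.Array, because peekArray would build a Haskell list of the whole buffer first. The name sumDoublesInPlace is made up for illustration.

```haskell
import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable (peekElemOff, sizeOf)

-- Sum the doubles stored in a ByteString without copying them.
-- The pointer is only valid inside the callback, so all reads
-- happen there.
sumDoublesInPlace :: B.ByteString -> IO Double
sumDoublesInPlace bs = unsafeUseAsCStringLen bs $ \(p, len) -> do
  let n = len `div` sizeOf (undefined :: Double)  -- trailing bytes are ignored
      q = castPtr p :: Ptr Double
      go acc i
        | i >= n = return acc
        | otherwise = do
            x <- peekElemOff q i
            let acc' = acc + x
            acc' `seq` go acc' (i + 1)  -- keep the accumulator strict
  go 0 0
```

With an mmapped ByteString the reads go straight to the mapped pages, so the virtual memory system does the IO, much as in the C++ approach described in the question.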

Paul Johnson