Hi

I have a file in which each line looks like index : label. The index values are keys in the range 0 ... 100000000, and the label can be any String value. I want to split this file, which is 110 MB, into slices of 100 lines each and perform some computation on each slice. How can I do this?

123 : "acgbdv"

127 : "ytehdh"

129 : "yhdhgdt"

...

9898657 : "bdggdggd"

A:

If you're using String IO, you can do the following:

import System.IO
import Control.Monad

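-- | Process one slice of (up to) 100 lines.
-- MyData stands for whatever result type your computation produces.
process100 :: [String] -> MyData
process100 = undefined -- placeholder: whatever this function does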

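-- | Lazily split the input into slices of 100 lines. Each slice is only
-- processed when its result is demanded, so a consumer can stream results
-- instead of holding every slice in memory at once.
loop :: [String] -> [MyData]
loop []  = []
loop lns = let (this, next) = splitAt 100 lns in process100 this : loop next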

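-- readFile is used here rather than withFile/hGetContents: hGetContents is
-- lazy, and withFile would close the handle before the contents are consumed.
processFile :: FilePath -> IO [MyData]
processFile f = fmap (loop . lines) (readFile f)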

Note that this function will silently process the last chunk even if it isn't exactly 100 lines.
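
To make the short-last-chunk behaviour concrete, here is a minimal sketch of the chunking step on its own. The names chunksOf, sliceSize and sliceSizes are illustrative only, and sliceSize is a hypothetical stand-in for process100 that just counts lines. (The split package also provides a Data.List.Split.chunksOf if you prefer not to write your own.)

```haskell
-- Split a list into consecutive chunks of at most n elements.
-- The final chunk is shorter when the length is not a multiple of n.
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | n <= 0    = error "chunksOf: chunk size must be positive"
  | null xs   = []
  | otherwise = let (this, rest) = splitAt n xs in this : chunksOf n rest

-- Hypothetical stand-in for process100: count the lines in a slice.
sliceSize :: [String] -> Int
sliceSize = length

-- Lazily map the stand-in over slices of 100 lines.
sliceSizes :: [String] -> [Int]
sliceSizes = map sliceSize . chunksOf 100
```

With 250 input lines, sliceSizes yields two slices of 100 and a final slice of 50. If a short final slice is an error for your computation, process100 has to check the length itself.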

Packages like bytestring and text generally provide functions like lines and hGetContents so you should be able to easily adapt this function to any of them.
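
As a rough sketch of that adaptation, assuming lazy ByteStrings from the bytestring package: the names slices, sliceBytes and processFileBS are made up for this example, and sliceBytes (total bytes in a slice) is only a placeholder for the real per-slice work.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BL

-- Hypothetical stand-in for the per-slice work: total bytes in the slice
-- (newlines excluded, since BL.lines strips them).
sliceBytes :: [BL.ByteString] -> Int
sliceBytes = fromIntegral . sum . map BL.length

-- Lazily split a list into slices of at most n elements.
slices :: Int -> [a] -> [[a]]
slices n xs = case splitAt n xs of
  ([], _)      -> []
  (this, rest) -> this : slices n rest

-- Read the file lazily and process it 100 lines at a time.
processFileBS :: FilePath -> IO [Int]
processFileBS f = map sliceBytes . slices 100 . BL.lines <$> BL.readFile f
```

The structure is the same as the String version; only the lines and readFile functions come from a different module.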

It's important to know what you're doing with the results of processing each slice, because you don't want to hold on to that data for longer than necessary. Ideally, after each slice is calculated the data would be entirely consumed and could be gc'd. Generally either the separate results get combined into a single data structure (a "fold"), or each one is dealt with separately (maybe outputting a line to a file or something similar). If it's a fold, you should change "loop" to look like this:

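loopFold :: [String] -> MyData -- assuming there is a Monoid instance for MyData
loopFold lns = go mempty lns -- mempty is the Monoid identity (mzero belongs to MonadPlus)
  where
    go !acc []  = acc
    -- acc goes on the left so results are combined in file order
    go !acc rest = let (this, next) = splitAt 100 rest in go (acc `mappend` process100 this) next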

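The loopFold function uses bang patterns (enabled with the {-# LANGUAGE BangPatterns #-} pragma) to force evaluation of the accumulator at each step. A bang pattern only evaluates to weak head normal form (the outermost constructor), so depending on what MyData is, you may need to use deepseq to make sure it's fully evaluated.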
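
Here is one possible concrete version of the fold, as a sketch only. Stats, summarise and foldSlices are hypothetical names standing in for MyData, process100 and loopFold. The fields of Stats are declared strict, so forcing the constructor with a bang pattern also forces both counters and deepseq is not needed for this particular type.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hypothetical MyData: running totals with strict fields.
data Stats = Stats { sliceCount :: !Int, lineCount :: !Int }
  deriving (Eq, Show)

instance Semigroup Stats where
  Stats a b <> Stats c d = Stats (a + c) (b + d)

instance Monoid Stats where
  mempty = Stats 0 0

-- Hypothetical process100: summarise one slice.
summarise :: [String] -> Stats
summarise ls = Stats 1 (length ls)

-- Fold the per-slice results into one value, forcing the accumulator
-- at every step so no chain of thunks builds up.
foldSlices :: [String] -> Stats
foldSlices = go mempty
  where
    go !acc [] = acc
    go !acc ls = let (this, rest) = splitAt 100 ls
                 in go (acc <> summarise this) rest
```

For a type with lazy fields (a tuple or a Map of lists, say) the bang pattern alone would leave the contents unevaluated, which is where deepseq comes in.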

If instead you're writing each result to output, leave loop as it is and change processFile:

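processFileMapping :: FilePath -> IO ()
processFileMapping f = withFile f ReadMode pf
  where
    -- The results are printed inside withFile, so the handle is still open
    -- while the lazily-read contents are consumed. Requires Show MyData.
    pf = mapM_ print <=< fmap (loop . lines) . hGetContents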

If you're interested in enumerator/iteratee style processing, this is a pretty simple problem. I can't give a good example without knowing what sort of work process100 is doing, but it would involve enumLines and take.

Is it necessary to process exactly 100 lines at a time, or do you just want to process in chunks for efficiency? If it's the latter, don't worry about it. You'd most likely be better off processing one line at a time, using either an actual fold function or a function similar to processFileMapping.
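
For the one-line-at-a-time route, a sketch might look like the following. Everything here is an assumption for illustration: parseLine guesses at the format from the sample lines in the question (a number, a colon, a double-quoted label), and Summary with its two counters is a made-up computation, not something the question specifies.

```haskell
import Data.Char (isSpace)
import Data.List (foldl')

-- Parse one line of the form  123 : "acgbdv"  into (index, label).
-- Returns Nothing for lines that do not match, such as blank lines.
parseLine :: String -> Maybe (Int, String)
parseLine s =
  case reads s :: [(Int, String)] of
    [(ix, rest)] ->
      case dropWhile isSpace rest of
        ':' : lbl ->
          case reads lbl :: [(String, String)] of
            [(label, _)] -> Just (ix, label)
            _            -> Nothing
        _ -> Nothing
    _ -> Nothing

-- Hypothetical result: number of valid lines and the largest index seen.
-- Starting maxIndex at 0 assumes indices are non-negative, as in the question.
data Summary = Summary { validLines :: !Int, maxIndex :: !Int }
  deriving (Eq, Show)

step :: Summary -> String -> Summary
step acc l = case parseLine l of
  Just (ix, _) -> Summary (validLines acc + 1) (max ix (maxIndex acc))
  Nothing      -> acc

-- Strict left fold over the lines; each line can be collected once consumed.
summariseLines :: [String] -> Summary
summariseLines = foldl' step (Summary 0 0)

summariseFile :: FilePath -> IO Summary
summariseFile f = do
  contents <- readFile f
  return $! summariseLines (lines contents)
```

Because foldl' forces the accumulator at each step and Summary has strict fields, this should run in roughly constant memory regardless of file size.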

John