views:

170

answers:

2

Hi

I wrote a Haskell code which has to solve the following problem : we have n files : f1, f2, f3 .... fn and I cut those files such a way that each slice has 100 lines

  f1_1, f1_2, f1_3 .... f1_m

  f2_1, f2_2, .... f2_n
  ...

  fn_1, fn_2, .... fn_k

finally I construct a special data type (Dags) using slices in the following way

  f1_1, f2_1, f3_1, .... fn_1 => Dag1

  f1_2, f2_2, f3_2, ..... fn_2 => Dag2

  ....

  f1_k, f2_k, f3_k, ..... fn_k => Dagk

the code that I wrote start by cutting all the files, then it couple the i-th elements of the results list and construct Dag using the final result list

it looks like this

  -- # take a filename and cut the file in slices of 100 lines

  sliceFile :: FilePath -> [[String]]

  -- # take a list of lists and group the i-th elements into list

  coupleIthElement :: [[String]] -> [[String]]

  -- # take a list of lines and create a DAG

  makeDags :: [String] ->  Dag

  -- # final code look like this

  makeDag_ :: [FilePath] -> [Dag]

  makeDags files = map makeDags $ coupleIthElement (concat (map sliceFile files))

The problem is that this code is non-efficient because :

  • it needs storing all the files in memory in list form

  • the garbage collector is not working efficiently since all fonctions need the results list of the previous fonction

How could I re-write my program to take advantage of garbage collector work and Lazyness of Haskell ?

if not possible or easier, what can i do to be more efficient even a bit ?

thanks for reply


edit

coupleIthElement ["abc", "123", "xyz"] must return ["a1x","b2y","c3z"]

of cause the 100 lines are arbitrary selected using a particular criteria upon some element of the lines but i discard this aspect to make the problem more easier to understand,

another edition

data Dag = Dag ([(Int, String)], [((Int, Int), Int)]) deriving Show

test_dag = Dag ([(1, "a"),(2, "b"),(3, "c")],[((1,2),1),((1,3),1)])

test_dag2 = Dag ([],[])

the first list is each vertice define by the number and the label, the second list is the edges ((1,2),3) means edge between vertice 1 and 2 with the cost 3

+1  A: 

If it's not needed to process your file in slices, avoid this. Haskell does this automatically! In Haskell, you think of IO as a stream. Data is read from input, as soon as it's needed and discarded, as soon as it's unused. So for instance, this is an easy file-copying programm:

main = interact id

interact has the signature interact :: (String -> String) -> IO (), and feeds the input into a function which handles it and produces some output, which is written to stdout. This program is more efficient then most C-implementations, as the runtime automatically buffers the input and output.

If you want to understand laziness, you have to forget all the wisdom you learned as a imperative programmer and have to think about a program as a description to modify data, not as a set of instructions - data is only processed when needed!

The key point, why your data may be handled the wrong way is the multiple traversion of the list. Your function makeDags traverses the transposed the slices list one by one, so the elements of the original list may not be discarded. What you should try, is to write your function in a way like this:

sliceFile :: FilePath -> [[String]]
sliceFile fp = do
  f <- readFile fp
  let l = lines fp
      slice [] = []
      slice x  = ll : slice ls where (ll,ls) = splitAt 100 x
  return slice l

sliceFirstRow :: [[String]] -> ([String],[[String]])
sliceFirstRow list = unzip $ map (\(x:xs) -> (x,xs)) list

makeDags :: [[String]] -> [Dag]
makeDags [[]] = []
makeDags list = makeDag firstRow : makeDags restOfList where
 (firstRow,restOfList) = sliceFirstRow list

This function may be a solution, since the first row is no longer referenced, when it's done. But in the most places, this is a result of laziness, so you could probably try to use seq to force building the Dags and allowing the IO data to be garbage-collected. (If you don't force building the dags, the data can't be garbage collected).

But anyway, I could provide a more helpfull answer, if you give some informations about what these dags are.

FUZxxl
Pretty good answer, but the information about ByteStrings is incorrect. `Data.ByteString.Lazy` is for binary data, and operations act on `Word8`, the Haskell type for an 8-bit word. `Data.ByteString.Lazy.Char8` is for ASCII text, operations act on (a subset of) `Char`s. Both take the same memory for storage.
John
Indeed. @FUZxxl, can you stop answering ByteString questions until you actually use them? This is like the 10th answer I've seen of yours that is incorrect. All strict bytestrings are 8-bit word vectors. Same exact underlying data, same Haskell type. The functions in .Char8, though, simply convert to and from Char as necessary. (For example, `pack` packs Chars instead of Word8s, and `map` applies a Char to the mapping function instead of applying a Word8.) You can use the .Char8 and normal functions on the same data, the functions are just *views* onto the same underlying data.
jrockway
thanks FUZxxl for your code it really helps me. but how are you dealing with many files, I've try too make it walk over many file without any succes ?
Leonzo Constantini
@Leonzo Constantini: It would be very interesting to know, how you're dealing with the dags. What is there use? Why do you need slices of 100 lines? I can provide you further information then.
FUZxxl
+1  A: 

A few points:

1) Have you considered using fgl? It's probably more efficient than your own Dag implementation. If you really need to use Dag, you could construct your graphs with fgl then convert them to Dag when they're complete.

2) It seems like you don't actually use the slices when constructing your graphs, rather they control how many graphs you have. If so, how about something like this:

dagFromHandles :: [Handle] -> IO Dag
dagFromHandles = fmap makeDags . mapM hGetLine

allDags :: [FilePath] -> IO [Dag]
allDags listOfFiles = do
  handles <- mapM (flip openFile ReadMode) listOfFiles
  replicateM 100 (dagFromHandles handles)

This assumes that each file has at least 100 lines, and any extra lines will be ignored. Even better would be if you had a function that would consume a Dag, then you could do

useDag :: Dag -> IO ()

runDags :: [FilePath] -> IO ()
runDags listOfFiles = do
  handles <- mapM (flip openFile ReadMode) listOfFiles
  replicateM_ 100 (dagFromHandles handles >>= useDag)

This should make more efficient use of garbage collection.

Of course this assumes that I understand the problem properly, and I'm not certain that I do. Note that concat (map sliceFile) should be a no-op (sliceFile would need to be in IO as you've defined the type, but ignoring that for now), so I don't see why you're bothering with it at all.

John