ansaurus

Question

Improving clojure lazy-seq usage for iterative text parsing

Answer 1

+1 A:

Your average function is non-lazy -- it needs to realise the entire coll argument while holding onto its head. Update: Just realised that my original answer included a nonsensical suggestion as to how to solve the above problem... argh. Fortunately ataggart has since posted a correct solution.

Other than that, your code does seem lazy at first glance, though the use of read-lines is currently discouraged (use line-seq instead).

If the file is really large and your functions will be called a large number of times, type-hinting seq-iter in the argument vector of seq-length -- ^NameOfBiojavaSeqIterClass seq-iter, use #^ in place of ^ if you're on Clojure 1.1 -- might make a significant difference. In fact, I'd evaluate (set! *warn-on-reflection* true), then compile your code and keep adding type hints until all reflection warnings are gone.
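
A minimal sketch of that workflow. String stands in for the Biojava iterator class, which isn't named here, and len-reflective / len-hinted are hypothetical names used only for illustration:

```clojure
;; Ask the compiler to print a warning for every interop call
;; it cannot resolve at compile time.
(set! *warn-on-reflection* true)

;; No hint: the compiler cannot tell which class `s` is, so it emits a
;; reflection warning and looks the method up at run time on every call.
(defn len-reflective [s]
  (.length s))

;; With a ^String hint the same call compiles to a direct method
;; invocation, and the warning disappears.
(defn len-hinted [^String s]
  (.length s))
```

Both functions return the same result; only the hinted one avoids the reflective lookup, which is what tends to matter inside a loop over a large file.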

Michał Marczyk 2010-07-21 21:23:18

Answer 2

+2 A:

It probably doesn't matter, but average is holding onto the head of the seq of lengths.
The following is wholly untested, but it is a lazier way to do what I think you want.

(use 'clojure.java.io) ; since 1.2

(defn lazy-avg [coll]
  ;; accumulate [sum count] in one pass, so earlier elements can be collected
  (let [f (fn [[v c] val] [(+ v val) (inc c)])
        [sum cnt] (reduce f [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt))))

(defn fasta-avg [f]
  (->> (reader f)
    line-seq
    ;; ^String hints avoid reflection on every line
    (filter #(not (.startsWith ^String % ">")))
    (map #(.length ^String %))
    lazy-avg))

Alex Taggart 2010-07-21 21:54:52

Actually I think this is quite likely to matter with large datasets... I even posted an answer pointing this out previously, but then suggested a ridiculous way of dealing with the problem in a momentary lapse of reason -- +1 for the correct solution to this one.

Michał Marczyk 2010-07-21 22:05:53

I think the groups of lines between the >s are meant to be treated as single records, though; something like `(partition-by #(.startsWith ^String % ">"))` might help. The general idea remains the same.
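
A rough sketch of that idea, assuming the input is already a seq of lines (e.g. from line-seq); header? and record-lengths are hypothetical names, not anything from the original code:

```clojure
(defn header?
  "True when a FASTA line is a record header."
  [^String line]
  (.startsWith line ">"))

(defn record-lengths
  "Returns a lazy seq with the total sequence length of each record,
  treating all lines between two headers as one record."
  [lines]
  (->> lines
       (partition-by header?)             ; alternating header / sequence groups
       (remove #(header? (first %)))      ; keep only the sequence groups
       (map #(reduce + (map count %)))))  ; total length of each record
```

For example, `(record-lengths [">a" "ACGT" "AC" ">b" "GG"])` yields `(6 2)`, and that seq can be passed straight to lazy-avg. Note that a header followed directly by another header (an empty record) is simply skipped in this sketch rather than counted as length 0.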

Michał Marczyk 2010-07-21 22:24:24

ataggart and Michal, many thanks for these pointers. With them, here is a cleaner version that finishes in 1/4 the time: http://gist.github.com/485853. This is about 2x slower than my Python version, including JVM startup time from the command line. I'm learning a lot from this exercise; if there are other apparent areas for improvement, please let me know and I can iterate on another version.

Brad Chapman 2010-07-22 11:47:15

How is `lazy-avg` lazy and `average` not? Both of them evaluate the whole collection passed to them when they are called.

abhin4v 2010-08-16 11:44:04

The difference is that average calls count, which runs down the entire seq, so the entire seq needs to be in memory at once. lazy-avg carries the running count with it on each iteration, so after each step the previous elements can be garbage collected.
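
To make the contrast concrete, here is a sketch. The body of average is my assumption about what the original looks like (a sum plus a call to count), and streaming-avg is just a hypothetical name for the single-pass approach:

```clojure
;; Assumed shape of the original: two traversals of the same seq.
;; `coll` is still needed by `count` while `reduce` is walking it, so
;; every realised element stays reachable until the sum is finished.
(defn average [coll]
  (/ (reduce + coll) (count coll)))

;; Single traversal: the sum and the count travel together in the
;; accumulator, so nothing needs the start of the seq once reduce
;; has moved past it.
(defn streaming-avg [coll]
  (let [[sum cnt] (reduce (fn [[s c] x] [(+ s x) (inc c)]) [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt))))
```

Both return the same value for a non-empty coll; the difference only shows up as memory pressure once the seq is too large to hold at once.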

Alex Taggart 2010-08-16 23:44:32
