views:

185

answers:

2

I'm writing a Clojure implementation of this coding challenge, attempting to find the average length of sequence records in Fasta format:

>1
GATCGA
GTC
>2
GCA
>3
AAAAA

For more background see this related StackOverflow post about an Erlang solution.

My beginner Clojure attempt uses lazy-seq to attempt to read in the file one record at a time so it will scale to large files. However it is fairly memory hungry and slow, so I suspect that it's not implemented optimally. Here is a solution using the BioJava library to abstract out the parsing of the records:

(import '(org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.contrib.duck-streams :only (reader)])

(defn seq-lengths [seq-iter]
  "Produce a lazy collection of sequence lengths given a BioJava StreamReader"
  (lazy-seq
    (if (.hasNext seq-iter)
      (cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))

(defn fasta-to-lengths [in-file seq-type]
  "Use BioJava to read a Fasta input file as a StreamReader of sequences"
  (seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths *command-line-args*))))

and an equivalent approach without external libraries:

(use '[clojure.contrib.duck-streams :only (read-lines)])

(defn seq-lengths [lines cur-length]
  "Retrieve lengths of sequences in the file using line lengths"
  (lazy-seq
    (let [cur-line (first lines)
          remain-lines (rest lines)]
      (if (= nil cur-line) [cur-length]
        (if (= \> (first cur-line))
          (cons cur-length (seq-lengths remain-lines 0))
          (seq-lengths remain-lines (+ cur-length (.length cur-line))))))))

(defn fasta-to-lengths-bland [in-file seq-type]
  ; pop off first item since it will be everything up to the first >
  (rest (seq-lengths (read-lines in-file) 0)))

(defn average [coll]
  (/ (reduce + coll) (count coll)))

(when *command-line-args*
  (println
    (average (apply fasta-to-lengths-bland *command-line-args*))))

The current implementation takes 44 seconds on a large file compared to 7 seconds for a Python implementation. Can you offer any suggestions on speeding the code up and making it more intuitive? Is the usage of lazy-seq correctly parsing the file record by record as intended?

+1  A: 

Your average function is non-lazy -- it needs to realise the entire coll argument while holding onto its head. Update: Just realised that my original answer included a nonsensical suggestion as to how to solve the above problem... argh. Fortunately ataggart has since posted a correct solution.

Other than that, your code does seem lazy at first glance, though the use of read-lines is currently discouraged (use line-seq instead).

If the file is really large and your functions will be called a large number of times, type-hinting seq-iter in the argument vector of seq-length -- ^NameOfBiojavaSeqIterClass seq-iter, use #^ in place of ^ if you're on Clojure 1.1 -- might make a significant difference. In fact, (set! *warn-on-reflection* true), then compile your code and add type hints to remove all reflection warnings.

Michał Marczyk
+2  A: 

It probably doesn't matter, but average is holding onto the head of the seq of lengths.
The following is a wholly untested, but lazier way to do what I think you want.

(use 'clojure.java.io) ;' since 1.2

(defn lazy-avg [coll]
  (let [f (fn [[v c] val] [(+ v c) (inc c)])
        [sum cnt] (reduce f [0 0] coll)]
    (if (zero? cnt) 0 (/ sum cnt)))

(defn fasta-avg [f]
  (->> (reader f) 
    line-seq
    (filter #(not (.startsWith % ">")))
    (map #(.length %))
    lazy-avg))
Alex Taggart
Actually I think this is quite likely to matter with large datasets... I even posted an answer pointing this out previously, but then suggested a ridiculous way of dealing with the problem in a momentary lapse of reason -- +1 for the correct solution to this one.
Michał Marczyk
I think groups lines between the >s are to be considered as single records, though; something like `(partition-by #(.startsWith ^String % ">"))` might help. The general idea remains the same.
Michał Marczyk
ataggart and Michal, many thanks for these pointers. With them, here is a cleaner version that finishes in 1/4 the time: http://gist.github.com/485853. This is about 2x slower than my Python version, including JVM start up time from the commandline. I'm learning a lot from this exercise; if there are other apparent areas for improvement please let me know and I can iterate over another version.
Brad Chapman
How is `lazy-avg` lazy and `average` not? Both of them evaluate the whole collection passed to them when they are called.
abhin4v
The difference is that average calls count which runs down the entire seq, thus the entire seq needs to be in memory at once. lazy-avg carries the cumulative size with it on each iteration, thus after each step the previous elements can be garbage collected.
Alex Taggart