I'm writing a Clojure implementation of this coding challenge, attempting to find the average length of sequence records in Fasta format:
>1
GATCGA
GTC
>2
GCA
>3
AAAAA
For more background see this related StackOverflow post about an Erlang solution.
My beginner Clojure attempt uses lazy-seq to attempt to read in the file one record at a time so it will scale to large files. However it is fairly memory hungry and slow, so I suspect that it's not implemented optimally. Here is a solution using the BioJava library to abstract out the parsing of the records:
(import '(org.biojava.bio.seq.io SeqIOTools))
(use '[clojure.contrib.duck-streams :only (reader)])
(defn seq-lengths [seq-iter]
"Produce a lazy collection of sequence lengths given a BioJava StreamReader"
(lazy-seq
(if (.hasNext seq-iter)
(cons (.length (.nextSequence seq-iter)) (seq-lengths seq-iter)))))
(defn fasta-to-lengths [in-file seq-type]
"Use BioJava to read a Fasta input file as a StreamReader of sequences"
(seq-lengths (SeqIOTools/fileToBiojava "fasta" seq-type (reader in-file))))
(defn average [coll]
(/ (reduce + coll) (count coll)))
(when *command-line-args*
(println
(average (apply fasta-to-lengths *command-line-args*))))
and an equivalent approach without external libraries:
(use '[clojure.contrib.duck-streams :only (read-lines)])
(defn seq-lengths [lines cur-length]
"Retrieve lengths of sequences in the file using line lengths"
(lazy-seq
(let [cur-line (first lines)
remain-lines (rest lines)]
(if (= nil cur-line) [cur-length]
(if (= \> (first cur-line))
(cons cur-length (seq-lengths remain-lines 0))
(seq-lengths remain-lines (+ cur-length (.length cur-line))))))))
(defn fasta-to-lengths-bland [in-file seq-type]
; pop off first item since it will be everything up to the first >
(rest (seq-lengths (read-lines in-file) 0)))
(defn average [coll]
(/ (reduce + coll) (count coll)))
(when *command-line-args*
(println
(average (apply fasta-to-lengths-bland *command-line-args*))))
The current implementation takes 44 seconds on a large file compared to 7 seconds for a Python implementation. Can you offer any suggestions on speeding the code up and making it more intuitive? Is the usage of lazy-seq correctly parsing the file record by record as intended?