How does one process large binary data files in Clojure? Let's assume the files are about 50MB: small enough to be processed in memory, but not with a naive implementation.
The following code correctly removes ^M (carriage return, byte 13) from small files, but it throws an OutOfMemoryError for larger files (around 6MB):
(defn read-bin-file [file]
  (to-byte-array (as-file file)))

(defn remove-cr-from-file [file]
  (let [dirty-bytes (read-bin-file file)
        clean-bytes (filter #(not (= 13 %)) dirty-bytes)
        changed? (< (count clean-bytes) (alength dirty-bytes))] ; OutOfMemoryError
    (if changed?
      (write-bin-file file clean-bytes)))) ; writing works fine
It seems that Java byte arrays can't be treated as a seq, because doing so is extremely inefficient.
On the other hand, solutions with aset, aget and areduce are bloated, ugly and imperative, because you can't really use the Clojure sequence library.
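To make the comparison concrete, here is a minimal sketch of the array-based style I mean. The name `remove-cr` is made up for illustration and is not part of the code above; it assumes the whole file is already in a byte array and only shows the loop with aget and aset:

```clojure
;; Sketch only: strip carriage returns (byte 13) from a byte array
;; without going through a seq. `remove-cr` is a hypothetical helper.
(defn remove-cr [^bytes src]
  (let [n (alength src)
        ^bytes dst (byte-array n)]          ; worst case: nothing is removed
    (loop [i 0, j 0]
      (if (< i n)
        (let [b (aget src i)]
          (if (== b 13)
            (recur (inc i) j)               ; skip the CR byte
            (do (aset dst j b)              ; copy every other byte
                (recur (inc i) (inc j)))))
        ;; trim the destination to the number of bytes actually kept
        (java.util.Arrays/copyOf dst (int j))))))
```

It avoids boxing each byte into a seq element, but it reads like Java with parentheses, which is exactly what I would like to avoid.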
What am I missing? How does one process large binary data files in Clojure?