Starting with a collection of strings like:

(def str-coll ["abcd" "efgh" "jklm"])

The goal is to pull a fixed number of characters at a time from the head of the string collection, producing partitioned groups of characters. This is the desired behavior:

(use '[clojure.contrib.str-utils2 :only (join)])
(partition-all 3 (join "" str-coll))

((\a \b \c) (\d \e \f) (\g \h \j) (\k \l \m))

However, using join forces evaluation of the entire collection, which causes memory issues when dealing with very large collections of strings. My specific use case is generating subsets of strings from a lazy collection generated by parsing a large file of delimited records:

;; reader is assumed here to come from clojure.java.io
(use '[clojure.java.io :only (reader)])
(defn file-coll [in-file]
  (->> (line-seq (reader in-file))
    (partition-by #(.startsWith ^String % ">"))
    (partition 2)))

and is building on work from this previous question. I've tried combinations of reduce, partition and join but can't come up with the right incantation to pull characters from the head of the first string and lazily evaluate subsequent strings as needed. Thanks much for any ideas or pointers.

+5  A: 

Not quite sure what you're going for, but the following does what your first example does, and does so lazily.

Step-by-step for clarity:

user=> (def str-coll ["abcd" "efgh" "jklm"])
#'user/str-coll
user=> (map seq str-coll)
((\a \b \c \d) (\e \f \g \h) (\j \k \l \m))
user=> (flatten *1)
(\a \b \c \d \e \f \g \h \j \k \l \m)
user=> (partition 3 *1)
((\a \b \c) (\d \e \f) (\g \h \j) (\k \l \m))

All together now:

(->> str-coll 
  (map seq)
  flatten
  (partition 3))
Alex Taggart
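One detail worth noting about the pipeline above: it uses partition, while the question used partition-all. The two agree on the 12-character example, but differ when the total length is not a multiple of the group size. A minimal sketch, where uneven-coll, dropped and kept are names made up for illustration:

```clojure
;; Illustrative input: 10 characters in total, so the last group is short.
(def uneven-coll ["abcd" "efgh" "jk"])

;; partition silently drops the incomplete trailing group.
(def dropped (->> uneven-coll (map seq) flatten (partition 3)))
;; => ((\a \b \c) (\d \e \f) (\g \h \j))

;; partition-all keeps it.
(def kept (->> uneven-coll (map seq) flatten (partition-all 3)))
;; => ((\a \b \c) (\d \e \f) (\g \h \j) (\k))
```

If the trailing characters matter, as they probably do when chunking records from a file, partition-all is likely the safer choice.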
No need to flatten, just concat the character sequences by using mapcat: (partition-all 3 (mapcat seq str-coll))
Jürgen Hötzel
ataggart and Jürgen, thanks much for the solutions: mapping to a seq was exactly what I was missing. Getting over that hurdle led me to realize that partition-by wasn't acting as lazily as I'd hoped. While each partition is provided in a lazy manner, the individual components of each partition are not; so partitioning the initial file at delimiters does not provide the desired lazy strings that feed into this.
Brad Chapman
@Jürgen: mapcat isn't lazy (it uses apply), hence why I didn't use it.
Alex Taggart
@ataggart: Nope, it is! Just check: (type (mapcat seq str-coll)) Why should apply prevent laziness?
Jürgen Hötzel
@Jürgen: mapcat returns a lazy seq, but how that comes into existence isn't fully lazy. See my additional "answer" for more info.
Alex Taggart
Jürgen Hötzel
@Jürgen: concat is irrelevant since the problem is the non-laziness of apply immediately prior to invoking concat. I urge you to read the more detailed answer I provided, and/or look at the implementation of apply.
Alex Taggart
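One way to look at the disagreement empirically is to count how many strings are actually converted when only the first group is requested. This is a rough sketch, not a definitive measurement: counting-seq and calls are hypothetical names introduced here, and the exact count may vary with the Clojure version and with whether the source sequence is chunked.

```clojure
;; Hypothetical helper: behaves like seq, but counts how often it is called.
(def calls (atom 0))

(defn counting-seq [s]
  (swap! calls inc)
  (seq s))

;; An infinite, unchunked source of strings; only the first group is requested.
(def first-group
  (first (partition-all 3 (mapcat counting-seq (repeat "abcd")))))

;; first-group is (\a \b \c). @calls stays small, though it can be larger
;; than the single string strictly needed, since apply looks at the first
;; few arguments before concat is invoked.
(println first-group @calls)
```

If the count stays small on an infinite input, then both observations in this thread can hold at once: a few leading elements are realized eagerly, while the rest of the sequence is consumed lazily.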
@Jürgen: Since concat works on strings, seq is unnecessary: `(partition-all 3 (apply concat str-coll))` and `(partition-all 3 (mapcat identity str-coll))` both work too :)
Rafał Dowgird
+1  A: 
Alex Taggart
There is no "get all those n lazy seqs". concat is invoked with a "lazy argument list". You can check this in practice with our example by setting the string collection to an infinite lazy list: (def str-coll (repeat "abcd")) and then taking just part of the result: (take 10 (partition-all 3 (mapcat seq str-coll)))
Jürgen Hötzel
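For reference, the infinite-sequence check described in the comment above can be run as a small sketch like the following. The names endless and first-four are made up here so that str-coll is not redefined, and only four groups are taken to keep the output short.

```clojure
;; An infinite lazy sequence of the same string.
(def endless (repeat "abcd"))

;; Taking a finite prefix terminates, so the whole input is not realized.
(def first-four (take 4 (partition-all 3 (mapcat seq endless))))
;; => ((\a \b \c) (\d \a \b) (\c \d \a) (\b \c \d))
```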
Please see the edit at the bottom of my post for more proof of my claim.
Alex Taggart
Alex Taggart
Though your point regarding apply working with an infinite series has me baffled.
Alex Taggart