views:

451

answers:

3

Hi All,

I'm learning Clojure and would like some advice on idiomatic usage. As part of a small statistics package, I have a function to calculate the mode of a set of data. (Background: The mode is the most common value in a set of data. There are almost a dozen published algorithms to calculate it. The one used here is from "Fundamentals of Biostatistics" 6th Ed by Bernard Rosner.)

(defn tally-map
 " Create a map where the keys are all of the unique elements in the input
   sequence and the values represent the number of times those elements
   occur. Note that the keys may not be formatted as conventional Clojure
   keys, i.e. a colon preceding a symbol."
  [aseq]
  (apply merge-with + (map (fn [x] {x 1}) aseq)))

(defn mode
 " Calculate the mode. Rosner p. 13. The mode is problematic in that it may
   not be unique and may not exist at all for a particular group of data.
   If there is a single unique mode, it is returned. If there are multiple
   modes, they are returned as a list. If there is no mode, that is all
   elements are present in equal frequency, nil is returned."
  [aseq]
  (let [amap (tally-map aseq)
        mx (apply max (vals amap))
        k (keys amap)
        f (fn [x] (not (nil? x)))
        modes (filter f (map #(if (= mx (get amap %)) %) k))
        ]
    (cond (= 1 (count modes)) (first modes)
      (every? #(= mx %) (vals amap)) nil
      :else modes)
    )
  )

There are a couple of things I have questions about:

  1. The argument. The function accepts a single sequence. Is it more idiomatic to accept a variable number of arguments like the addition function?
  2. Code smell. It seems like the "let" is a bit more complicated than it should be -- so many variable assignments. Have I missed any obvious (or not so obvious) uses of the language or library that would make this method more concise?

Thanks in advance for the help.

+2  A: 

Looks fine to me. I'd replace the

f (fn [x] (not (nil? x)))
mode (filter f (map #(if (= mx (get amap %)) %) k))

with

mode (remove nil? (map #(if (= mx (get amap %)) %) k))

(I don't know why something like not-nil? isn't in clojure.core; it's something one needs every day.)

If there is a single unique mode, it is returned. If there are multiple modes, they are returned as a list. If there is no mode, that is all elements are present in equal frequency, nil is returned."

You could think about simply returning a seq every time (one element or empty is fine); otherwise, the cases have to be differentiated by the calling code. By always returning a seq, your result will magically work as an argument to other functions that expect a seq.

pmf
Thanks for the suggestion. The way the return values were set up was nonsense. It was a short-lived, futile hope of using the function in the same way that I used mean and median, which return a single value.
clartaq
+3  A: 

Here's my take:

  1. There are many core clojure functions that take sequences as arguments while others take multiple arguments, so there is no real idiomatic way in my opinion. If you already have your data in a sequence, I would use a seq as argument, since it will save you a call to apply.

  2. I wouldn't write a function that returns a value in some cases and a list of values in others, because the calling code will always have to check the return value before using it. Instead I would return a single mode as a seq with just one item in it. But you may have your reasons, depending on the code that calls this function.

Apart from that I would rewrite the mode function like this:

(defn mode [aseq]
  (let [amap (tally-map aseq)
        mx (apply max (vals amap))
        modes (map key (filter #(= mx (val %)) amap))
        c (count modes)]
    (cond
      (= c 1) (first modes)
      (= c (count amap)) nil
      :default modes)))

Instead of defining a function f you could use the identity function (unless your data contains values that are logically false). But you don't even need that. I find the modes in a different way, which is more readable to me: The map amap acts as a sequence of map entries (key-value pairs). First I filter only those entries that have the value mx. Then I map the key function on these, giving me a sequence of keys.

To check whether there are any modes I don't loop over the map again. Instead I just compare the number of modes to the number of map entries. If they are equal, all elements have the same frequency!

Here's the function that always returns a seq:

(defn modes [aseq]
  (let [amap (tally-map aseq)
        mx (apply max (vals amap))
        modes (map key (filter #(= mx (val %)) amap))]
    (when (< (count modes) (count amap)) modes)))
Christian Berg
"The function f you define is really the identity function (since nil is logically false)."No, far from it. Compare the results of (map identity [true false nil 1]) and (map #(not (nil? %1)) [true false nil 1]).
pmf
You're right of course, it's not the same function. I meant to say that he could use the identity function in it's place in this example. I'll correct that.
Christian Berg
Thanks for the analysis and suggestion. That is just the shift in perspective I was looking for.
clartaq
+2  A: 

In my opinion, mapping some function over a collection and then immediately condensing the list down to one item is a sign to use reduce.

(defn tally-map [coll]
  (reduce (fn [h n]
            (assoc h n (inc (h n 0))))
          {} coll))

In this case I'd write the mode fn to take a single collection as an argument, as you did. The only reason I can think of to use multiple arguments for a function like this is if you plan to have to type literal arguments a lot.

So if e.g. this is for an interactive REPL script and you're often going to be typing (mode [1 2 1 2 3]) literally, then you should have the function take multiple arguments, to save you from typing the extra [] in the function call all the time. If you plan to read lots of numbers from a file and then take the mode of those numbers, then have the function take a single argument that is a collection so you can save yourself from using apply all the time. I'm guessing your most common use case is the latter. I believe apply also adds overhead that you avoid when you have a function call that takes a collection argument.

I agree with others that you should have mode return a list of results even if there's only one; it'll make your life easier. Maybe rename it modes while you're at it.

Brian Carper
I took your advice and renamed my second function to modes. :-)
Christian Berg
(inc (or (h n) 0)) is the same as (inc (h n 0)):)
Timothy Pratley
Oh, right, I always forget that default value option. Thanks.
Brian Carper
You're right, the most common use case is to pull in piles of numbers and pop out a result. Thanks for the suggestion on improving tally-map. I use that all the time and any improvement is appreciated.
clartaq