views:

217

answers:

5

I need to very efficiently compare two maps in Clojure/Java, and return the difference as determined by Java's .equals(..), with nil/null equivalent to "not present".

i.e. I am looking for the most efficient way to a write a function like:

(map-difference
  {:a 1, :b nil, :c 2, :d 3}
  {:a 1, :b "Hidden", :c 3, :e 5})

=> {:b nil, :c 2, :d 3, :e nil}

I'd prefer an immutable Clojure map as output, but a Java map would also be fine if the performance improvement would be significant.

For what it's worth, my basic test case / expectation of behaviour is that the following will be equal (up to the equivalence of null = "Not present") for any two maps a and b:

a 
(merge b (difference a b))

What would be the best way to implement this?

+1  A: 
  1. Clojure maps will be fine because reading clojure maps is very fast.

  2. I can't answer you but I can give you something to look at. Brenton Ashworth made a testtool where he solved the problem with map compares. Maybe you can look at his code to get hint for implementation. http://formpluslogic.blogspot.com/2010/07/better-clojure-test-results-with-deview.html and http://github.com/brentonashworth/deview

  3. I don't think there is a better implementation that compare the keys and look up if the are different.

nickik
Please fix the syntax of your third point; it does not make sense.
Svante
+2  A: 

In Java, Google Commons Collections offer a quite performant solution.

Sylar
Thanks! This is useful although somewhat more general than what I was looking for (it produces a full set of map comparisons, not the simple difference map that I am looking for)
mikera
+2  A: 

Use the built-in collections API:

Set<Map.Entry<K,V>> difference = a.entrySet().removeAll(b.entrySet());

If you need to convert that back into a map, you must iterate. In that case, I suggest:

Map<K,V> result = new HashMap<K,V>(Math.max(a.size()), b.size()));
Set<Map.Entry<K,V>> filter = b.entrySet();
for( Map.Entry<K,V> entry : a.entrySet ) {
    if( !filter.contains( entry ) {
        result.put(entry.getKey(), entry.getValue());
    }
}
Aaron Digulla
+7  A: 

I'm not sure what the absolutely most efficient way to do this is, but here's a couple of things which may be useful:

  1. The basic expectation of behaviour from the question text is impossible: if a and b are maps such that b contains at least one key not present in a, (merge b <sth>) cannot be equal to a.

  2. If you end up going with an interop solution but then need to go back to a PersistentHashMap at some point, there's always

    (clojure.lang.PersistentHashMap/create
     (doto (java.util.HashMap.)
       (.put :foo 1)
       (.put :bar 2)))
    ; => {:foo 1 :bar 2}
    
  3. If you need to pass the keyset of a Clojure map to a Java method, you can use

    (.keySet {:foo 1 :bar 2})
    ; => #< [:foo, :bar]>
    
  4. If all keys involved are guaranteed to be Comparable, this could be exploited for efficient computation of difference on maps with many keys (sort & merge scan). For unconstrained keys this is of course a no-go and for small maps it could actually hurt performance.

  5. It's good to have a version written in Clojure, if only to set a baseline performance expectation. Here is one: (updated)

    (defn map-difference [m1 m2]
            (loop [m (transient {})
                   ks (concat (keys m1) (keys m2))]
              (if-let [k (first ks)]
                (let [e1 (find m1 k)
                      e2 (find m2 k)]
                  (cond (and e1 e2 (not= (e1 1) (e2 1))) (recur (assoc! m k (e1 1)) (next ks))
                        (not e1) (recur (assoc! m k (e2 1)) (next ks))
                        (not e2) (recur (assoc! m k (e1 1)) (next ks))
                        :else    (recur m (next ks))))
                (persistent! m))))
    

    I think that just doing (concat (keys m1) (keys m2)) and possibly duplicating some work is likely more efficient most of the time than checking a given key is in "the other map" too at every step.

To wrap up the answer, here's a very simple-minded set-based version with the nice property that it says what it does -- if I misunderstood the spec, it should be readily apparent here. :-)

(defn map-difference [m1 m2]
  (let [ks1 (set (keys m1))
        ks2 (set (keys m2))
        ks1-ks2 (set/difference ks1 ks2)
        ks2-ks1 (set/difference ks2 ks1)
        ks1*ks2 (set/intersection ks1 ks2)]
    (merge (select-keys m1 ks1-ks2)
           (select-keys m2 ks2-ks1)
           (select-keys m1
                        (remove (fn [k] (= (m1 k) (m2 k)))
                                ks1*ks2)))))
Michał Marczyk
Fantastic answer Michal, many thanks! Just one point regarding 1), if you specify that nil is equivalent to not present (as per the question), I think that the requirement is possible by having map-difference assign nil to the key that needs to be "removed".
mikera
I hope you are aware of defrecords? Maybe the are a solution if you don't need tha maps to be generic.
nickik
@mikera: Thanks, happy to hear that. :-) As for 1), that's a good point and it should be a minor tweak to any solution -- thanks for the correction! @nickik: I'm not sure what you have in mind. Could you elaborate?
Michał Marczyk
@Michal Marczyk - What happened to your elegant first baseline solution? I wanted to have a look at it, and it's gone now.
Adam Schmideg
@Adam Schmideg: I discovered it didn't match the spec and edited it out. You can see it in the revision history for this answer (linked to from the "such and such time ago" part of the "edited such and such time ago" notice immediately below the answer text).
Michał Marczyk
@Michal - thank you
Adam Schmideg
+2  A: 

I am not sure about its performance

(defn map-difference
  [orig other]
  (let [changed (set/difference (set orig) (set other))
        added (set/difference (set (keys other)) (set (keys orig)))]
    (reduce (fn [acc key]
              (assoc acc key :missing))
      (into {} changed)
      added)))

I used :missing key to avoid ambiguity between a nil value in the original map, and a missing key -- you can of course change it to nil.

Adam Schmideg