Edit: Added the fact that the list is sorted, and, realizing 'duplicate' was misleading, replaced it with 'redundant' in the title.

I have a sorted list of entries stating a production value in a given interval. Entries stating the exact same value at a later time add no information and can safely be left out.

case class Entry(minute:Int, production:Double)
val entries = List(Entry(0, 100.0), Entry(5, 100.0), Entry(10, 100.0), Entry(20, 120.0), Entry(30, 100.0), Entry(180, 0.0))

Experimenting with the Scala 2.8 collection functions, I have this working implementation so far:

entries.foldRight(List[Entry]()) {
  (entry, list) => list match {
    // The next kept entry has the same production: it is redundant, so replace it with this earlier entry.
    case head :: tail if entry.production == head.production => entry :: tail
    // The production changes here: keep both entries.
    case head :: tail => entry :: list
    // Empty accumulator (the last entry): always keep it.
    case List() => entry :: List()
  }
}
res0: List[Entry] = List(Entry(0,100.0), Entry(20,120.0), Entry(30,100.0), Entry(180,0.0))

Any comments? Am I missing out on some Scala magic?

+8  A: 

When comparing consecutive entries in a list, start by zipping the list with its tail to get a list of pairs of consecutive elements.

Below, I prepend the first entry of the list and use collect to simultaneously filter out pairs where the production is unchanged and map each remaining pair to its second element, e2. (collect is new in Scala 2.8; for a while it was called partialMap.)

scala> entries.head :: ((entries zip entries.tail).collect { 
           case (Entry(_, p1), e2@Entry(_, p2)) if p1 != p2 => e2 
       })
res13: List[Entry] = List(Entry(0,100.0), Entry(20,120.0), Entry(30,100.0), Entry(180,0.0))

UPDATE: For simplicity, this assumes that entries is not empty.

retronym
very nice general idea, zipping with the tail. It is somewhat slower than foldRight, though: 2x on my setup (2.8.0.Beta1-RC3, where collect is still 'partialMap'; dunno if that affects performance)
andersbohn
@andersbohn You can use `entries.view zip entries.tail` to get better performance out of it (`.toList` in the end), but my tests put your version at 30, `view`'s at 63 and retronym's at 81.
Daniel
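
A sketch of the view-based variant Daniel describes, under the same non-empty assumption as the answer above (hypothetical code, not from the thread):

// Zip a lazy view with the tail so the intermediate pairs are never built
// as a list, then force the result with toList at the end.
entries.head :: (entries.view zip entries.tail)
  .filter { case (e1, e2) => e1.production != e2.production }
  .map(_._2)
  .toList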
A: 

Instead of looking for duplicates for each item, which is O(n^2), or zipping, which is n^2 in memory, use a Map[Double, Int]. Then just add the items with the production as the key and the minute as the value. The map will ensure unique values for production. You may be able to load the map naturally elsewhere in your code, but even if you have to start with the list as above, loading the map is linear in the list and only O(n log n) on the map.

The map replaces the existing value as you add (`mymap += production -> minute`), so to keep the first value, reverse the list before you insert, or use a `contains(key)` guard (sketched below). The checks are O(log n), so the algorithm overall is O(n log n).
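
A minimal sketch of that idea with the contains guard, assuming a mutable Map (hypothetical code, not from the answer):

import scala.collection.mutable

// Keep the first minute seen for each distinct production value;
// the contains guard prevents later entries from overwriting it.
val mymap = mutable.Map[Double, Int]()
for (e <- entries)
  if (!mymap.contains(e.production))
    mymap += e.production -> e.minute

Note this keeps one entry per distinct production value overall, not per consecutive run; the comments below discuss whether that matches the original problem.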

BTW, you could use a Map[Double, Entry] to map from production values directly to your Entry structures. Then, if needed, you can easily get a list out by pulling the values directly from the map and sorting on either field of the Entry (sketched below).
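
Sketched the same way (hypothetical code; sortBy is the 2.8 method assumed here):

// Map production values directly to Entry structures, then
// recover a list ordered by minute when a list is needed.
val byProduction = mutable.Map[Double, Entry]()
for (e <- entries)
  if (!byProduction.contains(e.production))
    byProduction += e.production -> e
val asList = byProduction.values.toList.sortBy(_.minute)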

DrGary
I think you're misreading. Andersbohn only needs to go once through the list; it's already sorted, and if a production comes up, changes, and then changes back, you _do_ need the new production. (The point is just to throw out anything you're already doing as redundant.) Both retronym's code and andersbohn's are `O(n)`; they pass once through the data.
Rex Kerr
Perhaps; I don't think the original question was so specific. Hopefully, my answer will be helpful to others with similar questions. Also, searching the whole list every time makes the algorithm O(n^2) in the number of items. That can be improved with a hashtable or tree structure.
DrGary
@DrGary If you'd said something about O(log n) updates, maybe I'd agree. Otherwise, why use a map when you can sort in O(n log n) and then remove duplicates in O(n)?
Rex Kerr
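
A hedged sketch of the alternative Rex Kerr describes, reusing the foldRight from the question after a sort (hypothetical; only relevant if the list were not already sorted):

// Sort by minute in O(n log n), then drop consecutive redundant
// entries in a single O(n) pass, as in the original question.
entries.sortBy(_.minute).foldRight(List[Entry]()) {
  (e, acc) => acc match {
    case h :: t if e.production == h.production => e :: t
    case _ => e :: acc
  }
}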
Glad we agree. Use the map to amortize the sort costs over the lifetime of the updates, avoiding resorts.
DrGary
Indeed. Except, do these ever need to be resorted? The original post didn't say they did.
Rex Kerr
+1  A: 

There's a new zipped method on Tuple2 that is more efficient (and lazier) than zip on lists for some operations. You might try this out on your benchmark; I don't know if it's actually faster, but it certainly could be (and it's definitely a lot shorter):

entries.take(1) :::
(entries, entries.drop(1)).zipped.filter(_.production != _.production)._2
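
Applied to the sample entries above, this should give the same result as the earlier solutions: List(Entry(0,100.0), Entry(20,120.0), Entry(30,100.0), Entry(180,0.0)).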

Instead of zipping the list pairwise all the way through, it creates a view of the list where the pieces can be manipulated together, and then returns the manipulated lists. Note the use of take and drop to handle the empty case.

It's not super-efficient since it builds two lists when you really only need one, but it is a class of solution that hasn't come up yet.

Rex Kerr