views: 185

answers: 4
I have a function that produces a random numerical result. I know that the result will be an integer in a small range a, b (about 50 values wide). I want to create a function that executes the above function, say, 1000000 times and counts how often each result appears. (The function takes a random generator to produce the result.) The problem is that I don't know how to do this in constant memory without hard-coding the range's length. My (bad) approach looks like this:

values :: [Int]
values = doFunctionNtimes myRandom 1000000
results = map (\x -> length . filter (x ==) $ values) [a..b]

Does anybody have an idea how to do this?

Edit:

I think I explained the problem badly, sorry for that. I have a function which, depending on a random generator, yields some small Int value. To gather statistics, I want to know how often each result appears. Since I want to make stats over, say, 1000000 tries, I need memory usage that stays constant in the number of tries.

+2  A: 

So you have an unbounded number of possible results and want to count how many times each appears, in constant memory. That is clearly impossible to do exactly, but a data structure called the count-min sketch can be used to get a pretty good approximation. In your case, store the results in a count-min sketch while keeping track of the minimum and maximum values separately, and at the end query the sketch for every integer from the minimum to the maximum.
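For a rough idea of what that looks like in Haskell (this is only an illustrative sketch, not code from this answer; the hash family, depth and width below are arbitrary placeholders), a toy count-min sketch could be built like this:

import Data.Array (Array, accumArray, (!))

-- A toy hash family: row `seed` maps key x to a column in [0, w).
-- This is a placeholder, not a proper pairwise-independent hash.
hash :: Int -> Int -> Int -> Int
hash seed w x = ((x * (2 * seed + 3) + seed * 7919) `mod` 104729) `mod` w

-- Build a depth-by-width table of counters from a list of observations.
buildSketch :: Int -> Int -> [Int] -> Array (Int, Int) Int
buildSketch depth w xs =
  accumArray (+) 0 ((0, 0), (depth - 1, w - 1))
    [ ((row, hash row w x), 1) | x <- xs, row <- [0 .. depth - 1] ]

-- The estimated count of x is the minimum over all rows.
query :: Int -> Int -> Array (Int, Int) Int -> Int -> Int
query depth w table x =
  minimum [ table ! (row, hash row w x) | row <- [0 .. depth - 1] ]

main :: IO ()
main = do
  let xs     = [1, 2, 1, 3, 2, 1]   -- sample observations
      (d, w) = (4, 32)              -- arbitrary depth and width
      sk     = buildSketch d w xs
  print [ (v, query d w sk v) | v <- [1 .. 3] ]

With a real hash family and the depth and width chosen for the desired error bounds, the table has a fixed size, which is what gives the near-constant-memory behaviour described above.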

Jouni K. Seppänen
Usually hashtable-like methods are not preferred in Haskell, because they depend strongly on impurity. So you lose composability, which is one of Haskell's biggest strengths. Often you can reformulate such methods in terms of trees, or a combination of hashes and trees. But the authors of CM weren't thinking about the problem of impurity, and it is fairly recent work, so I doubt anyone else has done that yet either. You could implement it using impure methods in Haskell, but in that case, you might as well link to the C library using FFI.
Yitz
Somebody has implemented the Bloom filter with persistent vectors, and the CM sketch is a generalization of that, so you could probably do something similar: http://www.codecommit.com/blog/scala/bloom-filters-in-scala It won't be constant-memory, though.
Jouni K. Seppänen
As I said, the integer is in a range a, b, which has a size of about 50, so I can simply keep the counts in an array.
FUZxxl
A: 

As Jouni already mentioned, constant memory is not possible in general, but this count-min sketch sounds like the bomb! (although I had not heard of it before). What I think you might have asked for, though, is the possibility to store the counts in one array and only update each frequency. This can be done in Haskell with mutable arrays. Here is an example:

import Control.Monad (liftM, replicateM_)
import Data.Array.IO (IOUArray, getAssocs, newArray, readArray, writeArray)
import System.Environment (getArgs)
import System.Random (newStdGen)

-- a, b and myRand stand for the range and the random action from the question.
main = do gen <- newStdGen
          n <- liftM (read . head) getArgs
          arr <- newArray (a, b) 0 :: IO (IOUArray Int Int)
          replicateM_ n $ do
               result <- myRand
               x <- readArray arr result
               writeArray arr result (x + 1)
          (getAssocs arr :: IO [(Int, Int)]) >>= print

Running the program with +RTS -s and input 1000000, we get the output:

787,874,256 bytes allocated in the heap
         364,536 bytes copied during GC
           5,984 bytes maximum residency (1 sample(s))
          17,928 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

...
  Total time    0.29s  (  0.30s elapsed)
...
  %GC time       0.3%  (2.1% elapsed)
HaskellElephant
Sounds good... If I understood right, the memory depends only on the length of the array?
FUZxxl
@FUZxxl, yes...
HaskellElephant
+3  A: 

The way I usually handle this kind of problem is to keep track of counts in a map. Data.IntMap works in this case:

import qualified Data.IntMap as I

results :: [Int] -> I.IntMap Int
results = foldr (\x -> I.insertWith (+) x 1) I.empty

At this point you can ask for the endpoints of the range (I.findMin and I.findMax) or look up the count at a particular value in O(log n). It's also very easy to stick everything in an array for faster lookups.
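As a rough sketch of that last point (not from the original answer; toArray is just an illustrative name, and it assumes the map is non-empty), you could copy the finished counts into an immutable array:

import Data.Array (Array, accumArray)
import qualified Data.IntMap as I

-- Copy the counts into an array spanning the observed range,
-- defaulting to 0 for values that never occurred.
toArray :: I.IntMap Int -> Array Int Int
toArray m = accumArray (+) 0 (lo, hi) (I.toList m)
  where (lo, _) = I.findMin m
        (hi, _) = I.findMax m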


UPDATE: See luqui's answer for a much better version of this code.

Travis Brown
+8  A: 
import qualified Data.Map as Map
import Data.List (foldl')          -- ' (to fix SO syntax highlighting)

histogram :: (Ord a) => [a] -> Map.Map a Int
histogram = foldl' (\m x -> Map.insertWith' (+) x 1 m) Map.empty

The explanation for why this works and why it is superior to Travis Brown's solution is pretty technical, and will require some patience to understand fully.

If there are only finitely many values that can possibly occur in the list, then this runs in constant memory. Travis's solution has a subtle bug in which the resulting map entries will look like:

(4, 1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1)

A very inefficient representation of the number 19. Only when you ask for that element in the map will the giant sum be computed. These "thunks" (delayed-evaluation expressions) will grow linearly with the size of the input.

To prevent this, we use insertWith', which applies the function strictly; that is to say, it evaluates the result before putting it in the map. So if you insert another 4 into the map above, it will evaluate the thunk and you will get a nice tidy:

(4, 20)

And inserting yet another 4 will evaluate that before adding, so you will get:

(4, 21)

So now at least the values of the map are constant space.

The final thing we need to do is change the right fold to a left fold, because the insertion is strict in the map it inserts into (its second argument). The following demonstrates the meaning of a right fold, using iw as a simple shorthand.

iw x m = Map.insertWith' (+) x 1 m    -- '

foldr iw Map.empty [1,2,1,3,2,1]
    = iw 1 (iw 2 (iw 1 (iw 3 (iw 2 (iw 1 Map.empty)))))

Being strict in the map argument means you need to evaluate the map into which you are inserting before the insert can do any work. I will use the notation { k1 -> v1, k2 -> v2, ... } as a shorthand for maps. The sequence of evaluation looks like this:

foldr f z [] = z
foldr f z (x:xs) = f x (foldr f z xs)

foldr iw {} [1,2,1,3,2,1]
iw 1 (foldr iw {} [2,1,3,2,1])
iw 1 (iw 2 (foldr iw {} [1,3,2,1]))
iw 1 (iw 2 (iw 1 (foldr iw {} [3,2,1])))
iw 1 (iw 2 (iw 1 (iw 3 (foldr iw {} [2,1]))))
iw 1 (iw 2 (iw 1 (iw 3 (iw 2 (foldr iw {} [1])))))
iw 1 (iw 2 (iw 1 (iw 3 (iw 2 (iw 1 (foldr iw {} []))))))
iw 1 (iw 2 (iw 1 (iw 3 (iw 2 (iw 1 {})))))
iw 1 (iw 2 (iw 1 (iw 3 (iw 2 {1 -> 1}))))
iw 1 (iw 2 (iw 1 (iw 3 {1 -> 1, 2 -> 1})))
iw 1 (iw 2 (iw 1 {1 -> 1, 2 -> 1, 3 -> 1}))
iw 1 (iw 2 {1 -> 2, 2 -> 1, 3 -> 1})
iw 1 {1 -> 2, 2 -> 2, 3 -> 1}
{1 -> 3, 2 -> 2, 3 -> 1}

So if you have a 1,000,000-element list, we have to go all the way down to the 1,000,000th element before we can start inserting, which means keeping the previous 999,999 elements in memory so we know what to do next. A left fold solves this:

-- definition of left fold
foldl' f z xs = go z xs             -- '
    where
    go accum [] = accum
    go accum (x:xs) = accum `seq` go (f accum x) xs

foldl' (flip iw) Map.empty [1,2,1,3,2,1]  -- needed to flip arg order to appease foldl'
go {} [1,2,1,3,2,1]
go (iw 1 {}) [2,1,3,2,1]
go (iw 2 {1 -> 1}) [1,3,2,1]
go (iw 1 {1 -> 1, 2 -> 1}) [3,2,1]
go (iw 3 {1 -> 2, 2 -> 1}) [2,1]
go (iw 2 {1 -> 2, 2 -> 1, 3 -> 1}) [1]
go (iw 1 {1 -> 2, 2 -> 2, 3 -> 1}) []
iw 1 {1 -> 2, 2 -> 2, 3 -> 1}
{1 -> 3, 2 -> 2, 3 -> 1}

Now we can see that, finally, if the number of entries in the map is bounded, then this runs in constant space and linear time.
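To tie this back to the question (the wiring below is my own sketch, not part of the answer: randomRs and the range (0, 49) merely stand in for the asker's myRandom and its actual range), histogram can consume a lazily generated list of draws and still run in constant space:

import Data.List (foldl')
import qualified Data.Map as Map
import System.Random (newStdGen, randomRs)

-- histogram as defined above; on newer versions of containers,
-- Data.Map.Strict.insertWith plays the role of insertWith'.
histogram :: (Ord a) => [a] -> Map.Map a Int
histogram = foldl' (\m x -> Map.insertWith' (+) x 1 m) Map.empty

main :: IO ()
main = do
  gen <- newStdGen
  let (lo, hi) = (0, 49)                               -- placeholder range
      draws    = take 1000000 (randomRs (lo, hi) gen) :: [Int]
  print (Map.toList (histogram draws))

Because the list of draws is generated lazily and consumed by the strict left fold, only the map of counts is retained, so memory stays bounded by the number of distinct results rather than the number of tries.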

luqui
This is what I needed. Big thank you!!!
FUZxxl