views: 185

answers: 4
I have a function that produces a random numerical result. I know that the result will be an integer in a small range a, b (about 50 values wide). I want to create a function that executes the above function, say, 1000000 times and counts how often each result appears. (The function takes a random generator to produce the result.) The problem is that I don't know how to do this in constant memory without hard-coding the range's length. My (bad) approach looks like this:

values :: [Int]
values = doFunctionNtimes myRandom 1000000
results = map (\x -> length . filter (x ==) $ values) [a..b]

Does anybody have an idea how to do this?

Edit:

I think I explained the problem badly, sorry for that. I have a function which, depending on a random generator, yields some small Int value. To gather statistics, I want to know how often each result appears. Since I want to make stats over, say, 1000000 tries, I need memory usage that stays constant in the number of tries.

+2  A: 

So you have an unbounded number of possible results and want to count how many times each appears, in constant memory. That is clearly impossible to do exactly, but a data structure called the count-min sketch can be used to get a pretty good approximation. In your case, store the results in a count-min sketch while keeping track of the minimum and maximum values separately, and at the end query the sketch for every integer from the minimum to the maximum.
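For a rough idea of what that looks like in Haskell (this is only an illustrative sketch, not code from this answer; the hash family, depth and width below are arbitrary placeholders), a toy count-min sketch could be built like this:

import Data.Array (Array, accumArray, (!))

-- A toy hash family: row `seed` maps key x to a column in [0, w).
-- This is a placeholder, not a proper pairwise-independent hash.
hash :: Int -> Int -> Int -> Int
hash seed w x = ((x * (2 * seed + 3) + seed * 7919) `mod` 104729) `mod` w

-- Build a depth-by-width table of counters from a list of observations.
buildSketch :: Int -> Int -> [Int] -> Array (Int, Int) Int
buildSketch depth w xs =
  accumArray (+) 0 ((0, 0), (depth - 1, w - 1))
    [ ((row, hash row w x), 1) | x <- xs, row <- [0 .. depth - 1] ]

-- The estimated count of x is the minimum over all rows.
query :: Int -> Int -> Array (Int, Int) Int -> Int -> Int
query depth w table x =
  minimum [ table ! (row, hash row w x) | row <- [0 .. depth - 1] ]

main :: IO ()
main = do
  let xs     = [1, 2, 1, 3, 2, 1]   -- sample observations
      (d, w) = (4, 32)              -- arbitrary depth and width
      sk     = buildSketch d w xs
  print [ (v, query d w sk v) | v <- [1 .. 3] ]

With a real hash family and the depth and width chosen for the desired error bounds, the table has a fixed size, which is what gives the near-constant-memory behaviour described above.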

Jouni K. Seppänen
Usually hashtable-like methods are not preferred in Haskell, because they depend strongly on impurity. So you lose composability, which is one of Haskell's biggest strengths. Often you can reformulate such methods in terms of trees, or a combination of hashes and trees. But the authors of CM weren't thinking about the problem of impurity, and it is fairly recent work, so I doubt anyone else has done that yet either. You could implement it using impure methods in Haskell, but in that case, you might as well link to the C library using FFI.
Yitz
Somebody has implemented the Bloom filter with persistent vectors, and the CM sketch is a generalization of that, so you could probably do something similar: http://www.codecommit.com/blog/scala/bloom-filters-in-scala It won't be constant-memory, though.
Jouni K. Seppänen
As I said, the integer is in a range a, b, which has a size of about 50, so I can simply keep the counts in an array.
FUZxxl
A: 

As Jouni already mentioned, constant memory is not possible in general, but this count-min sketch sounds like the bomb! (although I had not heard of it before). What I think you might have asked for, though, is the possibility to store the counts in one array and only update each frequency. This can be done in Haskell with mutable arrays. Here is an example:

import Control.Monad (liftM, replicateM_)
import Data.Array.IO (IOUArray, getAssocs, newArray, readArray, writeArray)
import System.Environment (getArgs)
import System.Random (newStdGen)

-- a, b and myRand stand for the range and the random action from the question.
main = do gen <- newStdGen
          n <- liftM (read . head) getArgs
          arr <- newArray (a, b) 0 :: IO (IOUArray Int Int)
          replicateM_ n $ do
               result <- myRand
               x <- readArray arr result
               writeArray arr result (x + 1)
          (getAssocs arr :: IO [(Int, Int)]) >>= print

Running the program with +RTS -s and input 1000000, we get the output:

787,874,256 bytes allocated in the heap
         364,536 bytes copied during GC
           5,984 bytes maximum residency (1 sample(s))
          17,928 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

...
  Total time    0.29s  (  0.30s elapsed)
...
  %GC time       0.3%  (2.1% elapsed)
HaskellElephant
Sounds good... If I understood right, the memory depends only on the length of the array?
FUZxxl
@FUZxxl, yes...
HaskellElephant
+3  A: 

The way I usually handle this kind of problem is to keep track of counts in a map. Data.IntMap works in this case:

import qualified Data.IntMap as I

results :: [Int] -> I.IntMap Int
results = foldr (\x -> I.insertWith (+) x 1) I.empty

At this point you can ask for the endpoints of the range (I.findMin and I.findMax) or look up the count at a particular value in O(log n). It's also very easy to stick everything in an array for faster lookups.
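As a rough sketch of that last point (not from the original answer; toArray is just an illustrative name, and it assumes the map is non-empty), you could copy the finished counts into an immutable array:

import Data.Array (Array, accumArray)
import qualified Data.IntMap as I

-- Copy the counts into an array spanning the observed range,
-- defaulting to 0 for values that never occurred.
toArray :: I.IntMap Int -> Array Int Int
toArray m = accumArray (+) 0 (lo, hi) (I.toList m)
  where (lo, _) = I.findMin m
        (hi, _) = I.findMax m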


UPDATE: See luqui's answer for a much better version of this code.

Travis Brown
+8  A: 
import qualified Data.Map as Map
import Data.List (foldl')          -- ' (to fix SO syntax highlighting)

histogram :: (Ord a) => [a] -> Map.Map a Int
histogram = foldl' (\m x -> Map.insertWith' (+) x 1 m) Map.empty

The explanation for why this works and why it is superior to Travis Brown's solution is pretty technical, and will require some patience to understand fully.

If there are only finitely many values that can possibly occur in the list, then this runs in constant memory. Travis's solution has a subtle bug in which the resulting map entries will look like:

(4, 1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1)

A very inefficient representation of the number 19. Only when you ask for that element in the map will the giant sum be computed. These "thunks" (delayed-evaluation expressions) will grow linearly with the size of the input.

To prevent this, we use insertWith', which applies the function strictly; that is to say, it evaluates the result before putting it in the map. So if you insert another 4 into the map above, it will evaluate the thunk and you will get a nice tidy:

(4, 20)

And inserting yet another 4 will evaluate that before adding, so you will get:

(4, 21)

So now at least the values of the map are constant space.

The final thing we need to do is change the right fold to a left fold, because the insertion is strict in the map it inserts into (its second argument). The following demonstrates the meaning of a right fold, using iw as a simple shorthand.

iw x m = Map.insertWith' (+) x 1 m    -- '

foldr iw Map.empty [1,2,1,3,2,1]
    = iw 1 (iw 2 (iw 1 (iw 3 (iw 2 (iw 1 Map.empty)))))

Being strict in the map argument means you need to evaluate the map into which you are inserting before the insert can do any work. I will use the notation { k1 -> v1, k2 -> v2, ... } as a shorthand for maps. The sequence of evaluation looks like this:

foldr f z [] = z
foldr f z (x:xs) = f x (foldr f z xs)

foldr iw {} [1,2,1,3,2,1]
iw 1 (foldr iw {} [2,1,3,2,1])
iw 1 (iw 2 (foldr iw {} [1,3,2,1]))
iw 1 (iw 2 (iw 1 (foldr iw {} [3,2,1])))
iw 1 (iw 2 (iw 1 (iw 3 (foldr iw {} [2,1]))))
iw 1 (iw 2 (iw 1 (iw 3 (iw 2 (foldr iw {} [1])))))
iw 1 (iw 2 (iw 1 (iw 3 (iw 2 (iw 1 (foldr iw {} []))))))
iw 1 (iw 2 (iw 1 (iw 3 (iw 2 (iw 1 {})))))
iw 1 (iw 2 (iw 1 (iw 3 (iw 2 {1 -> 1}))))
iw 1 (iw 2 (iw 1 (iw 3 {1 -> 1, 2 -> 1})))
iw 1 (iw 2 (iw 1 {1 -> 1, 2 -> 1, 3 -> 1}))
iw 1 (iw 2 {1 -> 2, 2 -> 1, 3 -> 1})
iw 1 {1 -> 2, 2 -> 2, 3 -> 1}
{1 -> 3, 2 -> 2, 3 -> 1}

So if you have a 1,000,000-element list, we have to go all the way down to the 1,000,000th element before we can start inserting, which means keeping the previous 999,999 elements in memory so we know what to do next. A left fold solves this:

-- definition of left fold
foldl' f z xs = go z xs             -- '
    where
    go accum [] = accum
    go accum (x:xs) = accum `seq` go (f accum x) xs

foldl' (flip iw) Map.empty [1,2,1,3,2,1]  -- needed to flip arg order to appease foldl'
go {} [1,2,1,3,2,1]
go (iw 1 {}) [2,1,3,2,1]
go (iw 2 {1 -> 1}) [1,3,2,1]
go (iw 1 {1 -> 1, 2 -> 1}) [3,2,1]
go (iw 3 {1 -> 2, 2 -> 1}) [2,1]
go (iw 2 {1 -> 2, 2 -> 1, 3 -> 1}) [1]
go (iw 1 {1 -> 2, 2 -> 2, 3 -> 1}) []
iw 1 {1 -> 2, 2 -> 2, 3 -> 1}
{1 -> 3, 2 -> 2, 3 -> 1}

Now we can see that, finally, if the number of entries in the map is bounded, then this runs in constant space and linear time.
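To tie this back to the question (the wiring below is my own sketch, not part of the answer: randomRs and the range (0, 49) merely stand in for the asker's myRandom and its actual range), histogram can consume a lazily generated list of draws and still run in constant space:

import Data.List (foldl')
import qualified Data.Map as Map
import System.Random (newStdGen, randomRs)

-- histogram as defined above; on newer versions of containers,
-- Data.Map.Strict.insertWith plays the role of insertWith'.
histogram :: (Ord a) => [a] -> Map.Map a Int
histogram = foldl' (\m x -> Map.insertWith' (+) x 1 m) Map.empty

main :: IO ()
main = do
  gen <- newStdGen
  let (lo, hi) = (0, 49)                               -- placeholder range
      draws    = take 1000000 (randomRs (lo, hi) gen) :: [Int]
  print (Map.toList (histogram draws))

Because the list of draws is generated lazily and consumed by the strict left fold, only the map of counts is retained, so memory stays bounded by the number of distinct results rather than the number of tries.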

luqui
This is what I needed. Big thank you!!!
FUZxxl