views:

55

answers:

2

I'm looking for a data structure with which I can find the most frequently occuring number (among an array of numbers) in a given, variable range.

Let's consider the following 1 based array:

1 2 3 1 1 3 3 3 3 1 1 1 1

If I query the range (1,4), the data structure must retun 1, which occurs twice. Several other examples:

(1,13) = 1

(4,9) = 3

(2,2) = 2

(1,3) = 1 (all of 1,2,3 occur once, so return the first/smallest one. not so important at the moment)

I have searched, but could not find anything similar. I'm looking (ideally) a data structure with minimal space requirement, fast preprocessing, and/or query complexities.

Thanks in advance!

A: 

You could create a binary partition tree where each node represents a histogram map of {value -> frequency} for a given range, and has two child nodes which represent the upper half and lower half of the range.

Querying is then just a case of recursively adding together a small number of these histograms to cover the range required, and scanning the resulting histogram once to find the highest occurrence count.

Useful optimizations include:

  • Using a histogram with mutable frequency counts as an "accumulator" while you add histograms together
  • Stop using precomputed histograms once you get down to a certain size (maybe a range less than the total number of possible values M) and just counting the numbers directly. It's a time/space trade-off that I think will pay off a lot of the time.
  • If you have a fixed small number of possible values, use an array rather than a map to store the frequency counts at each node

UPDATE: my thinking on algorithmic complexity assuming a bounded small number of possible values M and a total of N values in the complete range:

  • Preprocessing is O(N log N) - basically you need to traverse the complete list and build a binary tree, building one node for every M elements in order to amortise the overhead of each node
  • Querying is O(M log N) - basically adding up O(log N) histograms each of size M, plus counting O(M) values on either side of the range
  • Space requirement is O(N) - approx. 2N/M histograms each of size M. The 2 factor is the sum from having N/M histograms at the bottom level, 0.5N/M histograms at the next level, 0.25N/M at the third level etc...
mikera
Can you provide me with preprocessing and query complexities of your idea? Especially preprocessing seemed high for me...
kolistivra
Depends on your data distribution . With M possibles values and N total elements in the array, I think it's M log N for preprocessing and M log N for queries. Space is also M log N. Hence this approach is best (and I believe asymptotically better than all the other solutions posted) in the case that N is very large compared to M.
mikera
Sorry meant to say O(N log N) for preprocessing. That's assuming M is small and bounded compared to N of course.
mikera
Ahh actually it's harder than I thought because of the amortisation. Updated my latest thinking in the answer.
mikera
+1  A: 

Let N be the size of the array and M the number of different values in that array.

I'm considering two complexities : pre-processing and querying an interval of size n, each must be spacial and temporal.


Solution 1 :

  • Spacial : O(1) and O(M)
  • Temporal : O(1) and O(n + M)

No pre-processing, we look at all values of the interval and find the most frequent one.


Solution 2 :

  • Spacial : O(M*N) and O(1)
  • Temporal : O(M*N) and O(min(n,M))

For each position of the array, we have an accumulative array that gives us for each value x, how many times x is in the array before that position.

Given an interval we just need for each x to subtract 2 values to find the number of x in that interval. We iterate over each x and find the maximum value. If n < M we iterate over each value of the interval, otherwise we iterate over all possible values for x.


Solution 3 :

  • Spacial : O(N) and O(1)
  • Temporal : O(N) and O(min(n,M)*log(n))

For each value x build a binary heap of all the position in the array where x is present. The key in your heap is the position but you also store the total number of x between this position and the begin of the array.

Given an interval we just need for each x to subtract 2 values to find the number of x in that interval : in O(log(N)) we can ask the x's heap to find the two positions just before the start/end of the interval and substract the numbers. Basically it needs less space than a histogram but the query in now in O(log(N)).

Loïc Février
Thanks for different solutions, but I'm looking for something with better worst case complexities. +1 for good organization of diverse ideas.
kolistivra
What are your constraints in term of space/time ? I admin that O(min(n,M)*log(n)) can still be big, what is the maximum you can do ?
Loïc Février
Think of this problem as more of theoretical interest of me, rather than a practical thing. I don't know how best we can get, but I keep on thinking =)
kolistivra
Alright ;) I think we can improve time complexity but we'll need more space. If you have a solution with better complexity than mine, I'm interested.
Loïc Février