I am being passed a series of KeyValuePair<string, uint> pairs, where the string represents a value and the uint represents how frequently the value has occurred in the source data. I need to be able to hold in memory the x most/least frequently occurring values, along with their frequencies.

x in this case should be reasonably small, but I may have to examine several million pairs. Please note also that I am not able to change how the pairs are passed to me.

What is the best way to go about this? I'm guessing that having two arrays might be the best bet and as each value is passed, depending on the value, insert it into the sorted array and drop the least/most frequent value out.

+2  A: 

It sounds like you’re looking for the priority queue data structure. Simply build two, one for the most frequent pairs and one for the least frequent ones. Add each pair as it arrives and, whenever a queue grows beyond x items, remove the element that no longer belongs, so each queue only ever holds x values; this is especially easy with priority queues. For example, to keep only the ten largest items (pseudo-code):

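// Min-first queue ordered by frequency (kvp.Value)
PriorityQueue pq = new PriorityQueue();

foreach (var kvp in input) {
    pq.Add(kvp);
    // More than ten items held: drop the least frequent one
    if (pq.Count > 10)
        pq.RemoveMin();
}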
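
To make the two-queue idea concrete, here is a minimal sketch in Python using the standard heapq module. It is only an illustration under a few assumptions: the pairs arrive as (value, frequency) tuples, and the function name top_and_bottom and its parameter names are made up for this example. The "least frequent" queue stores negated frequencies so that the same min-heap operations can evict its largest element.

```python
import heapq
from typing import Iterable, List, Tuple

Pair = Tuple[str, int]  # (value, frequency); this shape is an assumption


def top_and_bottom(pairs: Iterable[Pair], x: int) -> Tuple[List[Pair], List[Pair]]:
    """Return (x most frequent, x least frequent) pairs (hypothetical helper)."""
    # Min-heap keyed on frequency: the root is the smallest of the kept "most frequent".
    most: List[Tuple[int, str]] = []
    # Min-heap keyed on negated frequency: the root is the largest of the kept "least frequent".
    least: List[Tuple[int, str]] = []

    for value, freq in pairs:
        heapq.heappush(most, (freq, value))
        if len(most) > x:
            heapq.heappop(most)   # evict the least frequent of the candidates

        heapq.heappush(least, (-freq, value))
        if len(least) > x:
            heapq.heappop(least)  # evict the most frequent of the candidates

    # Most frequent first, then least frequent first.
    most_frequent = [(v, f) for f, v in sorted(most, reverse=True)]
    least_frequent = [(v, -nf) for nf, v in sorted(least, reverse=True)]
    return most_frequent, least_frequent


sample = [("a", 5), ("b", 1), ("c", 9), ("d", 3), ("e", 7)]
most_frequent, least_frequent = top_and_bottom(sample, 2)
```

Each heap never holds more than x + 1 entries, so memory stays proportional to x, and each incoming pair costs O(log x) work regardless of how many millions of pairs are examined. Ties in frequency are broken by comparing the strings, which is a side effect of the tuple ordering rather than a deliberate choice.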
Konrad Rudolph
Thanks for the pointer Konrad. Works nicely. I used an implementation from the C5 Generic Collections Library (http://www.itu.dk/research/c5/).
dbush