tags:

views: 597

answers: 6

We get these ~50 GB data files consisting of 16-byte codes, and I want to find any code that occurs 0.5% of the time or more. Is there any way I can do that in a single pass over the data?

Edit: There are tons of codes - it's possible that every code is different.

EPILOGUE: I've selected Darius Bacon's as the best answer, because I think the best algorithm is a modification of the majority-element algorithm he linked to. The majority algorithm should be modifiable to use only a tiny amount of memory - about 201 codes to catch 0.5%, I think. Basically you walk the stream keeping counts of up to 201 distinct codes. As soon as you find a 201st distinct code, you drop one occurrence of each code (deduct 1 from every counter, forgetting anything that reaches 0). At the end, you have dropped at most N/201 times, so any code occurring more times than that must still be around.

But it's a two-pass algorithm, not one. You need a second pass to tally the exact counts of the candidates. It's actually easy to see that any solution to this problem must use at least two passes: the first batch of elements you load could all be distinct, and one of those codes could still end up occurring exactly 0.5% of the time.
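Here is a minimal sketch of that candidate-finding pass in Python (my reading of the algorithm described above, not code from any of the answers); using 200 stored counters is equivalent to the 201-distinct-codes description, since the incoming code plus 200 stored codes make 201 distinct codes at the moment of a drop:

    from collections import Counter

    CODE_SIZE = 16        # 16-byte codes, per the question
    MAX_CANDIDATES = 200  # enough counters that anything at or above 0.5% must survive

    def candidate_codes(path):
        # Pass 1: keep at most MAX_CANDIDATES counters; when a new code would
        # exceed the limit, "drop one of each" by decrementing every counter
        # (and discarding the new code), forgetting counters that reach 0.
        counters = Counter()
        with open(path, "rb") as f:
            while code := f.read(CODE_SIZE):
                if code in counters or len(counters) < MAX_CANDIDATES:
                    counters[code] += 1
                else:
                    for c in list(counters):
                        counters[c] -= 1
                        if counters[c] == 0:
                            del counters[c]
        return set(counters)

    def verify(path, candidates, fraction=0.005):
        # Pass 2: exact counts, but only for the surviving candidates.
        counts = Counter()
        total = 0
        with open(path, "rb") as f:
            while code := f.read(CODE_SIZE):
                total += 1
                if code in candidates:
                    counts[code] += 1
        return {code: c for code, c in counts.items() if c >= fraction * total}

Each drop event removes 201 occurrences (the discarded newcomer plus 200 decrements), so there are at most N/201 drops, which is the guarantee used above.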

Thanks for the help!

+3  A: 

This will depend on the distribution of the codes. If there is a small enough number of distinct codes, you can build a frequency distribution (http://en.wikipedia.org/wiki/Frequency_distribution) in core with a map. Otherwise you will probably have to build a histogram (http://en.wikipedia.org/wiki/Histogram) over buckets of codes and then make multiple passes over the data, examining the frequencies of the codes in each bucket.
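A rough sketch of the bucket idea in Python (not Ray's code; the number of buckets, the hash, and the two-pass structure are assumptions), assuming the bucket array fits in memory and the codes that land in heavy buckets fit in memory on the second pass:

    import hashlib
    from collections import Counter

    CODE_SIZE = 16         # 16-byte codes, per the question
    NUM_BUCKETS = 1 << 20  # assumption: ~1M buckets is only a few MB of counters

    def bucket_of(code):
        # Hash each 16-byte code into a fixed number of buckets.
        digest = hashlib.blake2b(code, digest_size=8).digest()
        return int.from_bytes(digest, "big") % NUM_BUCKETS

    def read_codes(path):
        with open(path, "rb") as f:
            while code := f.read(CODE_SIZE):
                yield code

    def heavy_codes(path, fraction=0.005):
        # Pass 1: per-bucket counts.  Any code at or above the threshold must
        # live in a bucket whose total is at or above the threshold.
        bucket_counts = [0] * NUM_BUCKETS
        total = 0
        for code in read_codes(path):
            bucket_counts[bucket_of(code)] += 1
            total += 1
        threshold = fraction * total
        heavy = {b for b, c in enumerate(bucket_counts) if c >= threshold}

        # Pass 2: exact counts, restricted to codes falling in heavy buckets.
        counts = Counter(c for c in read_codes(path) if bucket_of(c) in heavy)
        return [code for code, n in counts.items() if n >= threshold]

If too many codes fall into heavy buckets, the second pass can be split further, examining one heavy bucket (or group of buckets) per pass, which is where the multiple passes come in.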

Ray Tayek
Um, NO. The whole point of streaming/sketching algorithms is that you can't keep a histogram, because the data is so huge.
ShreevatsaR
S/he's talking about using multiple passes to hunt down intervals with a high count - my problem is just the number of passes that would entail.
Gwildore
feels like you should be able to build a histogram if your bucket (bin) size is large enough: http://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width
Ray Tayek
+1  A: 

That depends on how many different codes exist, and how much memory you have available.

My first idea would be to build a hash table of counters, with the codes as keys. Loop through the entire file, incrementing the counter for each code and keeping a count of the overall number of codes. Finally, filter out all keys whose counters exceed (* overall-counter 1/200).
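A minimal sketch of this approach in Python, assuming the distinct codes actually fit in memory (the final comprehension plays the role of the (* overall-counter 1/200) filter):

    from collections import Counter

    CODE_SIZE = 16  # 16-byte codes, per the question

    def frequent_codes(path, fraction=0.005):
        counts = Counter()
        total = 0
        with open(path, "rb") as f:
            # Single pass: count every code and the overall number of codes.
            while code := f.read(CODE_SIZE):
                counts[code] += 1
                total += 1
        # Keep the codes whose count reaches 0.5% of the total.
        return {code: c for code, c in counts.items() if c >= fraction * total}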

Svante
I don't have enough memory for this - every code could in theory be different.
Gwildore
+1  A: 

If the files consist solely of 16-byte codes, and you know how large each file is, you can calculate the number of codes in each file. Then you can find the 0.5% threshold and follow any of the other suggestions to count the occurrences of each code, recording each one whose frequency crosses the threshold.
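For example (a hypothetical calculation, taking 50 GB as 50 × 10^9 bytes, which the question doesn't specify):

    file_size = 50 * 10**9         # assumed: 50 GB as a decimal 50e9 bytes
    num_codes = file_size // 16    # 3,125,000,000 sixteen-byte codes
    threshold = num_codes * 0.005  # ~15.6 million occurrences to reach 0.5%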

Adam Liss
+1  A: 

Do the contents of each file represent a single data set, or is there an arbitrary cutoff between files? In the latter case, and assuming a fairly constant distribution of codes over time, you can make your life simpler by splitting each file into smaller, more manageable chunks. As a bonus, you'll get preliminary results faster and can pipeline them into the next process earlier.

Adam Liss
I could do it as data is gathered, but I do want to have the right answer for the whole 50 gig file.
Gwildore
+2  A: 

Sort chunks of the file in memory, as if you were performing an external sort. Rather than writing out all of the sorted codes in each chunk, however, you can just write each distinct code and the number of occurrences in that chunk. Finally, merge these summary records to find the number of occurrences of each code.

This process scales to any size data, and it only makes one pass over the input data. Multiple merge passes may be required, depending on how many summary files you want to open at once.


Sorting the file allows you to count the number of occurrences of each code using a fixed amount of memory, regardless of the input size.

You also know the total number of codes (either by dividing the input size by a fixed code size, or by counting the number of variable length codes during the sorting pass in a more general problem).

So, you know the proportion of the input associated with each code.

This is basically the pipeline sort * | uniq -c

If every code appears just once, that's no problem; you just need to be able to count them.
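Not erickson's code, but a minimal sketch of the chunk-summarize-merge idea in Python, assuming each chunk fits in memory and all summary files can be merged in one step (a real implementation would be limited by open file handles, hence the possible extra merge passes mentioned above):

    import heapq
    import tempfile
    from itertools import groupby

    CODE_SIZE = 16            # 16-byte codes, per the question
    CHUNK_CODES = 10_000_000  # assumption: ~10M codes (~160 MB) sorted in memory at a time

    def write_chunk_summaries(path):
        # One pass over the input: sort each chunk, then write one summary
        # record (code, count) per distinct code, in sorted order.
        summaries = []
        with open(path, "rb") as f:
            while chunk := f.read(CODE_SIZE * CHUNK_CODES):
                codes = sorted(chunk[i:i + CODE_SIZE]
                               for i in range(0, len(chunk), CODE_SIZE))
                tmp = tempfile.NamedTemporaryFile(delete=False)
                for code, run in groupby(codes):
                    tmp.write(code + sum(1 for _ in run).to_bytes(8, "big"))
                tmp.close()
                summaries.append(tmp.name)
        return summaries

    def read_summary(name):
        with open(name, "rb") as f:
            while rec := f.read(CODE_SIZE + 8):
                yield rec[:CODE_SIZE], int.from_bytes(rec[CODE_SIZE:], "big")

    def merged_counts(summaries):
        # Merge the sorted summaries; equal codes arrive adjacently, so the
        # running total for each code needs only O(1) memory.
        current, total = None, 0
        for code, count in heapq.merge(*(read_summary(s) for s in summaries)):
            if code != current:
                if current is not None:
                    yield current, total
                current, total = code, 0
            total += count
        if current is not None:
            yield current, total

Since the total number of codes is just the file size divided by 16, the 0.5% cutoff can be applied to the totals as they stream out of merged_counts.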

erickson
If every code occurs exactly once, then your merge step can't make any progress, can it?
Gwildore
+9  A: 

Metwally, Agrawal, and El Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams (2005). There were some other relevant papers I read for my work at Yahoo that I can't find now, but this looks like a good start.

Edit: Ah, see this Brian Hayes article. It sketches an exact algorithm due to Demaine et al., with references. It does it in one pass with very little memory, yielding a set of items including the frequent ones you're looking for, if they exist. Getting the exact counts takes a (now-tractable) second pass.

Darius Bacon
Interesting paper, but slightly different problem. I want an exact answer (which I think now can be done).
Gwildore
There was a paper with an exact answer that proved its method was in some sense optimal, but I'm blanking on the name; it's been a few years and I no longer work there.
Darius Bacon
This does give all candidates, so you can make a simple second pass, counting only the candidates.
Svante