Statistical Summarization of records

views:

206

answers:

+8 Q:

Statistical Summarization of records

Here is the real world issue that we are solving. We have some rather large data sets that need to be aggregated and summarized in real time with a number of filters and formulas applied to them. It works fine to apply these to each record in real time when the data set is less than 50,000 records but as we approach 100,000 and then 100+ million the overhead of doing realtime math against all of the records gets too great. We have spent a lot of time in SQL doing optimization and then supposing throwing entire datasets in ram and we still come to the conclusion that we need to “zoom out” from the data and summarize groups. We need a way to group like records together and then apply the math to a group of “like records”. This clumping of records allows us to be really fast and doing real time reporting. Our current solution groups record sets that are exactly the same.

Here is an example record

ID77968 1, 43:19.7, 43:19.7, TRUE, 1, 3, 0, 4, 1, 1, 1, 1, 1, 0, 0, 0, 3, 0.1 0, 0, 3, 14, 79,

So if we have 2 of these with exactly the same data

ID77968 1, 43:19.7, 43:19.7, TRUE, 1, 3, 0, 4, 1, 1, 1, 1, 1, 0, 0, 0, 3, 0.1 0, 0, 3, 14, 79,
ID77969 1, 43:19.7, 43:19.7, TRUE, 1, 3, 0, 4, 1, 1, 1, 1, 1, 0, 0, 0, 3, 0.1 0, 0, 3, 14, 79,

then we create a group. Then we can apply the math and logic to a single group and then multiply the outcome by 2 to get the real answer. This works really well for us and can be super helpful in getting around the scale issues of items. That said we now have a new problem. Some of our values have larger ranges of outcomes which is creating datasets of thousands of records with only a couple of them being the exact same. After some brainstorming we came up with an idea of applying some “fuzzy” logic to group things together that are similar. The issue that we have now is that we don’t know the best statistically sound way of going about reducing the record set into groups that aren’t exact the same.

Here is what we need to do. (Simple example, single column)

Suppose we have the following numbers 20 numbers

106
0
8
0
1
0
4
0
3474
0
204
0
75
0
128
0
617
0
20
0

In the above set we have a lot of 0’s so these are easy to group together. But how do I form let’s say 3 more groups. I suppose on the outer bound we have 3474 but given the weighting below that number the outbound group might be something like 2000 and then values 3474 and 617 would be combined to a single group. Our team meeting thought of this as a gravity problem or better known cheerio attraction. Ideally we would find a equation or approach that would let us look at the entire record set and then say..express this in X number of groups. This would allow us to vary the grouping/clumping of the data. So suppose we use the example 20 numbers above and want to express this in 15 groups vs 8 groups we would be able to do this. Now remember that in the example above this is just a single column, but I am trying to group entire records like

ID77968 1, 43:19.7, 43:19.7, TRUE, 1, 3, 0, 4, 1, 1, 1, 1, 1, 0, 0, 0, 3, 0.1 0, 0, 3, 14, 79,

ID77969 1, 43:19.4, 43:19.7, TRUE, 1.2, 3.2, 0, 3, 2, 1, 1, 1, 1, 0, 0, 0, 3, 0.1 0, 0, 3, 14, 179,

Thanks for the help in advance

here is an update based on some of the comments, questions and answers

We currently hash each record as it comes in and then if a record has the same hash, we group it. the issue with hash here is that if it istn exactly the same then, it wont be grouped. This has worked for us for a while becuase our values in each column as relatively bounded. We now have introduced some values that have far greater ranges which has rendered our hash grouping ineffective. Before we were able to take 100mm records and hash them to ajust over 100k groups but now we are seeing neew data in sets that are just 70k with all 70k being unique.Deidentified data here http://bit.ly/c7xQ3N

We need to know more about the algorithms you are applying to the data, because it might be possible to compute something by summing in the new data continually (this might be just a rewording of Eric D.'s comment)

Otherwise, consider running your algorithm on the last n days or months of records, then charting the result over time.

rleir 2010-01-31 10:47:35

John 2010-01-31 22:32:47

Each row is essentially a vector in n-space, and "similar" rows are vectors whose distance is below a given threshold. One way I've tackled something similarly in the past, but on a much smaller scale, has been to use a recursive hash algorithm.

You start with a hashtable that represents the entire range of vectors. You add vectors to the table one at a time, and once you reach a given limit L of vectors per table, the table divides itself into P disjoint subsets that span the vector range. New vectors added to the top-level table are routed to the proper sub-table, and the algorithm recurses. At a specified range, a hashtable will stop dividing and group all of its child vectors into a single vector.

When you've processed the entire database, your end result is a tree of hashtables with areas of increased depth at the denser areas of your dataset. You can decide what groups are needed based on the number of dense areas, giving outliers their own groups and placing dense sets into the same group.

I'm not sure how feasible this would be with the number of rows you're processing, but it seems any solution would require a significant amount of computation.

Jake 2010-01-31 11:07:21

John 2010-01-31 22:33:31

+1 A:

I agree that we need more information to give you the best advice.

If you need real-time analysis of large amounts of incoming data, you may want to look at streaming algorithms in general. Many streaming algorithms do not actually attempt to store all of the incoming data, but rather use each new datapoint to update a live model that you care about. Such algorithms are very useful in cases such as network traffic analysis where high-level trends are more important than each individual packet.

One example would be online k-means which does not actually store each new data point and then run k-means on the entire dataset, but rather updates the current set of means using the new datapoint and then discards it.

Another thing to consider would be some type of lossy compression of related records like vector quantization (VQ) (there are many different flavors of VQ, both supervised (LVQ) and unsupervised). This would allow you to replace "similar" records with a prototype record representative of the group.

awesomo 2010-02-01 05:55:39

ansaurus

tags:

views:

answers:

Statistical Summarization of records

here is an update based on some of the comments, questions and answers

related questions