I have a dataset whose columns look like this:

Consumer ID | Product ID | Time Period | Product Score
1           | 1          | 1           | 2
2           | 1          | 2           | 3

and so on.

As part of a program (written in C) I need to process the product scores given by all consumers for each combination of product id and time period. Suppose there are 3 products and 2 time periods; then I need to process the product scores for all six combinations, as shown below:

Product ID | Time Period 
1          | 1
1          | 2
2          | 1
2          | 2
3          | 1
3          | 2

I will need to process the data along the above lines many times (> 10k), and the dataset is fairly large (e.g., 48k consumers, 100 products, 24 time periods), so speed is an issue.

I came up with two ways to process the data and am wondering which is the faster approach, or whether it does not matter much (speed matters, but not at the cost of undue maintenance/readability):

  1. Sort the data on product id and time period and then loop through the sorted data to extract the rows for each combination (a rough sketch of this appears after the list).

  2. Store the consumer ids of all consumers who provided product scores for a particular combination of product id and time period and process the data accordingly.
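
Roughly, I imagine option 1 looking like this (just a sketch; the struct and field names are placeholders):

#include <stdlib.h>

typedef struct {
 int consumer;
 int product;
 int period;
 int score;
} row_t;

/* Order rows by product id, then time period, so that each
   (product, period) combination ends up as a contiguous run. */
int cmp_rows(const void *a, const void *b)
{
 const row_t *x = a, *y = b;
 if (x->product != y->product)
     return (x->product > y->product) - (x->product < y->product);
 return (x->period > y->period) - (x->period < y->period);
}

/* After qsort(rows, n, sizeof rows[0], cmp_rows), one linear pass
   over rows extracts the block of scores for each combination. */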

Any thoughts? Any other way to speed up the processing? Thanks

+3  A: 

As with many performance-related questions, the only real, definitive answer is to write a benchmark. Speed will depend on many things, and it doesn't sound like you're talking about a straightforward case of a linear algorithm vs a quadratic algorithm (and even that would have an additional dependency on input size).

So implement both methods, run them on sample data, and time the results. This will be much faster and more conclusive than us trying to work it out in our heads with limited information.
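
For example, a minimal timing harness might look like this (a sketch; method_one/method_two are stand-ins for your two implementations, and it assumes POSIX clock_gettime is available):

#include <stdio.h>
#include <time.h>

/* Stubs -- replace with the real implementations of the two methods. */
static void method_one(void) { /* sort, then scan */ }
static void method_two(void) { /* per-combination consumer lists */ }

static double time_it(void (*fn)(void))
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("method one: %.3f s\n", time_it(method_one));
    printf("method two: %.3f s\n", time_it(method_two));
    return 0;
}

Run each method on the same sample data, and run it more than once so warm-up and cache effects don't skew the numbers.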

danben
Does the downvoter care to comment?
danben
Sorry, I addressed it in my answer and not as a comment.
Jeremy Roberts
that's one way to go, but I was hoping someone would be able to provide some insight!
Anon
You can have guesses, or you can know for sure.
danben
A: 

I would suggest filtering the data as in option two, then processing it as in option one. If the performance is unacceptable, tune it: set some benchmarks for your baseline, then try different approaches.

In most real-world situations, I would advise against implementing multiple methods simply for the sake of benchmarking. The performance is likely to be similar. If it's not similar, one of them is probably performing poorly and will be in obvious need of tuning. Your time is probably better spent implementing other features.

Jeremy Roberts
A: 

That would make for a smallish database table. It's about 0.4GB of data if the full matrix of consumers/products/time periods exists. Have you considered running the whole thing in SQL? Even if you don't use a full-blown database, for that size of data it would be practical to generate a full table for each of the possible sort orders and dump each to a file. You could then load whichever file you need and walk it in whatever order you choose.
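
For instance, with SQLite's C API the per-combination query might look like this (a sketch only; the ratings table name and column names are my assumption, not from the question):

#include <sqlite3.h>

/* Fetch all scores for one (product, time period) combination. */
void process_combination(sqlite3 *db, int product, int period)
{
    sqlite3_stmt *stmt;
    const char *sql =
        "SELECT consumer_id, score FROM ratings "
        "WHERE product_id = ? AND time_period = ?";

    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK)
        return;
    sqlite3_bind_int(stmt, 1, product);
    sqlite3_bind_int(stmt, 2, period);
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        int consumer = sqlite3_column_int(stmt, 0);
        int score    = sqlite3_column_int(stmt, 1);
        /* ... process (consumer, score) here ... */
    }
    sqlite3_finalize(stmt);
}

An index on (product_id, time_period) turns each such query into a cheap range scan.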

Also, if you can run the full 10K passes in parallel, or at least a few dozen per pass, you may come out ahead, as doing so could drastically reduce your IO waits and/or cache misses.

BCS
Isn't SQL overkill for this type of problem, especially if I can load all the data into memory?
Anon
If you have a scratch/test/sandbox server around (or even a SQLite tool), not at all. There is no such thing as overkill, just things that aren't worth the effort. In this case, if you can do it all in SQL, that will save you all the data-handling work. If you can do the SQL IO more easily than with a roll-your-own solution, you're still ahead.
BCS
A: 

Actually the two methods look very similar to me. In order to store the customer ids of all customers who provided a score for a specific combination, you need to sort the data or perform an even more expensive operation.

Can you exchange space for time? If yes, then do no pre-processing whatsoever, but create an array of all combinations (100x24) to store the scores. Process the data as it comes and update the score of the specific combination. If you need the average score, store both the sum and the number of customers that provided a score, as in the sketch below.
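
Something along these lines (a sketch; the sizes come from the numbers in the question):

#define NUM_PRODUCTS 100
#define NUM_PERIODS   24

typedef struct {
    long sum;   /* running total of scores for this combination */
    int  count; /* number of customers who provided a score */
} acc_t;

acc_t acc[NUM_PRODUCTS][NUM_PERIODS];  /* zero-initialized as a global */

/* Call once per input row; ids assumed 1-based as in the question. */
void add_score(int product, int period, int score)
{
    acc[product - 1][period - 1].sum   += score;
    acc[product - 1][period - 1].count += 1;
}

/* Average for a combination: (double)acc[p][t].sum / acc[p][t].count */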

kgiannakakis
I do not have scores coming in across time. The data does not arrive in real-time.
Anon
A: 

The slowest part that you have any influence over would probably be copying chunks of memory. So the first technique to apply would be to place the values for each row in a structure and refer to it by pointer only until all your processing is complete. The structures would be something like:

typedef struct {
 int consumer;
 int product;
 int time;
 int score;
} rowData;

Building upon that, I would think you would do best to loop through the input rows and build up a binary tree (or other sorted structure) of nodes identified by product and time period, each containing a lookup table of pointers to all matching rowData:

typedef struct {
 int product;
 int time;
 rowData **matches;  /* dynamically grown array of pointers to matching rows */
 size_t nMatches;    /* number of entries in matches */
} matchLut;

Once all the rows have been placed in lookup tables on the tree, each bundle can be processed.

youngthing
The memory allocation of flexible arrays should be handled intelligently, though I have made no mention of how this may be done!
youngthing
yeah I have something similar in mind...
Anon
A: 

If memory permits, store your data in a 2D array (really 3D, but I'll get to that later). This array will be indexed by (product_id, time_period).

If your processing of the data permits it, each element of the 2D array could be an accumulator of the new data: you read in a data element and then adjust the corresponding element of the 2D array to reflect it. If this method works, your data will be processed by the time you finish reading it in.

If your processing requires the data from every element to be present at one time, then you could make each element of your 2D array a list (this is the 3rd D). It could be a variable-length list if you don't know how many customer entries will be present for each (product_id, time_period). After you have read in your data you will need to revisit each element of the 2D array to process each list. How you arrange your array and how you visit the elements will matter for performance. You will probably want to declare this dynamically, but for this example:

struct element_t {
    int   *scores;  /* one possible per-cell layout: a variable-length list of scores */
    size_t count;   /* number of scores collected for this (product, period) */
};

struct element_t element[NUMBER_OF_PRODUCTS][NUMBER_OF_TIME_PERIODS];
// don't forget to initialize these elements to empty
...
for (p = max_product_id; p >= 0; p--) {
    for (t = max_time_period; t >= 0; t--) {
         process(&element[p][t]);
    }
}

This will work better if you want to process each product (for all time periods) before moving to the next. You can swap the declaration's row/column order and the loops to get better cache hits if you instead want to process each time period (for all products) before moving on.

You should note that this does the sorting for you without saying "sort this data".
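
Filling the array as the rows are read in might look like this (a sketch that builds on the struct above; it doubles each per-cell buffer as needed):

#include <stdlib.h>

/* Append one score to the list for its (product, period) cell. */
void append_score(struct element_t *e, int score)
{
    /* Grow when count is 0 or a power of two (capacity exhausted). */
    if ((e->count & (e->count - 1)) == 0) {
        size_t cap = e->count ? e->count * 2 : 1;
        int *p = realloc(e->scores, cap * sizeof *p);
        if (!p) abort();
        e->scores = p;
    }
    e->scores[e->count++] = score;
}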

If memory does not permit, then you will probably want to write parts of your data out to files as you read it in. This has the same array/loop-organization and cache-hit issues as above, but magnified many times over. At the end of reading in your main data you would want to be able to process all the data from a particular temp file (perhaps containing all data for a given product, or alternatively for a given time period) before moving to the next.

The main difficulty is that while reading in the data you very likely will not be able to have every temp file open at the same time. This might require you to come up with a way of swapping open files (the same idea as memory swapping, except that what you are swapping is open files rather than memory pages). That is a whole other problem, though.

nategoose
memory is not an issue. Your suggestion is what I had in mind for option 2 in my question. Thanks
Anon
A: 

I suggest you reorder your data according to the most frequently accessed combinations. The most frequently accessed data should be the easiest and fastest to access.

Also, take a look at database normalization. This is a way of organizing data with the least amount of duplication, which also makes accessing it more efficient.

Another idea is to use indices for the less common data searches.

Thomas Matthews
I need to access all combinations equally often. I am aware of data normalization, but using SQL or some such seems to be overkill here.
Anon
A database may be overkill, but there may be some benefit to data normalization in the program.
Thomas Matthews