views:

209

answers:

2

Assume infinite storage, so that physical size (gigabytes or terabytes) does not matter, only the number of elements and their labels. Statistically, a pattern should already emerge at 30 subsets, but would you agree that fewer than 1,000 subsets is too few to test with, and that at least 10,000 distinct subsets ("elements", "entries", entities) counts as "a large data set"? Or should it be larger? Thanks

+3  A: 

I'm not sure I understand your question, but it sounds like you are asking how many elements of a data set you need to sample in order to ensure a certain degree of accuracy (30 is a rule-of-thumb number associated with the Central Limit Theorem that comes into play frequently).

If that is the case, the sample size you need depends on the confidence level and the confidence interval (margin of error). If you want a 95% confidence level and a confidence interval of ±5 percentage points (i.e. you want to be 95% confident that the proportion you determine from your sample is within 5 percentage points of the proportion in the full data set), you end up needing a sample size of no more than 385 elements, however large the full data set is. The greater the confidence level and the smaller the confidence interval that you want, the larger the sample size you need.
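To show where the 385 comes from, here is a minimal sketch of the usual sample-size formula for a proportion, n = z² · p(1 − p) / e². The function name `sample_size` and its parameters are my own for illustration, and the worst-case proportion p = 0.5 is an assumption (it gives the largest n). The optional `population` argument applies the standard finite population correction, which only lowers the result.

```python
import math
from statistics import NormalDist


def sample_size(confidence=0.95, margin=0.05, proportion=0.5, population=None):
    """Sample size needed to estimate a proportion.

    confidence: confidence level, e.g. 0.95
    margin:     half-width of the confidence interval, e.g. 0.05 for +/- 5 points
    proportion: assumed true proportion; 0.5 is the worst case
    population: size of the full data set, or None to treat it as unlimited
    """
    # Two-sided z-score for the confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * proportion * (1 - proportion) / margin ** 2
    if population is not None:
        # Finite population correction: a small data set needs fewer samples.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    print(sample_size())                   # 385
    print(sample_size(population=10000))   # 370
```

So for the 10,000-entry case in the question, the required sample is slightly below the 385 ceiling, and a data set of millions of entries still needs only 385 at these settings.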

Here is a nice discussion on the mathematics of determining sample size and a handy sample size calculator if you just want to run the numbers.

Justin Cave
A: 

Thank you for the useful answers. My specific context is mostly measuring response time for the most common operations, i.e. fetching a subset or sorting a subset, across the operating system, data storage, transport and presentation layers. For instance, I want to understand why an installation is very fast when new but responds "significantly" slower after some time and data accumulation. A clean installation with a database of 100 text rows is much faster than the same installation with 10,000 text rows, and I want to determine whether that is mostly due to the number of elements, disk fragmentation/disk use, or some other variable. I have accumulated 8,000 database articles and the whole service is rather slower than in mint condition, which probably also has to do with using half the RAM, switching app server, and moving from dedicated to virtual hosting for security purposes, which have to be balanced against performance.
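One way to isolate the effect of row count from the other variables is to time the same query on otherwise identical databases of different sizes. Below is a rough sketch of that idea, not a measurement of my real system: it assumes an in-memory SQLite database (so disk fragmentation plays no part), and the table name `articles` and the helper names are made up for the example.

```python
import sqlite3
import statistics
import time


def build_db(n_rows):
    """Create an in-memory database with n_rows text rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO articles (body) VALUES (?)",
        ((f"article text {i * 7919 % n_rows}",) for i in range(n_rows)),
    )
    conn.commit()
    return conn


def median_query_seconds(conn, repeats=20):
    """Median wall-clock time to fetch and sort every row."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute("SELECT body FROM articles ORDER BY body").fetchall()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def compare_sizes(sizes=(100, 1000, 10000)):
    """Run the same query against databases that differ only in row count."""
    results = {}
    for n in sizes:
        conn = build_db(n)
        try:
            results[n] = median_query_seconds(conn)
        finally:
            conn.close()
    return results


if __name__ == "__main__":
    for n, seconds in compare_sizes().items():
        print(f"{n:>6} rows: {seconds * 1e6:.1f} microseconds (median)")
```

If the slowdown on the real installation is much larger than what the row count alone produces here, the remaining difference would point at the other changes (less RAM, the app server, virtual hosting, disk use).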

Larsson

LarsOn
You should edit the original question if you have additional things to say, and comment if you want to thank others. Thanks!
phihag