ansaurus

Question

When calculating trends, how do you account for low sample size?

Answer 1

+3 A:

This is really a statistics question. I'm not a statistician, but I suspect the answer is along the lines of well, you have no data — what do you expect‽

Perhaps you could merge Dubbo with a nearby region? You've sliced your data small enough that your signal has fallen below noise.

You could also just not show Dubbo, or make a color for not enough data.

derobert 2009-09-24 06:53:08

+1 for the interrobang

nickf 2009-09-24 06:57:32

Answer 2

A:

With a heat map you are generally attempting to show easily assimilated information. Anything too complex would probably be counter-productive.

In the case of Dubbo, the reality is that you don't have the data to draw any firm conclusions about it, so I'd color it white, say. You could possibly label it with the difference/current value too.

I think this would be preferable to possibly misleading the users.

dommer 2009-09-24 06:53:48

Answer 3

+1 A:

I kinda like your transparency idea -- the data you're confident about is opaque and the data you're not confident about is transparent. It's easy for the user to understand, but it will look cluttered.

My take: don't use a heat map. It's for continuous data, while yours is discrete. Use dots: color represents the increase/decrease in the surrounding region, and the size of the dot is proportional to the raw volume.

Now, how does the user know which region a dot represents? Where does South Sydney turn into North Sydney? The best approach would be to add Voronoi-like guiding lines between the dots, but smartly placed rectangles will do too.

Marcin 2009-09-24 07:09:49

I actually have the KML data for each zone, so I can accurately map the borders of each zone... or are you suggesting to ignore that and use something different?

nickf 2009-09-24 11:06:51

Answer 4

+1 A:

If you happen to have the area of each region in units such as sq. km, you can normalize your data by calculating home approvals/km^2 to get home approval density, and use that in your equation rather than the raw count of home approvals. This fixes the problem if Dubbo has fewer home approvals than other regions simply because of its size. You could also normalize by population, if you have that, to get the number of home approvals per person.
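
A minimal sketch of that normalization, in Python. The region names, counts, and areas below are made up for illustration; they are not the figures from the question.

```python
def approvals_per_unit(approvals, denominator):
    """Divide a raw approval count by an area (km^2) or a population."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return approvals / denominator

# Hypothetical figures, for illustration only.
regions = {
    "Region A": {"approvals": 400, "area_km2": 50.0},
    "Region B": {"approvals": 4, "area_km2": 2.0},
}

# Approvals per square kilometre for each region.
density = {
    name: approvals_per_unit(r["approvals"], r["area_km2"])
    for name, r in regions.items()
}
```

Passing a population instead of an area gives approvals per person. Note that this only corrects for size; a region with very few approvals is still noisy after normalization.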

Andrew 2009-09-24 07:23:06

Answer 5

+1 A:

Maybe you could use the totals. Adding up all the old and all the new values gives old=595, new=676, diff=+13.6%. Then calculate each change based on the old total, which gives you +17.3% / -4.0% / +0.3% for the three places.
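
A small sketch of that calculation, assuming the counts are held in two dicts keyed by region. The counts here are hypothetical and chosen so the arithmetic is easy to check by hand; they are not the values from the question.

```python
def change_vs_old_total(old, new):
    """Each region's change as a percentage of the combined old total."""
    total_old = sum(old.values())
    if total_old == 0:
        raise ValueError("old total must be non-zero")
    return {k: 100.0 * (new[k] - old[k]) / total_old for k in old}

# Hypothetical counts, for illustration only.
old = {"A": 150, "B": 48, "C": 2}
new = {"A": 170, "B": 44, "C": 6}

changes = change_vs_old_total(old, new)
# A: (170 - 150) / 200 = +10%, B: -2%, C: +2%
```

Measured against its own old value, region C would show +200%; measured against the old total it shows +2%, so a tiny region can no longer dominate the map.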

2009-09-24 08:29:36

Answer 6

+2 A:

Look into measures of statistical significance. It could be as simple as assuming counting statistics.

In a very simple-minded version, the thing you plot is

 (A_2 - A_1)/sqrt(A_2 + A_1)

i.e. change over 1 sigma in simple counting statistics.

Which makes the above chart look like:

Area    Reduced difference
--------------------------
S.S.    +3.3  
N.S.    -1.3  
D.      +1.0

which is interpreted as meaning that South Sydney has experienced a significant (i.e. important, and possibly related to a real underlying cause) increase, while North Sydney and Dubbo saw relatively minor changes that may or may not point to a trend. Rule of thumb:

1 sigma changes are just noise
3 sigma changes probably point to an underlying cause (and therefore the expectation of a trend)
5 sigma changes almost certainly point to a trend

Areas with very low rates (like Dubbo) will still be volatile, but they won't overwhelm the display.
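
The formula above can be sketched in Python as follows. The function names and the wording of the labels are my own, and the handling of a region with zero counts in both periods is an assumption (it is treated as no change).

```python
import math

def reduced_difference(a1, a2):
    """(A_2 - A_1) / sqrt(A_2 + A_1): the change in units of one sigma,
    under simple counting statistics."""
    total = a1 + a2
    if total == 0:
        return 0.0  # assumption: no counts in either period means no change
    return (a2 - a1) / math.sqrt(total)

def rule_of_thumb(z):
    """Map a reduced difference onto the rule of thumb above."""
    size = abs(z)
    if size >= 5:
        return "almost certainly a trend"
    if size >= 3:
        return "probably an underlying cause"
    return "noise, or too small to tell"

# Hypothetical counts: a region going from 1 to 3 approvals.
z_small = reduced_difference(1, 3)   # 2 / sqrt(4)
```

A jump from 1 to 3 is a 200% increase but only a 1 sigma change, so it lands in the lowest band instead of dominating the map.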

dmckee 2009-09-24 13:27:16

Answer 7

A:

I would highly recommend going with a hierarchical model (i.e., partial pooling). Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill is an excellent resource on the topic.
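
To give a feel for what partial pooling does, here is a deliberately crude sketch. It is not the multilevel regression described in the book: it just pulls each region's new/old ratio toward the ratio for all regions combined, and pulls harder when the region has little data. The `prior_strength` parameter and the counts are hypothetical.

```python
def shrunk_growth(old, new, prior_strength=20.0):
    """Shrink each region's new/old ratio toward the pooled ratio.

    Acts as if every region had `prior_strength` extra old counts that
    grew at the pooled rate. Small regions end up close to the pooled
    ratio; large regions barely move.
    """
    pooled = sum(new.values()) / sum(old.values())
    return {
        k: (new[k] + prior_strength * pooled) / (old[k] + prior_strength)
        for k in old
    }

# Hypothetical counts, for illustration only.
old = {"big": 400, "small": 1}
new = {"big": 440, "small": 3}
ratios = shrunk_growth(old, new)
```

Here the raw ratio for "small" is 3.0, but the shrunk estimate is about 1.2, while "big" stays at about 1.1. A real hierarchical model would estimate the amount of shrinkage from the data instead of fixing it by hand.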

Jonathan Chang 2009-09-24 17:31:30

Answer 8

A:

You can use an exact test like Fisher's exact test http://en.wikipedia.org/wiki/Fisher%27s%5Fexact%5Ftest , or use Student's t-test http://en.wikipedia.org/wiki/Student%27s%5Ft-test , both of which are designed for low sample sizes.

As a note, the t-test is pretty much the same as a z-test, but in the t-test you don't have to know the population standard deviation; you estimate it from the sample, and the t distribution accounts for the extra uncertainty that introduces.

You can apply a z- or t-test without much justification in most cases because of the Central Limit Theorem http://en.wikipedia.org/wiki/Central%5Flimit%5Ftheorem (formally you only need the underlying distribution X to have finite variance). You don't need much justification for the Fisher test either: it's exact and does not rely on a large-sample approximation.
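
For reference, a two-sided Fisher's exact test on a 2x2 table fits in a few lines of standard-library Python. This is a sketch, and the way the table is set up for this problem (one region against all the others, old period against new) is my own assumption about how you'd apply it.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so ties with p_obs are counted despite rounding.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Classic small example: table [[3, 1], [1, 3]].
p = fisher_exact_two_sided(3, 1, 1, 3)   # 34/70
```

For the map, a would be one region's old count, b its new count, and c and d the old and new counts of all other regions combined. A large p-value means that region's change can't be told apart from what the rest of the map is doing.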

ldog 2009-09-25 17:11:00
