I'm doing some work processing some statistics for home approvals in a given month. I'd like to be able to show trends - that is, which areas have seen a large relative increase or decrease since the last month(s).

My first naive approach was to just calculate the percentage change between two months, but that has problems when the counts are very low, because any change at all is magnified:

// diff = (new - old) / old
     Area      |  June  |  July  |  Diff  |
 --------------|--------|--------|--------|
 South Sydney  |   427  |   530  |  +24%  |
 North Sydney  |   167  |   143  |  -14%  |
 Dubbo         |     1  |     3  | +200%  |
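For reference, the naive calculation behind that table can be sketched in Python roughly as follows. The function name `percent_change` and the dictionary layout are only illustrative choices, not part of any existing code:

```python
# Monthly approval counts from the table above: area -> (June, July).
approvals = {
    "South Sydney": (427, 530),
    "North Sydney": (167, 143),
    "Dubbo": (1, 3),
}


def percent_change(old, new):
    """Relative change from old to new, as a percentage."""
    if old == 0:
        # The ratio is undefined when there were no approvals last month.
        raise ValueError("percentage change is undefined when old is 0")
    return 100.0 * (new - old) / old


for area, (june, july) in approvals.items():
    print(f"{area:<13} {percent_change(june, july):+.0f}%")
```

The division by `old` is what makes tiny counts such as Dubbo's dominate the result.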

I don't want to just ignore any area or value as an outlier, but I don't want Dubbo's increase of 2 per month to outshine the increase of 103 in South Sydney. Is there a better equation I could use to show more useful trend information?

This data is eventually being plotted on Google Maps. In this first attempt, I'm just converting the difference to a "heatmap colour" (blue = decrease, green = no change, red = increase). Perhaps using some other metric to alter the view of each area might be a solution: for example, changing the alpha channel based on the total number of approvals. In that case, Dubbo would be bright red but quite transparent, whereas South Sydney would be closer to yellow but quite opaque.
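A minimal sketch of that colour-plus-alpha idea, in Python. Everything here is an assumption made for illustration: the function name `heat_colour`, clamping the change at plus or minus 100%, the linear alpha scale, and the minimum alpha of 0.15 so that small areas stay visible:

```python
import colorsys


def heat_colour(pct_change, volume, max_volume, clamp_pct=100.0, min_alpha=0.15):
    """Return (r, g, b, alpha) for one area.

    Colour encodes the percentage change, alpha encodes the volume of
    approvals relative to the busiest area. All scaling choices are
    illustrative assumptions.
    """
    # Clamp the change to [-clamp_pct, +clamp_pct] and rescale to t in [0, 1].
    clamped = max(-clamp_pct, min(clamp_pct, pct_change))
    t = (clamped + clamp_pct) / (2.0 * clamp_pct)
    # Hue runs from blue (2/3) through green (1/3) to red (0).
    hue = (1.0 - t) * 2.0 / 3.0
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    # Low-volume areas become more transparent, but never fully invisible.
    alpha = max(min_alpha, min(1.0, volume / max_volume))
    return (round(r * 255), round(g * 255), round(b * 255), round(alpha, 2))


# Dubbo: a huge relative change, but only 4 approvals across both months.
print(heat_colour(200.0, 4, 957))
```

With these assumed settings Dubbo comes out fully red but at the minimum alpha, while a high-volume area keeps full opacity.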

Any ideas on the best way to show this data?

+3  A: 

This is really a statistics question. I'm not a statistician, but I suspect the answer is along the lines of well, you have no data — what do you expect‽

Perhaps you could merge Dubbo with a nearby region? You've sliced your data small enough that your signal has fallen below noise.

You could also just not show Dubbo, or make a color for not enough data.

derobert
+1 for the interrobang
nickf
A: 

With a heat map you are generally attempting to show easily assimilated information. Anything too complex would probably be counter-productive.

In the case of Dubbo, the reality is that you don't have the data to draw any firm conclusions about it, so I'd color it white, say. You could possibly label it with the difference/current value too.

I think this would be preferable to possibly misleading the users.

dommer
+1  A: 

I kinda like your transparency idea -- the data you're confident about is opaque and the data you're not confident about is transparent. It's easy for the user to understand, but it will look cluttered.

My take: Don't use a heatmap. It's for continuous data, while yours is discrete. Use dots. Colour represents the increase/decrease in the surrounding region, and the size of the dot is proportional to the raw volume.

Now, how does the user know which region a dot represents? Where does South Sydney turn into North Sydney? The best approach would be to add Voronoi-like guiding lines between the dots, but smartly placed rectangles will do too.

Marcin
I actually have the KML data for each zone, so I can accurately map the borders of each zone... or are you suggesting to ignore that and use something different?
nickf
+1  A: 

If you happen to have the area of each region in units such as sq. km, you can normalize your data by calculating home approvals/km^2 to get the home approval density, and use that in your equation rather than the raw count of home approvals. This would fix the problem if Dubbo has fewer home approvals than other regions because of its size. You could also normalize by population, if you have that, to get the number of home approvals per person.

Andrew
+1  A: 

Maybe you could use the totals. Add up all the old and new values, which gives old=595, new=676, diff=+13.6%. Then calculate each change based on the old total, which gives you +17.3% / -4.0% / +0.3% for the three places.
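The totals approach described above could be sketched like this in Python; the names `approvals` and `change_vs_total` are made up for the example:

```python
# Monthly approval counts from the question: area -> (June, July).
approvals = {
    "South Sydney": (427, 530),
    "North Sydney": (167, 143),
    "Dubbo": (1, 3),
}

old_total = sum(old for old, _ in approvals.values())
new_total = sum(new for _, new in approvals.values())


def change_vs_total(old, new, total):
    """Change in one area as a percentage of the overall old total."""
    return 100.0 * (new - old) / total


print(f"overall: {change_vs_total(old_total, new_total, old_total):+.1f}%")
for area, (old, new) in approvals.items():
    print(f"{area:<13} {change_vs_total(old, new, old_total):+.1f}%")
```

Because every area is divided by the same denominator, this is just the absolute change rescaled, so small areas can never dominate.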

+2  A: 

Look into measures of statistical significance. It could be as simple as assuming counting statistics.

In a very simple-minded version, the thing you plot is

 (A_2 - A_1)/sqrt(A_2 + A_1)

i.e. change over 1 sigma in simple counting statistics.

Which makes the above chart look like:

Area    Reduced difference
--------------------------
S.S.    +3.3  
N.S.    -1.4  
D.      +1.0

which is interpreted as meaning that South Sydney has experienced a significant (i.e. important, and possibly related to a real underlying cause) increase, while North Sydney and Dubbo saw relatively minor changes that may or may not point to a trend. Rule of thumb:

  • 1 sigma changes are just noise
  • 3 sigma changes probably point to an underlying cause (and therefore the expectation of a trend)
  • 5 sigma changes almost certainly point to a trend

Areas with very low rates (like Dubbo) will still be volatile, but they won't overwhelm the display.
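A small Python sketch of the formula above, assuming simple counting (Poisson) statistics. The function name `reduced_difference` and the label thresholds are illustrative; the labels just mirror the rule of thumb:

```python
import math


def reduced_difference(a1, a2):
    """Change from a1 to a2 in units of one sigma: (a2 - a1) / sqrt(a2 + a1)."""
    total = a1 + a2
    if total == 0:
        # No approvals in either month: nothing to compare.
        return 0.0
    return (a2 - a1) / math.sqrt(total)


def describe(sigmas):
    """Rough label following the 1 / 3 / 5 sigma rule of thumb."""
    size = abs(sigmas)
    if size >= 5:
        return "almost certainly a trend"
    if size >= 3:
        return "probably an underlying cause"
    return "may be just noise"


for area, (june, july) in {
    "South Sydney": (427, 530),
    "North Sydney": (167, 143),
    "Dubbo": (1, 3),
}.items():
    s = reduced_difference(june, july)
    print(f"{area:<13} {s:+.1f}  {describe(s)}")
```

The value can be fed straight into a colour scale in place of the percentage change, clamped at something like plus or minus 5.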

dmckee
A: 

I would highly recommend going with a hierarchical model (i.e., partial pooling). Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill is an excellent resource on the topic.

Jonathan Chang
A: 

You can use an exact test like Fisher's exact test http://en.wikipedia.org/wiki/Fisher%27s%5Fexact%5Ftest , or use Student's t-test http://en.wikipedia.org/wiki/Student%27s%5Ft-test , both of which are designed for small sample sizes.

As a note, the t-test is pretty much the same as a z-test but in the t-test you don't have to know the standard deviation nor do you have to approximate it like you would if you did a z-test.

You can apply a z or t test without any justification in 99.99% of cases because of the Central Limit Theorem http://en.wikipedia.org/wiki/Central%5Flimit%5Ftheorem (formally you only need that the underlying distribution X has finite variance). You don't need justification for the Fisher test either; it's exact and does not make any assumptions.
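To make the idea of an exact test concrete for two monthly counts, here is a rough pure-Python sketch. Note that this is not Fisher's exact test itself (which works on a 2x2 table) but a related exact binomial test: assuming the counts behave like Poisson counts with the same underlying rate in both months, the second month's count given the two-month total is Binomial(n, 1/2). The function name is made up for the example:

```python
from math import comb


def exact_two_month_p_value(a1, a2):
    """Two-sided exact p-value for 'both months have the same rate'.

    Conditional on n = a1 + a2, the null hypothesis says a2 follows
    Binomial(n, 1/2). This is an illustrative substitute, not Fisher's test.
    """
    n = a1 + a2
    if n == 0:
        return 1.0
    observed = comb(n, a2)
    # Add up every outcome that is no more likely than the observed one.
    extreme = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return extreme / 2 ** n


print(exact_two_month_p_value(1, 3))  # Dubbo: 0.625
```

For Dubbo this gives 10/16 = 0.625, so a jump from 1 to 3 is entirely consistent with no change in the underlying rate.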

ldog