views:

385

answers:

2

I am trying to visualize a large multi-dimensional dataset. Each dimension has a different range. Values in one column may range between 0-100 while values in another could range from a few hundred thousand to a few hundred million. Therefore it is really hard to show a graph with a reasonable scale. I would like to visualize them using a parallel coordinate chart or series of point charts.

I have come up with three different approaches; any outlier outside of the standard deviation will be removed, any outlier outside of the standard deviation will show up at the edge of the standard deviation, or I will adjust the scale of each dimension so that all are scaled to be seen at once.

Does anyone have any recommendations regarding which method would retain the integrity of the data? Also, does anyone have good examples of outlier removal algorithms?

+4  A: 

Outliers are usually removed for analysis (fitting a model), not for visualization, where you can just change your focus so the outliers are out of view.

The plots you mention, parallel coordinate chart and a scatterplot matrix, should be fine since each variable has its own scale. If your software forces all scales to be the same, the usual trick is to "standardize" the values by subtracting the mean and dividing by the standard deviation.

However, doing anything with standard deviation counts as analysis (so you may need to remove outliers after all) and assumes a Normal distribution. So first perform a distribution analysis of each variable -- at least look at a histogram and preferably do some kind of goodness of fit test against a normal curve. If the distribution is more like a Log Normal or something else, you can apply a transform to the data to make it Normal for better visualization (or use a log scale in the visualization).

Removing outliers is a bit of an art and should require knowledge of the data (because often the outliers are the interesting parts). The most common definition of outliers for box plots is points beyond 1.5 times the inter-quartile range from the box quartiles, which is usually farther than two standard deviation.

Another technique is often called "robust" fitting. You fit a Normal curve to the data and remove points from the tails that where the value of the fitted curve is very small. Repeat those two steps until it stabilizes (usually 1 to 3 times).

xan
A: 

Here is how I would go about doing it in SQL Server.

The query below will get the average weight from a fictional Scale table holding a single weigh-in for each person while not permitting those who are overly fat or thin to throw off the more realistic average:

select w.Gender, Avg(w.Weight) as AvgWeight
    from ScaleData w
    join ( select d.Gender, Avg(d.Weight) as AvgWeight, 
                  2*STDDEVP(d.Weight) StdDeviation
             from ScaleData d
            group by d.Gender
         ) d
      on w.Gender = d.Gender
     and w.Weight between d.AvgWeight-d.StdDeviation 
                      and d.AvgWeight+d.StdDeviation
   group by w.Gender  

There may be a better way to go about this, but it works and works well. If you have come across another more efficient solution, I’d love to hear about it.

NOTE: the above removes the top and bottom 5% of outliers out of the picture for purpose of the Average. You can adjust the number of outliers removed by adjusting the 2* in the 2*STDDEVP as per: http://en.wikipedia.org/wiki/Standard_deviation

Jason Southwell