views:

146

answers:

2

Hello guy,

I'm searching the John Tukey algorithm which compute a "resistant line" or "median-median line" on my linear regression with R.

A student on a mailling list explain this algorithm in these terms :

"The way it's calculated is to divide the data into three groups, find the x-median and y-median values (called the summary point) for each group, and then use those three summary points to determine the line. The outer two summary points determine the slope, and an average of all of them determines the intercept."

Article about John tukey's median median for curious : http://www.johndcook.com/blog/2009/06/23/tukey-median-ninther/

Do you have an idea of where i could find this algorithm or R function ? In which packages, Thanks a lot !

+1  A: 

Check function rlm in MASS package I think this is what you want.

ilya
I doubt that -- `rlm()` is a robust routine but OP asked about something else.
Dirk Eddelbuettel
+5  A: 

There's a description of how to calculate the median-median line here. An R implementation of that is

median_median_line <- function(x, y, data)
{
  if(!missing(data))
  {
    x <- eval(substitute(x), data) 
    y <- eval(substitute(y), data) 
  }

  stopifnot(length(x) == length(y))

  #Step 1
  one_third_length <- floor(length(x) / 3)
  groups <- rep(1:3, times = switch((length(x) %% 3) + 1,
     one_third_length,
     c(one_third_length, one_third_length + 1, one_third_length),
     c(one_third_length + 1, one_third_length, one_third_length + 1)
  ))

  #Step 2
  x <- sort(x)
  y <- sort(y)

  #Step 3
  median_x <- tapply(x, groups, median)                                 
  median_y <- tapply(y, groups, median)

  #Step 4
  slope <- (median_y[3] - median_y[1]) / (median_x[3] - median_x[1])
  intercept <- median_y[1] - slope * median_x[1]

  #Step 5
  middle_prediction <- intercept + slope * median_x[2]
  intercept <- intercept + (median_y[2] - middle_prediction) / 3
  c(intercept = unname(intercept), slope = unname(slope))
}

To test it, here's the second example from that page:

dfr <- data.frame(
  time = c(.16, .24, .25, .30, .30, .32, .36, .36, .50, .50, .57, .61, .61, .68, .72, .72, .83, .88, .89),
  distance = c(12.1, 29.8, 32.7, 42.8, 44.2, 55.8, 63.5, 65.1, 124.6, 129.7, 150.2, 182.2, 189.4, 220.4, 250.4, 261.0, 334.5, 375.5, 399.1))

median_median_line(time, distance, dfr) 
#intercept     slope 
#   -113.6     520.0

Note the slightly odd way of specifying the groups. The instructions are quite picky about how you define group sizes, so the more obvious method of cut(x, quantile(x, seq.int(0, 1, 1/3))) doesn't work.

Richie Cotton
Wow ! Thx a lot Richie Cotton !! It's perfect ;)
reyman64