I have data which looks like this (this is test data for illustration):

test <- matrix(c(1, 1, 1, 2, 2, 2 , 529, 528, 528, 495, 525, 510,557, 535, 313,502,474, 487 ), nr=6, dimnames=list(c(1,2,3,4,5,6),c("subject", "rt1", "rt2")))

And I need to turn it into this:

test2<-matrix(c(1,1,1,2,2,2,529,528,528,495,525,510,"slow","slow","fast","fast","slow","slow",557, 535, 313,502,474, 487,"fast","fast","slow","slow","fast","fast"), nr=6, dimnames=list(c(1,2,3,4,5,6),c("subject", "rt1","speed1", "rt2","speed2")))

The speed1 column is calculated thus: calculate the median rt1 for the subject. If the individual value is less than the median it scores fast. If the individual cell value of rt1 is more than the median it scores slow. If the cell value is at the median, the cell is removed from the analysis (delete or NA) and the median for that subject is recalculated. This process is repeated for the speed2 column, but using rt2.

Perhaps some kind of if statement?

To clarify: I want the median for each subject (there are 40 in total) and for any values that are at the median (for that subject) to be excluded and the median recalculated (for that subject).

+4  A: 

EDITED TO ACTUALLY DO SUBJECT MEDIANS

Your examples use matrices, when in actual fact what you are most likely working with are data frames. So let's get that out of the way first. A matrix requires a single type of data. I don't get the impression you really want your numbers stored as text, and your other variables can't be numbers. Therefore, test2 should probably start as...

test2 <- data.frame(test)

and probably

test2$subject <- factor(test2$subject)

You might want to add a column that actually is the median/subject just to check what you're doing is correct. From here on I'll just work with RT1 and you can replicate for RT2.

test2$rt1med <- ave(test2$rt1, test2$subject, FUN = median)

This generates a column that holds the median for each subject. You could have kept it as a standalone vector rather than a column if you wished. Now, you are correct, it is as simple as an if statement, an ifelse() statement to be exact.

test2$speed1 <- ifelse(test2$rt1 > test2$rt1med, 'slow', 'fast')

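I've left the rows that sit at the median in the frame (the ifelse() above labels them 'fast', since they are not greater than the median). You said you wanted them gone. OK, just set the frame to itself without those rows...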

test2 <- test2[test2$rt1 != test2$rt1med,]

But really, it's probably best to just keep track of the actual median values by indicating them, perhaps with NA...

test2$rt1[test2$rt1 == test2$rt1med] <- NA
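
To replicate the same steps for RT2, one option is to wrap them in a small function. The sketch below is only an illustration: classify_speed is a hypothetical helper name, and it marks values at the median as NA using the first-pass median only (it does not recalculate the median afterwards).

```r
# Test data from the question
test <- matrix(c(1, 1, 1, 2, 2, 2,
                 529, 528, 528, 495, 525, 510,
                 557, 535, 313, 502, 474, 487),
               nrow = 6,
               dimnames = list(c(1, 2, 3, 4, 5, 6), c("subject", "rt1", "rt2")))
test2 <- data.frame(test)
test2$subject <- factor(test2$subject)

# Hypothetical helper (name is an assumption): classify one RT column against
# the per-subject median, returning NA where a value equals that median.
classify_speed <- function(rt, subject) {
  med <- ave(rt, subject, FUN = median)   # per-subject median, one per row
  ifelse(rt == med, NA, ifelse(rt > med, "slow", "fast"))
}

test2$speed1 <- classify_speed(test2$rt1, test2$subject)
test2$speed2 <- classify_speed(test2$rt2, test2$subject)
test2
```

Keeping the at-median rows as NA, rather than deleting them, means a value dropped for rt1 does not also throw away that row's rt2.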
John
Thank you for the help John. I am not a programmer, but I see the value in sticking with and learning R. This is very near the end of my phd research and I've picked up some good tips (and some bad habits). Can I ask a question? Does this recalculate the median based on the NA responses ie is the median recalculated with the score removed?
RSoul
Am I wrong in concluding that this doesn't calculate the median per subject? If the subject number is 1, I want the median for that subject. If one of the values is at the median (for that subject), it needs excluding from the analysis and the median for the subject recalculated. I think the above code only checks against the median of all subjects.
RSoul
But it is now...
John
@John: Indeed, +1
Joris Meys
+2  A: 

Following on from John's answer, to do per subject medians, use tapply:

test2 <- data.frame(test)
test2$subject <- factor(test2$subject)
test3 <- data.frame(subject=levels(test2$subject),median.rt1=tapply(test2$rt1,test2$subject,median),median.rt2=tapply(test2$rt2,test2$subject,median))
test2 <- merge(test2,test3)
test2$speed1 <- ifelse(test2$rt1 < test2$median.rt1, 'fast', 'slow') 
test2$speed2 <- ifelse(test2$rt2 < test2$median.rt2, 'fast', 'slow')

To remove the values at the median you can use,

subset(test2,!(rt1==median.rt1 | rt2==median.rt2))

Alternatively, use a tolerance-based test if you expect numerical representation error to cause problems with the straight equality test. You can then run the tapply and merge lines again (though maybe subsetting away the original median columns first) to calculate new medians, and redo the speed classifications should you want to. Personally I would use a nested ifelse to classify as fast, slow or average though.
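
A minimal sketch of the nested ifelse idea combined with a tolerance test might look like this. It is an illustration under stated assumptions: the tolerance value tol is arbitrary, and ave() is swapped in for the tapply/merge steps only to keep the example short.

```r
# Test data from the question
test <- matrix(c(1, 1, 1, 2, 2, 2,
                 529, 528, 528, 495, 525, 510,
                 557, 535, 313, 502, 474, 487),
               nrow = 6,
               dimnames = list(c(1, 2, 3, 4, 5, 6), c("subject", "rt1", "rt2")))
test2 <- data.frame(test)

# Per-subject median (ave() used here instead of tapply + merge for brevity)
test2$median.rt1 <- ave(test2$rt1, test2$subject, FUN = median)

# Assumed tolerance; pick a value suited to the scale of the measurements
tol <- 1e-8
diff1 <- test2$rt1 - test2$median.rt1

# Three-way classification: "average" when within tol of the median
test2$speed1 <- ifelse(abs(diff1) < tol, "average",
                       ifelse(diff1 < 0, "fast", "slow"))
test2
```

Rows labelled "average" can then be dropped with subset(test2, speed1 != "average") if they should be excluded.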

James
This answer got me nearest to getting the data I was looking for. Many thanks for the consideration.
RSoul
+2  A: 

Another solution that takes into account the "recalculation" of the median :

test2 <- data.frame(test)

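makespeed <- function(x){
    id <- x != median(x)            # drop values equal to the median
    d <- x[id] - median(x[id])      # difference from the recalculated median
    ifelse(d < 0, "fast", "slow")   # below the median means a faster response
}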

tapply(test2$rt1,as.factor(test2$subject),makespeed)
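
If the labels need to go back into the data frame, the result has to have one entry per original row. The following is a rough sketch of one way to do that, not a definitive implementation: makespeed_full is a hypothetical name, it labels values below the recalculated median as "fast" (following the question's definition), and it returns NA for values removed at the first median or sitting exactly at the recalculated one.

```r
# Test data from the question
test <- matrix(c(1, 1, 1, 2, 2, 2,
                 529, 528, 528, 495, 525, 510,
                 557, 535, 313, 502, 474, 487),
               nrow = 6,
               dimnames = list(c(1, 2, 3, 4, 5, 6), c("subject", "rt1", "rt2")))
test2 <- data.frame(test)

# Hypothetical variant (name is an assumption): one label per input value
makespeed_full <- function(x) {
  out <- rep(NA_character_, length(x))
  keep <- x != median(x)                 # values not at the first median
  if (any(keep)) {
    d <- x[keep] - median(x[keep])       # difference from recalculated median
    out[keep] <- ifelse(d < 0, "fast", ifelse(d > 0, "slow", NA))
  }
  out
}

# Apply per subject, then put the pieces back in the original row order
groups <- split(test2$rt1, test2$subject)
test2$speed1 <- unsplit(lapply(groups, makespeed_full), test2$subject)
test2
```

With the test data, subject 1 ends up with no label at all for rt1, because the single value that survives the first pass is itself the recalculated median.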

Now think about it for a second. There are three possible cases:

  1. You have an even number of cases, and the median is defined as the average of the two middle cases. If these two are not equal to each other, then no value is equal to the median.

  2. You have an odd number of cases, so the median is equal to 1 value in the data. If you remove that one, you have an even number of cases and you're back at case 1.

  3. You have a series of values equal to the median. You will end up with an even number of cases, of which the two middle ones are different. One is lower than the previous calculated median, one is higher. So you're back to case 1.

So in fact, if you're really interested in the difference with the median, you can use my code. If you only want to know whether it's fast or slow, then you don't even have to recalculate the median. After removing the necessary values, cases that were higher/lower than the old median will still be higher/lower than the new median. So basically, although James' and John's code technically doesn't do what you asked, it doesn't make a difference. In fact, it makes it easier to reconstruct the dataframe afterwards.
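
As a quick check of this on the rt2 column of the test data, the sketch below compares the labels from the first-pass median with the labels from the recalculated median. It only demonstrates the point for this particular column; it is not a proof for other data.

```r
# Test data from the question
test <- matrix(c(1, 1, 1, 2, 2, 2,
                 529, 528, 528, 495, 525, 510,
                 557, 535, 313, 502, 474, 487),
               nrow = 6,
               dimnames = list(c(1, 2, 3, 4, 5, 6), c("subject", "rt1", "rt2")))
rt <- test[, "rt2"]
subject <- test[, "subject"]

# First pass: per-subject median, then drop the values sitting at it
med1 <- ave(rt, subject, FUN = median)
keep <- rt != med1
first_pass <- ifelse(rt[keep] < med1[keep], "fast", "slow")

# Second pass: recalculate the median on the remaining values only
med2 <- ave(rt[keep], subject[keep], FUN = median)
second_pass <- ifelse(rt[keep] < med2, "fast", "slow")

# TRUE when both passes give the same labels
same_labels <- all(first_pass == second_pass)
same_labels
```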

The only cases in which this no longer works are when you have 1 value left (that will then be the median and should be removed, so there is theoretically no result; see subject 1 in rt1), or when all values are equal (in that case, all values get removed and, again, there is no result).

Joris Meys
Thanks. I'll go away and think about it. Please note, the reason I want to do it this way is because I am replicating somebody else's technique. So it doesn't really matter why I am doing it, I want to compare his technique on my data with the original findings.
RSoul
@Kafkaesque: I just wanted to indicate why the code of the other two answers is, for your case, perfectly valid, even though it doesn't do exactly what you had in mind. The result however is exactly the same.
Joris Meys