I have a data frame called DF in the following format (this is just a made-up, simplified sample):

eval.num, eval.count, fitness, fitness.mean, green.h.0, green.v.0, offset.0, random
1         1           1500     1500          100        120        40       232342
2         2           1000     1250          100        120        40       11843
3         3           1250     1250          100        120        40       981340234
4         4           1000     1187.5        100        120        40       4363453
5         1           2000     2000          200        100        40       345902
6         1           3000     3000          150        90         10       943
7         1           2000     2000          90         90         100      9304358
8         2           1800     1900          90         90         100      284333

However, the eval.count column is incorrect and I need to fix it. It should report the number of rows with the same values of green.h.0, green.v.0, and offset.0, looking only at the previous rows.

The example above shows the expected values, but assume the current ones are incorrect.

How can I add a new column (say "count") that counts all previous rows with the same values of the specified variables?

I have gotten help on a similar problem of just selecting all rows with the same values for specified columns, so I suppose I could just write a loop around that, but it seems inefficient to me.

+3  A: 

Ok, let's first do it in the easy case where you just have one column.

> data <- rep(sample(1000, 5),
              sample(5, 5))
> head(data)
[1] 435 435 435 278 278 278

Then you can just use rle to figure out the contiguous sequences:

> sequence(rle(data)$lengths)
[1] 1 2 3 1 2 3 4 5 1 2 3 4 1 2 1

Or altogether:

> head(cbind(data, sequence(rle(data)$lengths)))
     data
[1,]  435 1
[2,]  435 2
[3,]  435 3
[4,]  278 1
[5,]  278 2
[6,]  278 3

For your case with multiple columns, there are probably a bunch of ways of applying this solution. Easiest might be to just paste the columns you care about together to form a single vector.
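
For instance, a minimal sketch of that idea, assuming a data frame DF with the green.h.0, green.v.0 and offset.0 columns from the question:

# Sketch: paste the key columns into one vector, then reuse the rle trick above.
key <- paste(DF$green.h.0, DF$green.v.0, DF$offset.0)
DF$count <- sequence(rle(key)$lengths)

Note that, like rle on a single column, this counts contiguous runs: the counter restarts whenever the pasted key changes, even if the same combination appeared earlier in the data.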

Jonathan Chang
+1  A: 

Okay, I used the answer I got on another question and worked out a loop that I think will work. This is what I'm going to use:

cmpfun2 <- function(r) {
    # r is one row of DF; r[1] is its row number (eval.num),
    # and columns 27:51 hold the values that have to match
    count <- 0
    if (r[1] > 1) {
        for (row in 1:(r[1] - 1)) {
            # compare the key columns of this row against each earlier row
            if (all(r[27:51] == DF[row, 27:51, drop = FALSE])) {
                count <- count + 1
            }
        }
    }
    return(count)
}
brows <- apply(DF, 1, cmpfun2)
print(brows)

Please comment if I made a mistake and this won't work, but I think I've figured it out. Thanks!

Matt
Your solution looks OK, but if I were you I'd at least try to understand Jonathan's answer above. rle is often a quicker and more robust alternative to loops and index gymnastics.
Leo Alekseyev
A: 

I had the same problem; it gets especially annoying with large data frames. My latest solution is to create a subset with the unique observations and then delete those rows from the original data frame. It sounds stupid and overly complicated, but it's pretty fast and straightforward:

d <- unique(DF[, c("column1", "column2")])  # select rows with unique combinations of entries for the 2 columns
d1 <- as.numeric(rownames(d)) * (-1)        # make the rownames numeric and negative
d2 <- DF[d1, ]                              # data frame with the duplicated values: drop the unique rows from the original data frame

The new data frame d2 now has all observations with non-unique combinations of entries in the specified columns. I'm not the biggest coder and didn't get your loop, but I think it works fine, doesn't it?
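
For example, a quick sanity check on a toy data frame (the data and column names here are made up just to illustrate the idea):

toy <- data.frame(column1 = c(1, 1, 2, 2, 3),
                  column2 = c("a", "a", "b", "c", "a"))
d  <- unique(toy[, c("column1", "column2")])  # first occurrence of each combination (rows 1, 3, 4, 5)
d1 <- -as.numeric(rownames(d))                # negative row indices of those first occurrences
d2 <- toy[d1, ]                               # what remains are the repeated rows
d2
#   column1 column2
# 2       1       a

Note this relies on the data frame still having its default numeric rownames; otherwise as.numeric(rownames(d)) would not give valid row indices.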

SKr