ansaurus

Question

How to remove partial duplicates from a data frame?

Answer 1

+4 A:

I would use subset combined with duplicated to filter non-unique timestamps in the second data frame:

R> df_ <- read.table(textConnection('
                     ts         v
1 "2009-09-30 10:00:00" -2.081609
2 "2009-09-30 10:15:00" -2.079778
3 "2009-09-30 10:15:00" -2.113531
4 "2009-09-30 10:15:00" -2.124716
5 "2009-09-30 10:15:00" -2.102117
6 "2009-09-30 10:30:00" -2.093542
7 "2009-09-30 10:30:00" -2.092626
8 "2009-09-30 10:45:00" -2.086339
9 "2009-09-30 11:00:00" -2.080144
'), as.is=TRUE, header=TRUE)

R> subset(df_, !duplicated(ts))
                   ts      v
1 2009-09-30 10:00:00 -2.082
2 2009-09-30 10:15:00 -2.080
6 2009-09-30 10:30:00 -2.094
8 2009-09-30 10:45:00 -2.086
9 2009-09-30 11:00:00 -2.080

Update: To select a specific value, you can use aggregate:

aggregate(df_$v, by=list(df_$ts), function(x) x[1])  # first value
aggregate(df_$v, by=list(df_$ts), function(x) tail(x, n=1))  # last value
aggregate(df_$v, by=list(df_$ts), function(x) max(x))  # max value
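
As a possible alternative to the aggregate calls above when the goal is to keep whole rows, duplicated() takes a fromLast argument. The following is a minimal sketch on a shortened copy of the data; the name last_rows is only illustrative.

```r
# Shortened copy of the data frame from the answer (first five rows only).
df_ <- data.frame(
  ts = c("2009-09-30 10:00:00", "2009-09-30 10:15:00", "2009-09-30 10:15:00",
         "2009-09-30 10:30:00", "2009-09-30 10:30:00"),
  v  = c(-2.081609, -2.079778, -2.113531, -2.093542, -2.092626),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat after the first occurrence.
# With fromLast = TRUE the scan starts at the end, so the LAST row
# for each timestamp is the one left unflagged.
last_rows <- df_[!duplicated(df_$ts, fromLast = TRUE), ]
print(last_rows)
```

The same condition should also work inside subset, as in subset(df_, !duplicated(ts, fromLast=TRUE)). This relies on the rows already being in the desired order within each timestamp.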

rcs 2009-11-20 10:24:49

This works, thanks! But how did you find it in the documentation? Even now that I know the answer, I can't guess where to look. Also: suppose I wanted to choose which value to keep (say, the last one); does subset offer that possibility?

mariotomo 2009-11-20 13:04:45

One addition: can 'subset' also be used to remove duplicates from vectors? If so, how?

mariotomo 2009-11-20 16:19:47

It could be used, but `unique(vec)` is simpler.
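
For illustration, here is a small sketch comparing the options on a vector. The vector vec and the names u1, u2, u3 are made up for this example.

```r
# A made-up character vector with repeated entries.
vec <- c("a", "b", "a", "c", "b")

u1 <- unique(vec)                    # simplest form
u2 <- vec[!duplicated(vec)]          # plain logical indexing
u3 <- subset(vec, !duplicated(vec))  # subset() also accepts vectors

print(u1)
```

All three keep the first occurrence of each value, in the original order.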

rcs 2009-11-20 18:05:20

Answer 2

+3 A:

I think you are looking at data structures for time-indexed objects, and not for a dictionary. For the former, look at the zoo and xts packages, which offer much better time-based subsetting:

R> library(xts)
R> X <- xts(data.frame(val=rnorm(10)),
            order.by=Sys.time() + sort(runif(10,10,300)))
R> X
                        val
2009-11-20 07:06:17 -1.5564
2009-11-20 07:06:40 -0.2960
2009-11-20 07:07:50 -0.4123
2009-11-20 07:08:18 -1.5574
2009-11-20 07:08:45 -1.8846
2009-11-20 07:09:47  0.4550
2009-11-20 07:09:57  0.9598
2009-11-20 07:10:11  1.0018
2009-11-20 07:10:12  1.0747
2009-11-20 07:10:58  0.7062
R> X["2009-11-20 07:08::2009-11-20 07:09"]
                        val
2009-11-20 07:08:18 -1.5574
2009-11-20 07:08:45 -1.8846
2009-11-20 07:09:47  0.4550
2009-11-20 07:09:57  0.9598
R>

The X object is ordered by a time sequence. Make sure the index is of type POSIXct, so you may need to parse your dates first. Then we can just index for '7:08 to 7:09 on the given day'.

Dirk Eddelbuettel 2009-11-20 13:10:00

I'm actually just trying to remove duplicates; I don't do much with the timestamps. Thanks for pointing me to this library, but I would prefer not to add dependencies.

mariotomo 2009-11-20 13:22:57

Look at unique() and duplicated() for that, and still use POSIXct types.
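
A rough sketch of what that could look like in base R, with no extra packages. The sample timestamps are copied from the question's data; the format string and tz = "UTC" are assumptions, so adjust them to the real data.

```r
# Timestamps as character strings, as read from the file.
ts_chr <- c("2009-09-30 10:00:00", "2009-09-30 10:15:00",
            "2009-09-30 10:15:00", "2009-09-30 10:30:00")

# Parse to POSIXct; the time zone here is an assumption.
ts <- as.POSIXct(ts_chr, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# duplicated() and unique() work on POSIXct just as on other vectors.
keep      <- !duplicated(ts)  # logical mask: TRUE for first occurrences
unique_ts <- unique(ts)       # the distinct timestamps, still POSIXct

print(unique_ts)
```

The keep mask can then be used to index the rows of a data frame that holds the same timestamps.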

Dirk Eddelbuettel 2009-11-20 13:33:32

About POSIXct types: http://stackoverflow.com/questions/1803627/ helps in understanding why Dirk Eddelbuettel suggests using them.

mariotomo 2009-12-01 10:35:38
