Suppose you have a list of values:

x <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(4, 5, 6))

I would like to find the unique values from all the list elements combined. So far, the following code has done the trick:

unique(unlist(x))

Does anyone know a more efficient way? I have a hefty list with a lot of values and would appreciate any speed-up.

+3  A: 

The solution suggested by Marek (in the comments below) is the best answer to the original question. See below for a discussion of other approaches and why Marek's is the most useful.

> unique(unlist(x, use.names = FALSE))
[1] 1 2 3 4 5 6

Discussion

A faster solution is to compute unique() on the components of your x first and then do a final unique() on those results. This will only work if the components of the list have the same number of unique values, as they do in both examples below.

First your version, then the double unique approach:

> unique(unlist(x))
[1] 1 2 3 4 5 6
> unique.default(sapply(x, unique))
[1] 1 2 3 4 5 6

We have to call unique.default directly because unique() has a matrix method that works along one margin (by default it removes duplicate rows); using the default method is fine here, as a matrix is just a vector with a dimension attribute.
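
To see the difference between the two methods, here is a quick sketch using the x from the question (the output is what I would expect from a standard R session):

> m <- sapply(x, unique)  ## 3 x 3 matrix, one column per component
> unique(m)               ## matrix method: removes duplicate rows, keeps the matrix
     a b c
[1,] 1 2 4
[2,] 2 3 5
[3,] 3 4 6
> unique.default(m)       ## default method: treats m as a plain vector
[1] 1 2 3 4 5 6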

Marek, in the comments to this answer, notes that the slow speed of the unlist approach is potentially due to the names on the list. Marek's solution is to make use of the use.names argument of unlist(): setting it to FALSE gives a faster solution than the double unique version above. For the simple x of Roman's post we get

> unique(unlist(x, use.names = FALSE))
[1] 1 2 3 4 5 6

Marek's solution will work even when the number of unique elements differs between components.

Here is a larger example with some timings of all three methods:

## Create a large list (1000 components, each of length 1000)
DF <- as.list(data.frame(matrix(sample(1:10, 1000*1000, replace = TRUE), 
                                ncol = 1000)))
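
As a quick sanity check (not part of the original timings), DF is a plain, one-level list of 1000 named numeric vectors; the names, inherited from the data frame columns, are what will matter below:

> length(DF)          ## 1000 components
[1] 1000
> length(DF[[1]])     ## each of length 1000
[1] 1000
> head(names(DF), 3)  ## names inherited from the data frame columns
[1] "X1" "X2" "X3"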

Here are the results for the three approaches using DF:

> ## Do the three approaches give the same result:
> all.equal(unique.default(sapply(DF, unique)), unique(unlist(DF)))
[1] TRUE
> all.equal(unique(unlist(DF, use.names = FALSE)), unique(unlist(DF)))
[1] TRUE
> ## Timing Roman's original:
> system.time(replicate(10, unique(unlist(DF))))
   user  system elapsed 
  12.884   0.077  12.966
> ## Timing double unique version:
> system.time(replicate(10, unique.default(sapply(DF, unique))))
   user  system elapsed 
  0.648   0.000   0.653
> ## timing of Marek's solution:
> system.time(replicate(10, unique(unlist(DF, use.names = FALSE))))
   user  system elapsed 
  0.510   0.000   0.512

This shows that the double unique approach (applying unique() to the individual components and then unique() to those smaller sets of unique values) is a lot quicker than Roman's original, but that speed-up is purely due to the names on the list DF. If we tell unlist() not to use the names, Marek's solution is marginally quicker than the double unique for this problem. As Marek's solution uses the correct tool properly, and is quicker than the work-around, it is the preferred solution.

The big gotcha with the double unique approach is that it will only work if, as in the two examples here, each component of the input list (DF or x) has the same number of unique values. In such cases sapply simplifies the result to a matrix, which allows us to apply unique.default. If the components of the input list have differing numbers of unique values, the double unique solution will fail, as the short example below shows.
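
For example (a small sketch with a made-up list y, not from the original post), when the components have differing numbers of unique values sapply() cannot simplify its result to a matrix, so there is nothing suitable to hand to unique.default(), whereas Marek's version is unaffected:

> y <- list(a = c(1, 2, 3), b = c(2, 2))  ## 3 unique values vs 1
> sapply(y, unique)                       ## no simplification: a list, not a matrix
$a
[1] 1 2 3

$b
[1] 2

> unique(unlist(y, use.names = FALSE))    ## Marek's solution still works
[1] 1 2 3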

Gavin Simpson
Check `system.time(replicate(10, unique(unlist(DF,FALSE,FALSE))))`. It's faster.
Marek
So it is. You only need `unique(unlist(DF, use.names = TRUE))` though to get the substantial speed up. I knew names on data frames were a source of slow-down, but it didn't occur to me that this would be a problem here. There isn't, as far as I can see, any recursion going on, as `DF` and `x` only have a single level of components. You should post that as an answer as it uses the correct tools directly.
Gavin Simpson
Improvement from 47 seconds to 0.05 seconds. I would call that significant. :)
Roman Luštrik
@ucfagls You mean `unique(unlist(DF, use.names = FALSE))`? Please include it in your answer if you like it ;)
Marek
@Marek: oops, yes, I meant `unique(unlist(DF, use.names = FALSE))`. Will add your improvement to my answer with suitable attribution.
Gavin Simpson
I've now added Marek's solution to my answer with timings. A note on why these timings have changed: these new ones were done on my workstation, whilst the ones in the original email were done on a newer machine at home.
Gavin Simpson
I thought the s(l)apply family works on each list element independently. How does one reconcile that with the above code, where sapply compares values between list elements?
Roman Luštrik
`sapply` applies `unique` to each component of `DF` and returns a matrix, each column containing the unique values of one component. We then treat that matrix as a vector and ask for the unique values from this much smaller set.
Gavin Simpson
Of course, as I have just noted in the Answer, the double unique approach only works here if all the components have the same number of unique values. Marek's one works regardless.
Gavin Simpson