views:

160

answers:

3

Because SO is a bit slow lately, I'm posting an easy question. I would appreciate it if big fishes stayed on the bench for this one and give rookies a chance to respond.

Sometimes we have objects that have a ridiculous amount of large list elements (vectors). How would you "unlist" this object into a single vector. Show proof that your method is faster than unlist().

+1  A: 

A non-unlist() solution would have to be pretty darned fast to beat unlist() would it not? Here it takes less than two second to unlist a list with 2000 numeric vectors of length 100,000 each.

> bigList2 <- as.list(data.frame(matrix(rep(rnorm(1000000), times = 200), 
+                                       ncol = 2000)))
> print(object.size(bigList2), units = "Gb")
1.5 Gb
> system.time(foo <- unlist(bigList2, use.names = FALSE))
   user  system elapsed 
  1.897   0.000   2.019

With bigList2 and foo in my workspace, R is using ~9Gb of my available memory. The key is use.names = FALSE. Without it unlist() is painfully slow. Exactly how slow I'm still waiting to find out...

We can speed this up a little bit more by setting recursive = FALSE and then we have effectively the same as VitoshKa's answer (two representative timings):

> system.time(foo <- unlist(bigList2, recursive = FALSE, use.names = FALSE))
   user  system elapsed 
  1.379   0.001   1.416
> system.time(foo <- .Internal(unlist(bigList2, FALSE, FALSE)))
   user  system elapsed 
  1.335   0.000   1.344

... finally the use.names = TRUE version finished...:

> system.time(foo <- unlist(bigList2, use = TRUE))
    user   system  elapsed 
2307.839   10.978 2335.815

and it was consuming all my systems 16Gb of RAM so I gave up at that point...

Gavin Simpson
+4  A: 

If you don't need names and your list is one level deep, then if you can beat

.Internal(unlist(your_list, FALSE, FALSE))

I will vote up everything you do on SO for the next 1 year!!!

[Update: if non-unique names are needed and the list is not recursive, here is a version which improves over the unlist 100 times

 myunlist <- function(l){
    names <- names(l)
    vec <- unlist(l, F, F)
    reps <- unlist(lapply(l, length), F, F)
    names(vec) <- rep(names, reps)
    vec
    }

 myunlist(list(a=1:3, b=2))
 a a a b 
 1 2 3 2 

 > tl <- list(a = 1:20000, b = 1:5000, c = 2:30)
 > system.time(for(i in 1:200) unlist(tl))
 user  system elapsed 
 22.97    0.00   23.00 

 > system.time(for(i in 1:200) myunlist(tl))
 user  system elapsed 
 0.2     0.0     0.2 

 > system.time(for(i in 1:200) unlist(tl, F, F))
 user  system elapsed 
 0.02    0.00    0.02 

]

[Update2: Responce to challenge Nr3 from Richie Cotton.

bigList3 <- replicate(500, rnorm(1e3), simplify = F)

unlist_vit <- function(l){
    names(l) <- NULL
    do.call(c, l)
    }

library(rbenchmark)

benchmark(unlist = unlist(bigList3, FALSE, FALSE),
          rjc    = unlist_rjc(bigList3),
          vit    = unlist_vit(bigList3),
          order  = "elapsed",
          replications = 100,
          columns = c("test", "relative", "elapsed")
          )

    test  relative elapsed
1 unlist   1.0000    2.06
3    vit   1.4369    2.96
2    rjc   3.5146    7.24

]

PS: I assume a "big fish" is the one with more reputation than you. So I am pretty much small here :).

VitoshKa
+1 It is probably not significant, but in my test your version was a bit faster than @ucfagls'. Yet the biggest speedup is gathered from use.names=F.
mbq
I understood from Roman's question that his desire was to replace the build-in "unlist" with something smarter. As far as I am concerned this is not possible when names are not required.
VitoshKa
Now that's an offer a person can't refuse! If I write 6 items a day I may have marginal chance of catching Dirk. FWIW, I consider big fishes people with several 1000 points.
Roman Luštrik
On my system there was nothing to choose between `unlist(bigList2, recursive = FALSE, use.names = FALSE)`, and `.Internal(unlist(bigList2, FALSE, FALSE)`. The overhead in `unlist()` before the `.Internal` call is negligible. Unless I'm really, really sure I have the correct object, I try to stay away from `.Internal` just in case, as you can crash R if you get something wrong or provide something the function wasn't expecting.
Gavin Simpson
@VitoshKa, you are right, I'm looking for an alternative. Names are dispensable.
Roman Luštrik
@ucfagis , that's right for big vectors and small number of iterations. But if your list consists only small vectors and you run long simulations, the improvement of .Internal could be as big as double!!!
VitoshKa
@VitoshKa; Oh yes, agreed. I mean't in relation to this problem with big lists containing long vectors as components (after Roman's specification). If you do this repeatedly in testing then of course you'll also incur the small overhead repeatedly, but I suspect you're not going to be unlisting a big list like `bigList2` (from my answer) too often in the same R session unless you have oodles of RAM ;-)
Gavin Simpson
Very cute response to my challenge. One upvote duly provided.
Richie Cotton
A: 

As a medium-size fish, I'm jumping in with a first-attempt solution that gives a benchmark for little fishes to beat. It's about 3 times slower than unlist.

I'm using a smaller version of ucfagls's test list. (Since it fits in memory better.)

bigList3 <- as.list(data.frame(matrix(rep(rnorm(1e5), times = 200), ncol = 2000)))

The basic idea is to create one long vector to store the answer, then loop over list items copying values from the list.

unlist_rjc <- function(l)
{
  lengths <- vapply(l, length, FUN.VALUE = numeric(1), USE.NAMES = FALSE)
  total_len <- sum(lengths)
  end_index <- cumsum(lengths)
  start_index <- 1 + c(0, end_index)
  v <- numeric(total_len)
  for(i in seq_along(l))
  {
    v[start_index[i]:end_index[i]] <- l[[i]]
  }
  v
}

t1 <- system.time(for(i in 1:10) unlist(bigList2, FALSE, FALSE))
t2 <- system.time(for(i in 1:10) unlist_rjc(bigList2))
t2["user.self"] / t1["user.self"]  # 3.08

Challenges for little fishes:
1. Can you extend it to deal with other types than numeric?
2. Can you get it to work with recursion (nested lists)?
3. Can you make it faster?

I'll upvote anyone with less points than me whose answer meets one or more of these mini-challenges.

Richie Cotton
Medium-size? According to [ranking of top users in R tag](http://stackoverflow.com/tags/r/topusers) you are 10th answerers in last 30 days, and 7th all time ;)
Marek
Wow. Hadn't realised that. Yay me. Anyway, my offer still stands. Improve/beat my answer -> get an upvote.
Richie Cotton
Also, challenge 1 is easy.
Richie Cotton
on my machine your thing is about 5 times slower.
VitoshKa
@Richie , After clean restart it's only 3.5 times slower. I updated my answer for your challenge nr3. Waiting for upvote:)
VitoshKa