A recurring analysis paradigm I encounter in my research is the need to subset the data by each distinct group ID value, perform a statistical analysis on each group in turn, and put the results in an output matrix for further processing/summarizing.

How I typically do this in R is something like the following:

data.mat <- read.csv("...")
groupids <- unique(data.mat$ID) #Assume there are then 100 unique groups

results <- matrix(rep("NA",300),ncol=3,nrow=100)

for(i in 1:100) {
  tempmat <- subset(data.mat, ID==groupids[i])

  # Run various stats on tempmat (correlations, regressions, etc.), checking to
  # make sure this specific group doesn't have NAs in the variables I'm using,
  # and assign the results to x, y, and z, for example.

  results[i,1] <- x
  results[i,2] <- y
  results[i,3] <- z
}

This ends up working for me, but depending on the size of the data and the number of groups I'm working with, this can take up to three days.

Besides branching out into parallel processing, is there any "trick" to make something like this run faster? For instance, converting the loop into something else (such as an apply call with a function containing the stats I want to run inside the loop), or eliminating the need to assign the subset of data to a variable at all?
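
To make that concrete, the kind of rewrite I have in mind would be something like the sketch below (untested; the Var1/Var2 columns and the mean/sd/cor calls are just placeholders for whatever x, y, and z actually are):

# Split the data once by ID instead of subsetting inside the loop,
# then apply a per-group function and bind the rows back together.
group.list <- split(data.mat, data.mat$ID)

per.group <- function(tempmat) {
  # placeholder stats standing in for x, y, and z above
  c(x = mean(tempmat$Var1, na.rm = TRUE),
    y = sd(tempmat$Var1, na.rm = TRUE),
    z = cor(tempmat$Var1, tempmat$Var2, use = "pairwise.complete.obs"))
}

results <- t(sapply(group.list, per.group))  # one row per group, columns x, y, z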

EDIT:

Maybe this is just common knowledge (or sampling error), but I tried subsetting with brackets in some of my code rather than using the subset command, and it seemed to provide a slight performance gain, which surprised me. Here is some code and output, using the same object names as above:

> system.time(for(i in 1:1000){ data.mat[data.mat$ID==groupids[i],] })
   user  system elapsed 
 361.41   92.62  458.32 

> system.time(for(i in 1:1000){ subset(data.mat, ID==groupids[i]) })
   user  system elapsed 
 378.44  102.03  485.94 

UPDATE:
In one of the answers, jorgusch suggested that I use the data.table package to speed up my subsetting. So, I applied it to a problem I ran earlier this week. In a dataset with a little over 1,500,000 rows and 4 columns (ID, Var1, Var2, Var3), I wanted to calculate two correlations within each group (indexed by the "ID" variable). There are slightly more than 50,000 groups. Below is my initial code (which is very similar to the above):

data.mat <- read.csv("//home....")
groupids <- unique(data.mat$ID)

results <- matrix(rep("NA",(length(groupids) * 3)),ncol=3,nrow=length(groupids))

for(i in 1:length(groupids)) {
  tempmat <- data.mat[data.mat$ID==groupids[i],]

  results[i,1] <- groupids[i]
  results[i,2] <- cor(tempmat$Var1,tempmat$Var2,use="pairwise.complete.obs")
  results[i,3] <- cor(tempmat$Var1,tempmat$Var3,use="pairwise.complete.obs")
}

I'm re-running that right now for an exact measure of how long that took, but from what I remember, I started it running when I got into the office in the morning and it finished sometime in mid-afternoon. Figure 5-7 hours.

Restructuring my code to use data.table....

library(data.table)

data.mat <- read.csv("//home....")
data.mat <- data.table(data.mat)

testfunc <- function(x,y,z) {
  temp1 <- cor(x,y,use="pairwise.complete.obs")
  temp2 <- cor(x,z,use="pairwise.complete.obs")
  list(temp1,temp2)
}

system.time(test <- data.mat[,testfunc(Var1,Var2,Var3),by="ID"])

   user  system elapsed 
  16.41    0.05   17.44 

Comparing the results from data.table to the ones I got from using a for loop to subset all IDs and record the results manually, they seem to have given me the same answers (though I'll have to check that a bit more thoroughly). That looks to be a pretty big speed increase.
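
(For that more thorough check, I'm planning something along these lines; just a sketch, assuming the loop's character matrix results converts cleanly to numeric columns:)

# Line both result sets up by ID and compare the correlation columns.
loop.res <- data.frame(ID = as.character(results[, 1]),
                       V1 = as.numeric(results[, 2]),
                       V2 = as.numeric(results[, 3]),
                       stringsAsFactors = FALSE)

dt.res <- as.data.frame(test)          # data.table names the two stats V1 and V2
dt.res$ID <- as.character(dt.res$ID)

merged <- merge(loop.res, dt.res, by = "ID", suffixes = c(".loop", ".dt"))
all.equal(merged$V1.loop, merged$V1.dt)
all.equal(merged$V2.loop, merged$V2.dt)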

UPDATE 2: Running the code using subsets finally finished up again:

   user     system   elapsed  
17575.79  4247.41   23477.00

UPDATE 3:
I wanted to see if anything worked out differently using the plyr package that was also recommended. This is my first time using it, so I may have done things somewhat inefficiently, but it still helped substantially compared to the for loop with subsetting.

Using the same variables and setup as before...

> library(plyr)
> data.mat <- read.csv("//home....")
> system.time(hmm <- ddply(data.mat, "ID", function(df)
+     c(cor(df$Var1, df$Var2, use="pairwise.complete.obs"),
+       cor(df$Var1, df$Var3, use="pairwise.complete.obs"))))

  user  system elapsed  
250.25    7.35  272.09  
+4  A: 

This is pretty much exactly what the plyr package is designed to make easier. However, it's unlikely that it will make things much faster - most of the time is probably spent doing the statistics.
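
A rough sketch of what that might look like for the loop in the question (the mean/sd/cor calls and the Var* columns are just placeholders for the actual statistics):

library(plyr)

# One call replaces the subset-and-assign loop: ddply splits data.mat by ID,
# applies the function to each piece, and stacks the results into a data frame.
results <- ddply(data.mat, "ID", function(df) {
  data.frame(x = mean(df$Var1, na.rm = TRUE),
             y = sd(df$Var1, na.rm = TRUE),
             z = cor(df$Var1, df$Var2, use = "pairwise.complete.obs"))
})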

hadley
Thanks for the response! That was my impression from skimming through the plyr manual a while back: it generally makes things simpler to write, but doesn't provide much in the performance realm. I'd probably use plyr, but I'm in a workgroup where most aren't very experienced with R, and the for loop, along with the explicitness of what's happening inside, seems to be pretty intuitive to read. Maybe I'll give plyr a try though and see if it provides any benefits. Every little bit helps :)
Adam
Also, according to Rprof output for some code I wrote in the above format, the vast majority of the time is spent subsetting the dataset.
Adam
Exactly how big is `data.mat`?
hadley
Talking of speed, data.table is much faster - see answer below.
+2  A: 

You have already suggested vectorizing and avoiding unnecessary copies of intermediate results, so you are certainly on the right track. Let me caution you not to do what I did and assume that vectorizing will give you a performance boost like it does in other languages (e.g., Python+NumPy, MATLAB).

An example:

# small function to time the results:
time_this = function(...) {
  start.time = Sys.time()
  eval(..., sys.frame(sys.parent(sys.parent())))
  end.time = Sys.time()
  print(end.time - start.time)
}

# data for testing: a 10000 x 1000 matrix of random doubles
a = matrix(rnorm(1e7, mean=5, sd=2), nrow=10000)

# two versions doing the same thing: calculating the mean for each row
# in the matrix
x = time_this( for (i in 1:nrow(a)){ mean( a[i,] ) } )
y = time_this( apply(X=a, MARGIN=1, FUN=mean) )

print(x)    # returns => 0.5312099
print(y)    # returns => 0.661242

The 'apply' version is actually slower than the 'for' version. (According to the author of The R Inferno, if you are doing this you are not vectorizing, you are 'loop hiding'.)

But where you can get a performance boost is by using built-ins. Below, I've timed the same operation as the two above, just using the built-in function 'rowMeans':

z = time_this(rowMeans(a))
print(z)    # returns => 0.03679609

An order of magnitude improvement versus the 'for' loop (and versus the 'apply' version).

The other members of the apply family are not just wrappers over a native 'for' loop, but that does not make them faster:

a = abs(floor(10*rnorm(1e6)))

time_this(sapply(a, sqrt))
# returns => 6.64 secs

time_this(for (i in 1:length(a)){ sqrt(a[i])})
# returns => 1.33 secs

'sapply' is about 5x slower compared with a 'for' loop.

Finally, w/r/t vectorized versus 'for' loops, I don't think I ever use a loop if I can use a vectorized function: the latter usually means fewer keystrokes and it's a more natural way (for me) to code, which is a different kind of performance boost, I suppose.

doug
Using apply is not vectorizing.
Shane
yep--will edit w/ example from lapply, sapply, etc.
doug
Forgive me if I misunderstood, but don't the results show that using sapply gives a 5x performance decrement rather than a boost? That's how it turned out when I ran your code fragments. In any case, thank you for the detailed example of the apply functions not always helping out. I don't think I have to worry about that in this case though, since unless I'm mistaken, the apply family doesn't apply (ha!) to the situation where you're subsetting chunks of the dataset and performing some sort of statistical operation on each chunk, because it's not going down a vector and doing the same thing to every item/row.
Adam
thanks--typo. edited answer.
doug
Doug, why the `time_this()` business when there is `system.time()`? And why not several reruns? I often do something like `mean(replicate(N, system.time(someRexpressionHere)["elapsed"]), trim=0.1)`, which drops the best and worst times and averages over the remainder. One could argue for summary() or median() or ... but the point is: one data point, as you show here, is not that informative, as you may get hit by function loading times etc. which are really one-offs.
Dirk Eddelbuettel
agreed, thanks Dirk.
doug
+2  A: 

Besides plyr, you can try the foreach package to get rid of the explicit loop counter, but I don't know if it will give you any performance benefit.

foreach nevertheless gives you a quite simple interface to parallel chunk processing if you have a multicore workstation (with the doMC/multicore packages; see "Getting Started with doMC and foreach" for details), in case you have ruled out parallel processing only because it is not very easy for students to understand. If that is not the only reason, plyr is a very good solution IMHO.
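
As a rough sketch for the correlation example above (assuming a multicore machine and the doMC backend; the core count is just an example):

library(foreach)
library(doMC)
registerDoMC(cores = 2)   # use however many cores your workstation has

groupids <- unique(data.mat$ID)

# %dopar% farms the groups out to the workers; .combine=rbind stacks the results
cors <- foreach(id = groupids, .combine = rbind) %dopar% {
  tempmat <- data.mat[data.mat$ID == id, ]
  c(cor(tempmat$Var1, tempmat$Var2, use = "pairwise.complete.obs"),
    cor(tempmat$Var1, tempmat$Var3, use = "pairwise.complete.obs"))
}

results <- data.frame(ID = groupids, cor12 = cors[, 1], cor13 = cors[, 2])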

zzr
+1  A: 

Personally, I don't find plyr very easy to understand. I prefer data.table, which is also faster. For instance, say you want to compute the standard deviation of the column my_column for each ID:

dt <- data.table(df)                       # one-time operation: convert df to a data.table
result.sd <- dt[,sd(my_column),by="ID"]    # result with each ID and its SD in the second column

Three statements of this kind and a cbind at the end - that is all you need (see the sketch at the end of this answer). You can also use dt to do some action for only one ID, without a subset command, in a new syntax:

result.sd.oneID <- dt[ID=="oneID", sd(my_column)]

The first argument refers to rows (i), the second to columns (j).

I find it easier to read than plyr, and it is more flexible, as you can also do sub-domains within a "subset"... The documentation describes it as using SQL-like methods; for instance, by is pretty much "group by" in SQL. If you know SQL you can probably do much more, but knowing SQL is not necessary to make use of the package. Finally, it is extremely fast, as each operation is not only parallel, but data.table also grabs only the data needed for the calculation. subset, by contrast, keeps the levels of the whole data frame and drags it through memory.
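
For instance, a sketch of the "three statements and a cbind" pattern applied to the question's data (the sd of Var1 is just a third illustrative statistic, and the column names are mine):

library(data.table)
dt <- data.table(data.mat)

res.cor12 <- dt[, list(cor12 = cor(Var1, Var2, use="pairwise.complete.obs")), by="ID"]
res.cor13 <- dt[, list(cor13 = cor(Var1, Var3, use="pairwise.complete.obs")), by="ID"]
res.sd1   <- dt[, list(sd1   = sd(Var1, na.rm=TRUE)),                         by="ID"]

# the groups come back in the same order from each call, so a cbind lines them up
results <- cbind(res.cor12, cor13 = res.cor13$cor13, sd1 = res.sd1$sd1)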