How to efficiently generate random subsets of rows from a matrix

views:

138

answers:

+1 Q:

How to efficiently generate random subsets of rows from a matrix

Hi all,

I have a large matrix M implemented as vector<vector<double> with m rows, i.e. the matrix is a vector of m vectors of n column elements.

I have to create two subsets of the rows of this matrix, i.e. A holds k rows, and B the other m-k rows. The rows must be selected at random.

I do not want to use any libraries other than STL, so no boost either.

Two approaches that I considered are:

generate a std::random_shuffle of row indices, copy the rows indicated by the first k indices to A and the rows indicated by the other m-k to B
do a std::random_shuffle of M. copy k rows to A, and m-k rows to B

Are there other options, and how do the two options above compare in terms of memory consumption and processing time?

Thanks!

Easiest way: use a random whole number generator, and queue up the offsets of each row in a separate container (assuming that a row is of the same offset in each column vector). The container you use will depend more on its eventual use. (Remember to take care of size_t limit, and tying in the offset container's life to the Matrix itself).

Edit: replaced pointers with offsets - makes more sense and is safer.

Orig: Quick Q:is each (inner) vector a row or a column?

i.e. is M a vector of columns or a vector of rows?

Fox 2009-11-19 13:12:00

You should ask this kind of question in a comment to the original question, since this is not an answer.

Luc Touraille 2009-11-19 13:18:11

@Luc: Thanks. Given my reps, I am not yet allowed to comment on other users' replies, so this was an improvized response.

Fox 2009-11-19 13:41:03

Updated the question

andreas buykx 2009-11-19 13:48:05

+1 A:

I think that the random_shuffle of indices makes sense.

If you need to avoid the overhead of copying the individual rows, and don't mind sharing data, you might be able to make the A and B matrices be vectors of pointers to rows in the original matrix.

Nate Kohl 2009-11-19 13:24:29

if I have vectors of pointers, I can nolonger access a matrix element as m[i][j]. I need to be able to use A and B as vector<vector<double>>

andreas buykx 2009-11-19 13:47:35

Actually, if you have a vector of pointers to the first element of each row, then you can access a matrix element as m[i][j]. So you could do `vector<double*> A(N); A[0] = ` and so on. Of course you'd have to preserve the shuffled matrix for as long as A and B need to be valid.

Steve Jessop 2009-11-19 14:19:27

+2 A:

If you don't need B to be in random order, then random_shuffle does more work than you need.

If by "STL" you mean SGI's STL, then use random_sample.

If by "STL" you mean the C++ standard libraries, then you don't have random_sample. You might want to copy the implementation, except stop after the first n steps. This will reduce the time.

Note that these both modify a sequence in place. Depending where you actually want A and B to end up, and who owns the original, this might mean that you end up doing 2 copies of each row - once to get it into a mutable container for the shuffle, then again to get it into its final destination. This is more memory and processing time than is required. To fix this you could maybe swap rows out of the temporary container, and into A and B. Or copy the algorithm, but adapt it to:

Make a list of the indexes of the first vector
Partially shuffle the list of indexes
Copy the rows corresponding to the first n indexes to A, and the rest to B.

I'm not certain this is faster or uses less memory, but I suspect so.

The standard for random_shuffle says that it performs "swaps". I hope that means it's efficient for vectors, but you might want to check that it is actually using an optimised swap, not doing any copying. I think it should mean that, especially since the natural implementation is as Fisher-Yates, but I'm not sure whether the language in the standard should be taken to guarantee it. If it is copying, then your second approach is going to be very slow. If it's using swap then they're roughly comparable. swap on a vector is going to be slightly slower than swap on an index, but there's not a whole lot in it. Swapping either a vector or an index is very quick compared with copying a row, and there are M of each operation, so I doubt it will make a huge difference to total run time.

[Edit: Alex Martelli was complaining recently about misuse of the term "STL" to mean the C++ standard libraries. In this case it does make a difference :-)]

Steve Jessop 2009-11-19 13:31:26

What exactly do you mean by C++ standard libraries? The S(tandard) T(emplate) L(ibraries) are also standard libraries aren't they? In any case I'm looking for an efficient implementation which I presume is likely one that is implemented in terms of STL algorithms, because it seems their implementation may be optimised more than a trivial implementation of the algorithm will be.

andreas buykx 2009-11-19 13:52:07

The STL was a thing invented by SGI, as a proposal for incorporation into the C++ standard. What was actually incorporated into the C++ standard was only a subset of the STL, with some modifications. So for example, `std::random_sample` and `std::itoa` were dropped. But (as Alex was complaining), people carried on calling the C++ standard libraries the "STL", which is technically inaccurate.

Steve Jessop 2009-11-19 13:55:05

"their implementation may be optimised more than a trivial implementation of the algorithm will be". Maybe, but when I say "copy the implementation", I mean open up the relevant header file and copy the source. If it's optimised, you'll see the optimisations. I'm not sure there's a great deal that can be done to optimise random_shuffle, anyway.

Steve Jessop 2009-11-19 14:02:18

ansaurus

tags:

views:

answers:

How to efficiently generate random subsets of rows from a matrix

related questions