What is a good way to split a numpy array randomly into training and testing / validation dataset? Something similar to the cvpartition or crossvalind functions in Matlab.
views:
85answers:
2
+1
A:
If you want to divide the data set once in two halves, you can use numpy.random.shuffle
, or numpy.random.permutation
if you need to keep track of the indices:
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]
or
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]
There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset, with repetition:
import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx,:], x[test_idx,:]
Finally, scikits.learn contains several cross validation methods (k-fold, leave-n-out, stratified-k-fold, ...). For the docs you might need to look at the examples or the latest git repository, but the code looks solid.
pberkes
2010-09-09 14:00:59
A:
I wrote a function for my own project to do this (it doesn't use numpy, though):
def partition(seq, chunks):
"""Splits the sequence into equal sized chunks and them as a list"""
result = []
for i in range(chunks):
chunk = []
for element in seq[i:len(seq):chunks]:
chunk.append(element)
result.append(chunk)
return result
If you want the chunks to be randomized, just shuffle the list before passing it in.
Colin
2010-09-09 18:23:16