ansaurus

Question

Numpy: How to split/partition a dataset (array) into training and test datasets for, e.g., cross validation?

Answer 1

+1 A:

If you want to divide the data set once into two parts (an 80/20 split in the examples below), you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices:

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)  # shuffles the rows of x in place
training, test = x[:80,:], x[80:,:]  # first 80 rows for training, last 20 for testing

or

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]
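If the split has to be reproducible, the permutation approach can be wrapped in a small helper that takes a seed. This is only a sketch: the name train_test_split, the test_fraction parameter and the seed parameter are made up for illustration, and numpy.random.default_rng assumes NumPy 1.17 or later.

```python
import numpy

def train_test_split(x, test_fraction=0.2, seed=None):
    """Return (training, test) row subsets of x. Hypothetical helper."""
    rng = numpy.random.default_rng(seed)  # assumes NumPy 1.17 or later
    indices = rng.permutation(x.shape[0])
    n_test = int(round(x.shape[0] * test_fraction))
    test_idx, training_idx = indices[:n_test], indices[n_test:]
    return x[training_idx], x[test_idx]

# x is your dataset; arange is used here so every row is distinct
x = numpy.arange(500).reshape(100, 5)
training, test = train_test_split(x, test_fraction=0.2, seed=0)
print(training.shape, test.shape)  # (80, 5) (20, 5)
```

Passing the same seed again gives the same split, which helps when comparing models on identical data.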

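There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset with replacement. Note that the training and test indices below are drawn independently, so the same row can appear more than once and can end up in both sets: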

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx,:], x[test_idx,:]
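If the overlap between the two sets is a problem, one common variant is to draw only the training set with replacement and test on the rows that were never drawn (the "out-of-bag" rows). This is a swapped-in technique, not what the snippet above does, and the test set size then varies from draw to draw:

```python
import numpy

# x is your dataset
x = numpy.random.rand(100, 5)
n = x.shape[0]

# Bootstrap sample: n row indices drawn with replacement.
training_idx = numpy.random.randint(n, size=n)

# Out-of-bag rows: indices that were never drawn for training.
test_idx = numpy.setdiff1d(numpy.arange(n), training_idx)

training, test = x[training_idx, :], x[test_idx, :]
```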

Finally, scikits.learn (since renamed scikit-learn and imported as sklearn) contains several cross validation methods (k-fold, leave-n-out, stratified-k-fold, ...). For the docs you might need to look at the examples or the latest git repository, but the code looks solid.
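For reference, plain k-fold can also be sketched with numpy alone. This is not the scikits.learn implementation; the generator name k_fold_indices and its parameters are hypothetical, and numpy.random.default_rng again assumes NumPy 1.17 or later.

```python
import numpy

def k_fold_indices(n_samples, k, seed=None):
    """Yield (training_idx, test_idx) pairs for k-fold cross validation.

    Hypothetical helper; expects k >= 2.
    """
    rng = numpy.random.default_rng(seed)
    # Shuffle the row indices once, then cut them into k nearly equal folds.
    folds = numpy.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        # Every fold except the i-th goes into the training set.
        training_idx = numpy.concatenate(folds[:i] + folds[i + 1:])
        yield training_idx, test_idx

# x is your dataset
x = numpy.random.rand(100, 5)
for training_idx, test_idx in k_fold_indices(x.shape[0], k=5, seed=0):
    training, test = x[training_idx, :], x[test_idx, :]
    print(training.shape, test.shape)  # (80, 5) (20, 5) on each of the 5 rounds
```

Each row lands in the test set exactly once across the k rounds.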

pberkes 2010-09-09 14:00:59

Answer 2

A:

I wrote a function for my own project to do this (it doesn't use numpy, though):

def partition(seq, chunks):
    """Splits the sequence into roughly equal sized chunks and returns them as a list"""
    result = []
    for i in range(chunks):
        chunk = []
        for element in seq[i:len(seq):chunks]:
            chunk.append(element)
        result.append(chunk)
    return result

If you want the chunks to be randomized, just shuffle the list before passing it in.
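As a rough sketch of how the shuffled chunks could serve as cross validation folds (the names data, folds, training and test are only illustrative, and partition is rewritten here as a list comprehension with the same striding scheme so the snippet runs on its own):

```python
import random

def partition(seq, chunks):
    """Same striding scheme as the function above, written compactly."""
    return [list(seq[i::chunks]) for i in range(chunks)]

data = list(range(10))
random.shuffle(data)  # shuffle in place before partitioning
folds = partition(data, 5)

# Use each chunk once as the test set and the remaining chunks for training.
for i, test in enumerate(folds):
    training = [item for j, fold in enumerate(folds) if j != i for item in fold]
    print(len(training), len(test))  # 8 2 on each of the 5 rounds
```

When the length of the sequence is not a multiple of chunks, the first few chunks get one extra element.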

Colin 2010-09-09 18:23:16
