I'm trying to build a binary classification decision tree from huge datasets (i.e. ones that cannot be held in memory) using MATLAB. Essentially, what I'm doing is:

  1. Collect all the data
  2. Try out n decision functions on the data
  3. Pick out the best decision function to separate the classes within the data
  4. Split the original dataset into 2
  5. Recurse on the splits
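The steps above can be sketched roughly as follows (Python here purely for illustration, since the logic is language-agnostic; the impurity measure, the `min_rows` stopping rule, and the dict-based tree representation are all my assumptions, not anything specified in the question):

```python
def build_tree(data, decision_funcs, min_rows=1000):
    """Recursively build a binary classification tree.

    data           -- rows of (attributes..., label)
    decision_funcs -- candidate boolean functions over the attributes
    min_rows       -- assumed stopping threshold (not from the question)
    """
    labels = [row[-1] for row in data]
    if len(set(labels)) == 1 or len(data) < min_rows:
        # Pure or small node: stop and predict the majority label.
        return {"leaf": max(set(labels), key=labels.count)}

    # Steps 2-3: score every candidate function, keep the best separator.
    def score(f):
        # Hypothetical impurity: misclassified rows after the split.
        left = [r[-1] for r in data if f(r[:-1])]
        right = [r[-1] for r in data if not f(r[:-1])]
        def errors(part):
            return len(part) - max(part.count(c) for c in set(part)) if part else 0
        return errors(left) + errors(right)

    best = min(decision_funcs, key=score)

    # Step 4: split the dataset in two.
    left = [r for r in data if best(r[:-1])]
    right = [r for r in data if not best(r[:-1])]
    if not left or not right:
        return {"leaf": max(set(labels), key=labels.count)}

    # Step 5: recurse on the splits.
    return {"split": best,
            "left": build_tree(left, decision_funcs, min_rows),
            "right": build_tree(right, decision_funcs, min_rows)}
```

This in-memory version is only the shape of the algorithm; the whole difficulty in the question is that `data` does not fit in memory, which is what the chunking discussion below addresses.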

The data has k attributes and a class label, so it is stored as a matrix with a huge number of rows and k+1 columns. The decision functions are boolean and act on the attributes, assigning each row to the left or right subtree.

Right now I'm considering storing the data in files as chunks that can be held in memory, and assigning an ID to each row, so that each split decision is made by reading all the files sequentially and future splits are identified by the ID numbers.
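That scheme might look something like this (Python for illustration; the CSV chunk layout `id, attr_1..attr_k, label`, the misclassification impurity, and the helper names are all hypothetical assumptions):

```python
import csv
import os

def iter_chunks(chunk_dir):
    """Yield (row_id, attributes, label) by reading every chunk file
    sequentially, so at most one chunk's worth of I/O is in flight."""
    for name in sorted(os.listdir(chunk_dir)):
        with open(os.path.join(chunk_dir, name)) as f:
            for row in csv.reader(f):
                # Assumed layout: id, attr_1..attr_k, label
                yield int(row[0]), [float(x) for x in row[1:-1]], row[-1]

def choose_split(chunk_dir, decision_funcs, node_ids):
    """Pick a split for one tree node. Only rows whose ID belongs to the
    node are considered; returns the best function plus the two child
    ID sets, so the node's data never has to fit in memory at once."""
    # First pass: tally label counts per (function, branch).
    counts = [({}, {}) for _ in decision_funcs]
    for row_id, attrs, label in iter_chunks(chunk_dir):
        if row_id not in node_ids:
            continue
        for i, f in enumerate(decision_funcs):
            side = counts[i][0] if f(attrs) else counts[i][1]
            side[label] = side.get(label, 0) + 1

    def impurity(left, right):
        # Misclassification count: rows outside each side's majority class.
        def err(c):
            return sum(c.values()) - max(c.values()) if c else 0
        return err(left) + err(right)

    best_i = min(range(len(decision_funcs)),
                 key=lambda i: impurity(*counts[i]))
    best = decision_funcs[best_i]

    # Second pass: materialize the child ID sets for the future splits.
    left_ids, right_ids = set(), set()
    for row_id, attrs, _ in iter_chunks(chunk_dir):
        if row_id in node_ids:
            (left_ids if best(attrs) else right_ids).add(row_id)
    return best, left_ids, right_ids
```

Note the cost this design pays: every node at every depth re-reads all the chunk files, and the ID sets themselves grow large near the root. That is exactly the overhead the answer below avoids.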

Does anyone know how to do this in a better fashion?

EDIT: The number of rows m is around 5e8 and k is around 500
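At that scale the matrix cannot come close to fitting in memory. Assuming MATLAB's default double precision (8 bytes per element, a standard MATLAB fact, not something stated in the question), the raw size works out to roughly:

```python
rows = 5e8            # m ~ 5e8 rows
cols = 500 + 1        # k attributes plus the class label
bytes_per_double = 8  # MATLAB stores numeric arrays as doubles by default
total_tb = rows * cols * bytes_per_double / 1e12
print(total_tb)  # about 2 TB
```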

+1  A: 

At each split, you are breaking the dataset into smaller and smaller subsets. Start with the single data file. Open it as a stream and process one row at a time to figure out which attribute you want to split on. Once you have your first decision function, split the original data file into two smaller data files that each hold one branch of the split data. Recurse. The data files should become smaller and smaller until you can load them in memory. That way, you don't have to tag rows and keep jumping around in a huge data file.
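The streaming split described above can be sketched as follows (Python for illustration; the comma-separated `attr_1..attr_k,label` row format and the function/file names are assumptions):

```python
def split_file(in_path, decision, left_path, right_path):
    """Stream the parent node's file row by row, writing each row to the
    left or right child file, so only one row is in memory at a time.
    Assumed row format: comma-separated attr_1..attr_k,label."""
    with open(in_path) as src, \
         open(left_path, "w") as left, \
         open(right_path, "w") as right:
        for line in src:
            attrs = [float(x) for x in line.split(",")[:-1]]
            (left if decision(attrs) else right).write(line)
```

Each recursion level then reads only its own node's file once, rather than re-scanning the entire dataset, and no per-row IDs are needed.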

Donnie DeBoer
+1 - thanks, this sounds pretty good!
Jacob
