I'm trying to build a binary classification decision tree from huge datasets (i.e. ones that cannot be held in memory) using MATLAB. Essentially, what I'm doing is:

  1. Collect all the data
  2. Try out n decision functions on the data
  3. Pick out the best decision function to separate the classes within the data
  4. Split the original dataset into 2
  5. Recurse on the splits
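The steps above can be sketched roughly as follows (Python here purely for illustration, since the logic is language-agnostic; the impurity measure, the `min_rows` stopping rule, and the dict-based tree representation are all my assumptions, not anything specified in the question):

```python
def build_tree(data, decision_funcs, min_rows=1000):
    """Recursively build a binary classification tree.

    data           -- rows of (attributes..., label)
    decision_funcs -- candidate boolean functions over the attributes
    min_rows       -- assumed stopping threshold (not from the question)
    """
    labels = [row[-1] for row in data]
    if len(set(labels)) == 1 or len(data) < min_rows:
        # Pure or small node: stop and predict the majority label.
        return {"leaf": max(set(labels), key=labels.count)}

    # Steps 2-3: score every candidate function, keep the best separator.
    def score(f):
        # Hypothetical impurity: misclassified rows after the split.
        left = [r[-1] for r in data if f(r[:-1])]
        right = [r[-1] for r in data if not f(r[:-1])]
        def errors(part):
            return len(part) - max(part.count(c) for c in set(part)) if part else 0
        return errors(left) + errors(right)

    best = min(decision_funcs, key=score)

    # Step 4: split the dataset in two.
    left = [r for r in data if best(r[:-1])]
    right = [r for r in data if not best(r[:-1])]
    if not left or not right:
        return {"leaf": max(set(labels), key=labels.count)}

    # Step 5: recurse on the splits.
    return {"split": best,
            "left": build_tree(left, decision_funcs, min_rows),
            "right": build_tree(right, decision_funcs, min_rows)}
```

This in-memory version is only the shape of the algorithm; the whole difficulty in the question is that `data` does not fit in memory, which is what the chunking discussion below addresses.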

The data has k attributes and a class label, so it is stored as a matrix with a huge number of rows and k+1 columns. The decision functions are boolean and act on the attributes, assigning each row to the left or right subtree.

Right now I'm considering storing the data in files as chunks that can be held in memory, and assigning an ID to each row, so that each split decision is made by reading all the files sequentially and future splits are identified by the ID numbers.
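That scheme might look something like this (Python for illustration; the CSV chunk layout `id, attr_1..attr_k, label`, the misclassification impurity, and the helper names are all hypothetical assumptions):

```python
import csv
import os

def iter_chunks(chunk_dir):
    """Yield (row_id, attributes, label) by reading every chunk file
    sequentially, so at most one chunk's worth of I/O is in flight."""
    for name in sorted(os.listdir(chunk_dir)):
        with open(os.path.join(chunk_dir, name)) as f:
            for row in csv.reader(f):
                # Assumed layout: id, attr_1..attr_k, label
                yield int(row[0]), [float(x) for x in row[1:-1]], row[-1]

def choose_split(chunk_dir, decision_funcs, node_ids):
    """Pick a split for one tree node. Only rows whose ID belongs to the
    node are considered; returns the best function plus the two child
    ID sets, so the node's data never has to fit in memory at once."""
    # First pass: tally label counts per (function, branch).
    counts = [({}, {}) for _ in decision_funcs]
    for row_id, attrs, label in iter_chunks(chunk_dir):
        if row_id not in node_ids:
            continue
        for i, f in enumerate(decision_funcs):
            side = counts[i][0] if f(attrs) else counts[i][1]
            side[label] = side.get(label, 0) + 1

    def impurity(left, right):
        # Misclassification count: rows outside each side's majority class.
        def err(c):
            return sum(c.values()) - max(c.values()) if c else 0
        return err(left) + err(right)

    best_i = min(range(len(decision_funcs)),
                 key=lambda i: impurity(*counts[i]))
    best = decision_funcs[best_i]

    # Second pass: materialize the child ID sets for the future splits.
    left_ids, right_ids = set(), set()
    for row_id, attrs, _ in iter_chunks(chunk_dir):
        if row_id in node_ids:
            (left_ids if best(attrs) else right_ids).add(row_id)
    return best, left_ids, right_ids
```

Note the cost this design pays: every node at every depth re-reads all the chunk files, and the ID sets themselves grow large near the root. That is exactly the overhead the answer below avoids.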

Does anyone know how to do this in a better fashion?

EDIT: The number of rows m is around 5e8 and k is around 500
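At that scale the matrix cannot come close to fitting in memory. Assuming MATLAB's default double precision (8 bytes per element, a standard MATLAB fact, not something stated in the question), the raw size works out to roughly:

```python
rows = 5e8            # m ~ 5e8 rows
cols = 500 + 1        # k attributes plus the class label
bytes_per_double = 8  # MATLAB stores numeric arrays as doubles by default
total_tb = rows * cols * bytes_per_double / 1e12
print(total_tb)  # about 2 TB
```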

+1  A: 

At each split, you are breaking the dataset into smaller and smaller subsets. Start with the single data file. Open it as a stream and process one row at a time to figure out which attribute you want to split on. Once you have your first decision function, split the original data file into two smaller data files that each hold one branch of the split data. Recurse. The data files should become smaller and smaller until you can load them in memory. That way, you don't have to tag rows and keep jumping around in a huge data file.
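The streaming split described above can be sketched as follows (Python for illustration; the comma-separated `attr_1..attr_k,label` row format and the function/file names are assumptions):

```python
def split_file(in_path, decision, left_path, right_path):
    """Stream the parent node's file row by row, writing each row to the
    left or right child file, so only one row is in memory at a time.
    Assumed row format: comma-separated attr_1..attr_k,label."""
    with open(in_path) as src, \
         open(left_path, "w") as left, \
         open(right_path, "w") as right:
        for line in src:
            attrs = [float(x) for x in line.split(",")[:-1]]
            (left if decision(attrs) else right).write(line)
```

Each recursion level then reads only its own node's file once, rather than re-scanning the entire dataset, and no per-row IDs are needed.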

Donnie DeBoer
+1 - thanks, this sounds pretty good!
Jacob
