I'm trying to build a binary classification decision tree out of huge (i.e. which cannot be stored in memory) datasets using MATLAB. Essentially, what I'm doing is:
- Collect all the data
- Try out n decision functions on the data
- Pick out the best decision function to separate the classes within the data
- Split the original dataset into 2
- Recurse on the splits
The data has k attributes and a classification, so it is stored as a matrix with a huge number of rows, and k+1 columns. The decision functions are boolean and act on the attributes assigning each row to the left or right subtree.
Right now I'm considering storing the data on files in chunks which can be held in memory and assigning an ID to each row so the decision to split is made by reading all the files sequentially and the future splits are identified by the ID numbers.
Does anyone know how to do this in a better fashion?
EDIT: The number of rows m is around 5e8 and k is around 500