ansaurus

Question

General frameworks for preparing training data?

Answer 1

+2 A:

I find myself mostly using the textutils from GNU coreutils and flex for corpus preparation, chaining things together in simple scripts, at least when the preparations i need to make are simple enough for regular expressions and trivial filtering etc.

It is still possible to make things reusable, the general rules also apply here. If you are programming with no regard to best practices and the like and just program procedurally there is IMHO really no wonder that you have to do everything from scratch when starting a new project.

Even though the format requirements will vary a lot there is still many common tasks, ie. tag-stripping, tag-translation, selection, tabulation, some trivial data harvesting such as number of tokens, sentences and the like. Programming these tasks aming for high reusability will pay off, even though it takes longer at first.

johanbev 2010-01-14 18:33:14

Answer 2

+1 A:

I am not aware of any such frameworks--doesn't mean they aren't out there. I prefer to use my own which is just a collection of code snippets i've refined/tweaked/borrowed over time and that i can chain together in various configurations depending on the problem. If you already know python, then i strongly recommend handling all of your data prep in NumPy--as you know, ML data sets tends to be large--thousands of row vectors packed with floats. NumPy is brilliant for that sort of thing. Additionally, I might suggest that for preparing training data for ML, there are a couple of tasks that arise in nearly every such effort and that don't vary a whole lot from one problem to the next. I've give you snippets for these below.

normalization (scaling & mean-centering your data to avoid overweighting. As i'm sure you know, you can scale -1 to 1 or 0 to 1. I usually chose the latter so that i can take advantage of sparsity patterns. In python, using the NumPy library:

import numpy as NP
data = NP.linspace( 1, 12, 12).reshape(4, 3)
data_norm = NP.apply_along_axis( lambda x : (x - float(x.min())) / x.max(), 
                                             0, data )

cross-validation (here's i've set the default argument at '5', so test set is 5%, training set, 95%--putting this in a function makes k-fold much simpler)

def divide_data(data, testset_size=5) :
  max_ndx_val = data.shape[0] -1
  ndx2 = NP.random.random_integers(0, max_ndx_val, testset_size)
  TE = data_rows[ndx2]
  TR = NP.delete(data, ndx2, axis=0)
  return TR, TE

Lastly, here's an excellent case study (IMHO), both clear and complete, showing literally the entire process from collection of the raw data through input to the ML algorithm (a MLP in this case). They also provide their code.

doug 2010-01-15 01:00:52

ansaurus

tags:

views:

answers:

General frameworks for preparing training data?

related questions