views:

103

answers:

2

As a student of computational linguistics, I frequently do machine learning experiments where I have to prepare training data from all kinds of different resources like raw or annotated text corpora or syntactic tree banks. For every new task and every new experiment I write programs (normally in Python and sometimes Java) to extract the features and values I need and transform the data from one format to the other. This usually results in a very large number of very large files and a very large number of small progams which process them in order to get the input for some machine learning framework (like the arff files for Weka).

One needs to be extremely well organised to deal with that and program with great care not to miss any important peculiarities, exceptions or errors in the tons of data. Many principles of good software design like design patterns or refactoring paradigms are no big use for these tasks because things like security, maintainability or sustainability are of no real importance - once the program successfully processed the data one doesn't need it any longer. This has gone so far that I even stopped bothering about using classes or functions at all in my Python code and program in a simple procedural way. The next experiment will require different data sets with unique characteristics and in a different format so that their preparation will likely have to be programmed from scratch anyway. My experience so far is that it's not unusual to spend 80-90% of a project's time on the task of preparing training data. Hours and days go by only on thinking about how to get from one data format to another. At times, this can become quite frustrating.

Well, you probably guessed that I'm exagerating a bit, on purpose even, but I'm positive you understand what I'm trying to say. My question, actually, is this:

Are there any general frameworks, architectures, best practices for approaching these tasks? How much of the code I write can I expect to be reusable given optimal design?

Thanks!

+2  A: 

I find myself mostly using the textutils from GNU coreutils and flex for corpus preparation, chaining things together in simple scripts, at least when the preparations i need to make are simple enough for regular expressions and trivial filtering etc.

It is still possible to make things reusable, the general rules also apply here. If you are programming with no regard to best practices and the like and just program procedurally there is IMHO really no wonder that you have to do everything from scratch when starting a new project.

Even though the format requirements will vary a lot there is still many common tasks, ie. tag-stripping, tag-translation, selection, tabulation, some trivial data harvesting such as number of tokens, sentences and the like. Programming these tasks aming for high reusability will pay off, even though it takes longer at first.

johanbev
+1  A: 

I am not aware of any such frameworks--doesn't mean they aren't out there. I prefer to use my own which is just a collection of code snippets i've refined/tweaked/borrowed over time and that i can chain together in various configurations depending on the problem. If you already know python, then i strongly recommend handling all of your data prep in NumPy--as you know, ML data sets tends to be large--thousands of row vectors packed with floats. NumPy is brilliant for that sort of thing. Additionally, I might suggest that for preparing training data for ML, there are a couple of tasks that arise in nearly every such effort and that don't vary a whole lot from one problem to the next. I've give you snippets for these below.

normalization (scaling & mean-centering your data to avoid overweighting. As i'm sure you know, you can scale -1 to 1 or 0 to 1. I usually chose the latter so that i can take advantage of sparsity patterns. In python, using the NumPy library:

import numpy as NP
data = NP.linspace( 1, 12, 12).reshape(4, 3)
data_norm = NP.apply_along_axis( lambda x : (x - float(x.min())) / x.max(), 
                                             0, data )

cross-validation (here's i've set the default argument at '5', so test set is 5%, training set, 95%--putting this in a function makes k-fold much simpler)

def divide_data(data, testset_size=5) :
  max_ndx_val = data.shape[0] -1
  ndx2 = NP.random.random_integers(0, max_ndx_val, testset_size)
  TE = data_rows[ndx2]
  TR = NP.delete(data, ndx2, axis=0)
  return TR, TE

Lastly, here's an excellent case study (IMHO), both clear and complete, showing literally the entire process from collection of the raw data through input to the ML algorithm (a MLP in this case). They also provide their code.

doug