In an experimental project I am playing with, I want to be able to look at textual data and detect whether it contains data in a tabular format. Of course there are a lot of cases that could look like tabular data, so I was wondering what sort of algorithm I'd need to research to look for common features.

My first thought was to write a long switch/case statement that checked for data separated by tabs, then another case for data separated by pipe symbols, then yet another case for data separated in some other way, etc. Now of course I realize that I would have to come up with a list of different things to detect, but I wondered if there was a more intelligent way of detecting these features than doing a relatively slow search for each type.
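For what it's worth, here is a minimal Python sketch of the kind of thing I mean: rather than one case per delimiter, score each candidate delimiter by how consistently it splits the lines. The delimiter list and the 0.9 threshold are just assumptions I picked for illustration:

    # Score each candidate delimiter by how consistently it splits the
    # lines, instead of one switch case per delimiter.
    CANDIDATE_DELIMITERS = ["\t", "|", ",", ";"]  # assumed list

    def guess_delimiter(text):
        lines = [ln for ln in text.splitlines() if ln.strip()]
        best, best_score = None, 0.0
        for delim in CANDIDATE_DELIMITERS:
            counts = [ln.count(delim) for ln in lines]
            if not counts or counts[0] == 0:
                continue
            # Fraction of lines whose count matches the first line's.
            score = sum(c == counts[0] for c in counts) / len(counts)
            if score > best_score:
                best, best_score = delim, score
        return best if best_score > 0.9 else None  # 0.9 is an assumption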

I realize this question isn't especially eloquently put, so I hope it makes some sense!

Any ideas?

(no idea how to tag this either - so help there is welcomed!)

+1  A: 

The only reliable scheme would be to use machine learning. You could, for example, train a perceptron classifier on a stack of examples of tabular and non-tabular material.
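For concreteness, here is one way such a classifier might be wired up with scikit-learn. The feature set (per-line delimiter statistics) and the toy training data are assumptions for illustration, not a prescription:

    from statistics import pvariance
    from sklearn.linear_model import Perceptron

    def features(text):
        lines = [ln for ln in text.splitlines() if ln.strip()] or [""]
        tabs = [ln.count("\t") for ln in lines]
        pipes = [ln.count("|") for ln in lines]
        # Mean and variance of delimiter counts: tabular text tends to
        # have a high, consistent count per line.
        return [sum(tabs) / len(lines), pvariance(tabs),
                sum(pipes) / len(lines), pvariance(pipes)]

    # Toy labelled examples for illustration only; a real stack of
    # tabular and non-tabular documents would go here.
    docs = ["a\tb\tc\n1\t2\t3\n4\t5\t6", "Just a plain paragraph of prose."]
    labels = [1, 0]  # 1 = tabular, 0 = non-tabular

    clf = Perceptron().fit([features(d) for d in docs], labels)
    print(clf.predict([features("x\ty\n1\t2")]))  # hopefully [1]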

bmargulies
What would the feature vectors be [that would be examined]? I'm not so sure that the training would be able to find incorrectly formatted rows.
monksy
+1  A: 

A mixed solution might be appropriate: handle the most common/obvious cases with simple heuristics (in the "switch-like" manner you suggested), and leave the harder cases to machine learning and other kinds of classifier logic.
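A rough sketch of that layering, assuming a small set of heuristic delimiters and an optional classifier hand-off (reusing the features() idea from the previous answer):

    def looks_tabular(text, clf=None):
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if not lines:
            return False
        # Fast path: some delimiter occurs a consistent, nonzero number
        # of times on every line.
        for delim in ("\t", "|", ","):
            counts = [ln.count(delim) for ln in lines]
            if counts[0] > 0 and all(c == counts[0] for c in counts):
                return True
        # Hard cases fall through to the classifier, if one was trained.
        if clf is not None:
            return clf.predict([features(text)])[0] == 1
        return False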

mjv
+1  A: 

This assumes that you do not already have defined types stored in the TSV.

A TSV file is typically [Value1]\t[Value..N]\n

My suggestion would be to:

  1. Count up all the tabs
  2. Count up all of the newlines
  3. Count the total tabs in the first row
  4. Divide the total number of tabs by the tabs in the first row

With the result of step 4, if you get a remainder of 0 then you have a TSV candidate. From there you may want to do one of the following (a sketch combining these steps follows the list):

  1. You can continue reading the data and ignore lines with fewer or more tabs than predicted
  2. You can scan each line before reading to make sure all are consistent
  3. You can read up to the line that does not fit the format and then throw an error
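A minimal sketch of steps 1-4 plus option 2 (the function name is illustrative, not from any library):

    def detect_tsv(text):
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if not lines:
            return False
        total_tabs = sum(ln.count("\t") for ln in lines)   # step 1
        # step 2: splitlines() above already counted the rows for us
        first_row_tabs = lines[0].count("\t")              # step 3
        if first_row_tabs == 0:
            return False
        quotient, remainder = divmod(total_tabs, first_row_tabs)  # step 4
        if remainder != 0 or quotient != len(lines):
            return False
        # Option 2: scan each line to make sure all are consistent.
        return all(ln.count("\t") == first_row_tabs for ln in lines)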

Once you have a good prediction of the number of tab-separated values per line, you can use a regular expression to parse out the values [as a group].
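For example, a pattern with one capture group per expected column might look like this (parse_tsv_line is a made-up name; n comes from the detection pass above):

    import re

    def parse_tsv_line(line, n):
        # n values separated by n - 1 tabs, each value as its own group
        pattern = "^" + r"([^\t]*)" + (n - 1) * r"\t([^\t]*)" + "$"
        match = re.match(pattern, line)
        return match.groups() if match else None

    print(parse_tsv_line("a\tb\tc", 3))  # ('a', 'b', 'c')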

monksy