I have data which is, in theory, a list, but which has historically been input by the user as a free-form text field. Now I need to separate each item of the list so that each element can be analysed.

Simplified examples of my data as input by users:

one, two, three, four, five 

one. two. three, four. five.

"I start with one, then do two, maybe three and four then five"

one  
two  
three  
four  
five.  

one, two. three four five

one two three four - five

"not even a list, no list-elements here! but list item separators may appear. grrr"

So, that's more or less what the data looks like. In reality a list item could be several words long. I need to process these lists (of which there are thousands) so that I end up with arrays like this:

array[0] = "one"  
array[1] = "two"  
array[n] = n

I accept that sometimes my algorithm will completely fail to parse the list; I don't need a 100% success rate, and 75% would be good. False positives are going to be very expensive for me, so I would rather reject a list completely than generate a list that does not contain real data: assume that some users type in meaningless gibberish.

I have some ideas around identifying which separator(s) are being used and how regularly the data is separated relative to the length of the content.
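
For example, the separator-regularity idea might look something like this rough Python sketch; the candidate separators and the words-per-item cut-off of 3 are just guesses on my part:

import re

# candidate separators to try; just a first guess at the common ones
CANDIDATE_SEPARATORS = [",", ".", ";", "\n", " - "]

def guess_list_items(text):
    """Return (separator, items) if the text looks list-like, else None."""
    best = None
    total_words = len(re.findall(r"\w+", text))
    for sep in CANDIDATE_SEPARATORS:
        items = [part.strip() for part in text.split(sep) if part.strip()]
        if len(items) < 3:
            continue  # too few pieces to call it a list
        # words per item close to 1 means one word per separator (very list-like);
        # a large value means long prose between separators (probably a sentence)
        if total_words / float(len(items)) > 3.0:
            continue
        if best is None or len(items) > len(best[1]):
            best = (sep, items)
    return best

print(guess_list_items("one, two, three, four, five"))
# (',', ['one', 'two', 'three', 'four', 'five'])
print(guess_list_items("not even a list, no list-elements here! but list item separators may appear. grrr"))
# None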

I prefer Java or Python, however any solution would be welcome :-)

A: 

In Java, a string tokenizer will do this (i.e. StringTokenizer(inputString, delimiterList)):

import java.util.StringTokenizer;

// split on any of the characters in the delimiter list: space, '|' and '-'
StringTokenizer st = new StringTokenizer( "A B|C-D", " |-" );
while ( st.hasMoreTokens() ) {
    System.out.println( st.nextToken() );
}

prints

A
B
C
D

Steve B.
A: 

The following will "parse" your input string into sequences of "word" characters separated by non-word characters.

String input = ...
String[] parts = input.split("[^\\w]+");

I don't know how you are going to distinguish a list from gibberish. I think you will need to explain your problem domain some more ...

EDIT: If you cannot define the rules which you (as a human) use to distinguish a list from gibberish, then this problem is essentially unsolvable. Computers cannot do magic you know ...

Maybe you should just use the program to deal with the subset that are "definitely" lists, and classify the other ones by hand.

Stephen C
I don't need to just split the text into tokens. Some of the tokens will be part of the list; indeed, sometimes the text will not even be a list, just plain text, and I need to eliminate these ones. This is easy for a human to do: we can look at the text and say "yes, that's a list of data", or "no, that user just typed a sentence, but fragments of the sentence may appear to look like a list".
zuki
...though be aware that `\w` is only suitable for matching English text.
McDowell
+1  A: 

I don't know if I understand your problem. If you want to extract alphanumeric strings from a messed-up string in Python, it would be:

>>> import re
>>> re.split(r'\W+','abaa, asodf ?. poasid - paosfi sec')
['abaa', 'asodf', 'poasid', 'paosfi', 'sec']

Or if you know the separators:

>>> re.split(r'[,. -]+','abaa, asodf, poasid - paosfi sec')
['abaa', 'asodf', 'poasid', 'paosfi', 'sec']
qba
The String class in Java supports `split()`, so you can use this code almost 1:1.
Aaron Digulla
This is what I would have suggested (if you hadn't done so already), although it has the problem of not really rejecting anything, which the asker needs.
TokenMacGuy
+2  A: 

If you can't define your data ("The words can be anything, there is no way to know beforehand a dictionary of what any individual list can contain. They will not be only numbers... it could be a list of anything") then you have serious problems.

Specifically, if you can't define your data, your problem cannot be solved.

You can try playing with nltk.

You may be able to discard the "noise words" (",", ".", "i", "start", "with", "then", "do", etc.). What's left may be the undefinable "words can be anything" remainder.
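
For instance, a rough noise-word pass with nltk might look like the sketch below; the standard stop-word list plus a couple of hand-picked extra noise words is an assumption, not a tested recipe:

import nltk
from nltk.corpus import stopwords

# nltk.download('punkt')       # one-off downloads, uncomment on first run
# nltk.download('stopwords')

# standard English stop words plus a couple of guesses at domain noise words
NOISE = set(stopwords.words('english')) | {"start", "maybe"}

def content_words(text):
    """Drop punctuation tokens and noise words; whatever survives is a
    candidate list item (or the undefinable 'anything')."""
    tokens = nltk.word_tokenize(text.lower())
    return [t for t in tokens if t.isalnum() and t not in NOISE]

print(content_words('"I start with one, then do two, maybe three and four then five"'))
# roughly ['one', 'two', 'three', 'four', 'five'], depending on the stop-word list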

Until you can better define your data, you're probably doomed to a lot of struggle.

S.Lott
+3  A: 

The first step to solving this problem is to analyze, in detail, how it is that humans solve this problem. I'd break the problem down into two parts.

  1. How do humans distinguish between lists and non-lists? For example, is it because non-lists are grammatical English sentences? If that's the case, you may be able to use one of the available natural language processing toolkits to distinguish between lists and non-lists.

  2. How do humans identify the delimiters and list elements in lists? For example, do they recognize the list elements because of some particular domain knowledge? Or do they just recognize one of a small set of common delimiters? Are list elements always single words? If not, in what cases are they multiple words?

I'd also take a good look at several hundred samples, to see if there are any COMMON patterns that can be easily identified and parsed. If, for example, 30% of your entries are simple comma-separated lists then a regular expression will trivially identify and parse them. Perhaps a small set of regular expressions will address a large part of your corpus.
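
For instance, something along the lines of the sketch below might cover the plain comma-separated case; the pattern is only illustrative and deliberately errs on the side of rejecting:

import re

# accept only inputs that are, end to end, comma-separated items
# (items may be several words long); everything else is rejected outright
COMMA_LIST = re.compile(r'^\s*[\w ]+(\s*,\s*[\w ]+)+\s*$')

def parse_comma_list(text):
    if not COMMA_LIST.match(text):
        return None   # reject rather than risk a false positive
    return [item.strip() for item in text.split(',')]

print(parse_comma_list("one, two, three, four, five"))
# ['one', 'two', 'three', 'four', 'five']
print(parse_comma_list("not even a list, no list-elements here! but list item separators may appear. grrr"))
# None -- the '!', '.' and '-' keep the pattern from matching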

Finally, I assume that currently the data is not only entered and recognized by humans but also consumed by humans. Is your reason for breaking items into lists so that humans can be removed from the loop, or just to make the work easier for them? If the latter, I'd recommend providing them with BOTH the broken-up list elements and, as a backup, the originally-entered text. In other words, hedge your bets in case you get it wrong.

swillden
+1 Until you can express your rules in plain English, it will be impossible to express them in code.
Tom Leys
@swillden: I like your comments, you have understood my problem. 1. How do humans distinguish between lists and non-lists? I don't know the answer to this. Maybe it is to do with the ratio of separators to content? Lists are a pattern of content and separators. If that pattern is broken and there is too much content then it's not a list? if (count(words) / count(separators)) == 1 then // same number of separators as words, surely a list... as this value increases, it's less likely to be a list. Does this sound promising?
zuki
I can't tell you how people do it. I think your best bet is to look at a bunch of examples yourself and take note of how YOU distinguish. And try not to limit yourself to things you think you could calculate. Figure out the human process first and then start looking for pieces of it that can be automated. It might be a good idea to talk to some of the people who use this data and get their input, keeping in mind that people often have a hard time articulating how they do what they do.
swillden
My answer (using a Bayesian filter) revolves around asking a computer to automate this human-process analysis.
Tom Leys
A: 

Either you know your dictionary of words or you have a precedence order for list delimiters. Otherwise the problem is too ill-defined for a computer to handle.

I suppose your precedence could be commas, dots, hyphens, spaces. So, this means that you split by commas in preference to splitting by dots and so on.

Alternatively you could keep on splitting by each successive delimiter, until you hit a delimiter that isn't in the text.
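
A minimal sketch of that precedence idea (the delimiter order below is an assumption):

# try delimiters in order of preference and split on the first one present
PRECEDENCE = [",", ".", "-", " "]

def split_by_precedence(text):
    for delim in PRECEDENCE:
        if delim in text:
            return [part.strip() for part in text.split(delim) if part.strip()]
    return [text.strip()]

print(split_by_precedence("one, two, three, four, five"))
# ['one', 'two', 'three', 'four', 'five']
print(split_by_precedence("one two three four - five"))
# ['one two three four', 'five']  (no comma or dot present, so the hyphen wins)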

Jonathan Swift
A: 

I'm not sure what the best answer really is, but if you need to have few false positives, then perhaps what you should do is define a few patterns that are very likely to be lists, and strictly reject every other datum.

import re

patterns = [
    re.compile(r'^\s*(\w+)(\s*,\s*(\w+))*\s*$'),               # "one, two, three"
    re.compile(r'^\s*(\w+)(\s*\.\s*(\w+))*\s*$'),              # "one. two. three"
    re.compile(r'^\s*(\w+)(\s*,\s*(\w+))*\s+and\s+(\w+)\s*$'), # "one, two and three"
]
acceptSet = [line for line in candidateSet if
             any(pattern.match(line) for pattern in patterns)]
TokenMacGuy
this is pragmatic and may eventually end up being my dumb but implementable solution
zuki
+1  A: 

Rather than focusing on the code, how about the method? Building a little on what swillden said...

If your lists are consumed by human users, you could ask them to correct you when you make a mistake (this correction is visible either to the person entering the text or to a later user viewing the text). If a given input looks a lot like a list but not enough to be sure, you show them the list and the raw input and ask them to choose.

To automatically categorise inputs as lists or as text you could create several metrics to base your decision on:

  • Given the separators (e.g. [' ', '\t', ',', '.', 'and']), how many does this phrase use? Expect one or two. Which ones?
  • Is the input composed of fragments (use some sort of grammar system)? Fragments tend to indicate lists.
  • Does this input field (or context in your input) tend to contain list items?
  • The words in the list itself (some words might turn out to always mean a sentence or a list in your domain).

You then pass this information into a Bayesian filter and train it using your users' suggestions. Most of the items I mention would boil down to special "keywords" that you tag an item with before you pass it into the filter. If the filter has a clear answer either way, treat it as a list or a string. If the filter is uncertain, ask the user and use their answer to train the filter.
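
A toy sketch of that loop using nltk's naive Bayes classifier is below; the features and the hand-labelled training examples are made up purely to show the shape of the code:

import re
import nltk

def features(text):
    """Turn one raw input into the kind of 'keyword' features described above.
    The exact features are guesses; the point is that each metric becomes a key."""
    words = re.findall(r"\w+", text)
    commas = text.count(",")
    return {
        "commas_per_word": round(commas / float(max(len(words), 1)), 1),
        "has_and": " and " in text,
        "ends_with_period": text.rstrip().endswith("."),
        "length_bucket": min(len(words) // 5, 4),
    }

# a handful of hand-labelled examples; in practice these come from user corrections
training = [
    ("one, two, three, four, five", "list"),
    ("one. two. three, four. five.", "list"),
    ("I start with one, then do two, maybe three and four then five", "text"),
    ("not even a list, no list-elements here! but list item separators may appear. grrr", "text"),
]

classifier = nltk.NaiveBayesClassifier.train([(features(t), label) for t, label in training])

dist = classifier.prob_classify(features("apples, oranges, pears"))
print("%s %.2f" % (dist.max(), dist.prob(dist.max())))
# when the winning probability is low, fall back to asking a human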

Edit

You could always train the system manually (i.e. without exposing your system to the users) by first classifying lists using your existing scripts and then checking them by hand. Take a list of 500 inputs, run a filter looking for "," or other easy list shapes and classify those as lists. Train the Bayesian filter on those (with everything else as non-list) and then check the output by hand for all 500 for further training.

Each day someone could receive an email with all the edge cases for that day and could click links in the email to correct the system if necessary.

As a side issue (relating to the OP's comment), in general Bayesian filters are much easier to implement, debug, test, analyse and scale than neural networks.

Tom Leys
Good suggestions. Unfortunately I can't modify the original input behaviour. You are thinking along the right lines. What is the distribution of the separator within the text - I think that could be important. A lot of users seem to end their input with "more" type words, even when their input contains an otherwise nicely formatted list. A Bayesian filter could be an option...using a neural network has crossed my mind.
zuki
Thanks, you don't have to ask all users to classify the input, just your admins. Also as the system gets more trained it will become pretty accurate by itself - minimising human training. Note my changes to the post for more.
Tom Leys
A: 

A pyparsing stab at sifting the wheat from the chaff...

rawdata = """\
one, two, three, four, five
one. two. three, four. five.
"I start with one, then do two, maybe three and four then five"
one  
two  
three  
four  
five.  
one, two. three four five
one two three four - five
"not even a list, no list-elements here! but list item separators may appear. grrr"
a dog with a bone is a beautiful twosome""".splitlines()

from pyparsing import oneOf, WordStart, CharsNotIn, alphas, LineEnd

# a known list word, bounded so that 'one' inside 'bone' or 'twosome' won't match
options = (WordStart() + oneOf("one two three four five") + (CharsNotIn(alphas)|LineEnd()))

for userinput in rawdata:
    print userinput
    print [opt[0] for opt in options.searchString(userinput)]
    print

Prints (note the added line with hidden 'one' and 'two' substrings that would not be desirable):

one, two, three, four, five
['one', 'two', 'three', 'four', 'five']

one. two. three, four. five.
['one', 'two', 'three', 'four', 'five']

"I start with one, then do two, maybe three and four then five"
['one', 'two', 'three', 'four', 'five']

one  
['one']

two  
['two']

three  
['three']

four  
['four']

five.  
['five']

one, two. three four five
['one', 'two', 'three', 'four', 'five']

one two three four - five
['one', 'two', 'three', 'four', 'five']

"not even a list, no list-elements here! but list item separators may appear. grrr"
[]

a dog with a bone is a beautiful twosome
[]
Paul McGuire
hmmm nice job Paul, however I am not going to know the words that make up the list items.
zuki