I have data which is in theory a list, but has historically been entered by users into a free-form text field. Now I need to separate each item of the list so that each element can be analysed.
Simplified examples of my data as input by users:
one, two, three, four, five
one. two. three, four. five.
"I start with one, then do two, maybe three and four then five"
one
two
three
four
five.
one, two. three four five
one two three four - five
"not even a list, no list-elements here! but list item separators may appear. grrr"
So, that's more or less what the data looks like. In reality a list item could be several words long. I need to process these lists (of which there are thousands) such that I end up with arrays like this:
array[0] = "one"
array[1] = "two"
array[n] = n
I accept that sometimes my algorithm will completely fail to parse a list; I don't need a 100% success rate, and 75% would be good. False positives are going to be very expensive for me, so I would rather reject a list completely than generate a list that does not contain real data (assume some users type in meaningless gibberish).
I have some ideas around trying to identify which separator(s) are being used, and how regularly the data is separated in relation to the size of the content.
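For example, here is a rough sketch of that idea in Python (everything here is hypothetical, including the candidate-separator list and the rejection threshold): count occurrences of each candidate separator, split on the most frequent one, and reject the whole list if any resulting item looks too long to be a real element.

```python
# Candidate separators to test, in priority order (an assumption;
# extend this list based on what the real data shows).
CANDIDATES = [",", ".", ";", "\n", " - "]

def split_list(text, max_item_words=4):
    """Split free-form text on its most frequent candidate separator.

    Returns a list of items, or None to reject the input entirely
    (preferring rejection over false positives).
    """
    best_sep, best_count = None, 0
    for sep in CANDIDATES:
        count = text.count(sep)
        if count > best_count:
            best_sep, best_count = sep, count
    if best_sep is None:
        return None  # no separator at all: reject

    # Split, then strip whitespace and trailing full stops from items.
    items = [item.strip(" .") for item in text.split(best_sep)]
    items = [item for item in items if item]

    # Reject the whole list if any item is suspiciously long;
    # the word-count threshold is a tunable assumption.
    if any(len(item.split()) > max_item_words for item in items):
        return None
    return items
```

This would accept `"one, two, three, four, five"` but reject the "not even a list" example, because splitting it on its most frequent separator still leaves an item far longer than a plausible list element. The threshold would need tuning against real data.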
I prefer Java or Python, however any solution would be welcome :-)