ansaurus

Question

Justadistraction: tokenizing English without whitespaces. Murakami SheepMan

Answer 1

+3 A:

I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hash table (for O(1) lookup times) and went letter by letter, matching whole words. It works well, but there are numerous ambiguities (asshit can be ass hit or as shit). Resolving those ambiguities would require much more sophisticated grammar analysis.
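
As a rough illustration of the dictionary approach and the ambiguity it runs into, here is a minimal sketch that enumerates every way to split a string into dictionary words. The word list `WORDS` and the function name `all_segmentations` are made up for this example; a real version would load a full English word list.

```python
from functools import lru_cache

# Toy dictionary for illustration only; a real one would hold a full English word list.
WORDS = frozenset({"hi", "hit", "here", "there", "to", "get", "her", "together"})

def all_segmentations(text, words=WORDS):
    """Return every way to split text into dictionary words."""
    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return [[]]  # exactly one way to split the empty remainder
        results = []
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            if candidate in words:  # set membership is O(1) on average
                for rest in split_from(end):
                    results.append([candidate] + rest)
        return results
    return split_from(0)

print(all_segmentations("hithere"))
print(all_segmentations("together"))
```

Any input that yields more than one split is ambiguous, and picking between the candidates is where grammar or statistics would have to come in.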

JoshD 2010-10-03 22:12:20

You might try a radix tree for letter by letter lookups. It would also allow you to easily tell if there were longer potential matches.
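
A sketch of that idea, using a plain (uncompressed) trie rather than a true radix tree to keep it short. The names `TrieNode`, `build_trie` and `matches_at` are invented for this example. Each match comes with a flag saying whether the dictionary contains longer words that share the same prefix.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.is_word = False  # True if the path from the root spells a word

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True
    return root

def matches_at(root, text, start):
    """Return (word, can_extend) for each dictionary word beginning at text[start]."""
    found = []
    node = root
    for end in range(start, len(text)):
        node = node.children.get(text[end])
        if node is None:
            break  # no dictionary word continues along this path
        if node.is_word:
            # can_extend: the dictionary holds a longer word with this prefix
            found.append((text[start:end + 1], bool(node.children)))
    return found

trie = build_trie(["hi", "hit", "here", "there"])
print(matches_at(trie, "hithere", 0))
```

The walk stops as soon as no dictionary word can continue, so each position costs at most the length of the longest word.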

Jagerkin 2010-10-03 22:39:25

Oh, wow. I had thought of something like that (a tree with 26 children on each node, was my thought), but my boss said it was a ludicrous idea. I gotta stop listening to him. >:(

JoshD 2010-10-03 23:00:03

+1 I had actually thought about something like this a while ago. I even ran into the same problems: "hi there" == "hithere" == "hit here". "to get her" == "together" == "together".

inspectorG4dget 2010-10-04 00:40:03

Answer 2

+2 A:

First of all, I think you need a dictionary of English words -- you could try some methods that rely solely on some statistical analysis, but I think a dictionary has better chances of good results.

Once you have the words, you have two possible approaches:

You could categorize the words into grammar categories and use a formal grammar to parse the sentences -- obviously, you would sometimes get no match or multiple matches -- I'm not familiar with techniques that would allow you to loosen the grammar rules in case of no match, but I'm sure there must be some.

On the other hand, you could just take some large corpus of English text and compute the relative probabilities of certain words appearing next to each other, giving you a list of pairs and triples of words. Since that data structure would be rather big, you could use word categories (grammatical and/or based on meaning) to simplify it. Then you just build an automaton and choose the most probable transitions between the words.
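
One possible reading of the second approach, as a sketch only: count word pairs in a corpus, then use a dynamic-programming search (standing in here for the automaton) to pick the split with the highest total probability. The corpus below is a tiny made-up sample, the add-one smoothing is a simplification, and the names `log_prob` and `best_segmentation` are hypothetical.

```python
import math
from collections import Counter

# Tiny made-up corpus for illustration; real counts would come from a large body of text.
CORPUS = "we said hi there . we can do what we can . you hit the ball here".split()

unigrams = Counter(CORPUS)
bigrams = Counter(zip(CORPUS, CORPUS[1:]))

def log_prob(prev, word):
    """Add-one smoothed log P(word | prev) from the toy counts."""
    vocab = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))

def best_segmentation(text, max_len=10):
    # table[end] maps the last word of a split of text[:end] to (score, words)
    table = [dict() for _ in range(len(text) + 1)]
    table[0]["<s>"] = (0.0, [])  # "<s>" marks the start of the text
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word not in unigrams:
                continue
            for prev, (score, words) in table[start].items():
                cand = score + log_prob(prev, word)
                if word not in table[end] or cand > table[end][word][0]:
                    table[end][word] = (cand, words + [word])
    finals = list(table[len(text)].values())
    return max(finals, key=lambda item: item[0])[1] if finals else None

print(best_segmentation("hithere"))
```

With these toy counts, "hi there" wins over "hit here" only because the pair (hi, there) appears in the sample corpus. With real data the counts, and possibly the winner, would differ.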

I am sure there are many more possible approaches. You can even combine the two I mentioned, building some kind of grammar with weights attached to its rules. It's a rich field for experimenting.

Radomir Dopieralski 2010-10-03 22:16:19

Answer 3

+1 A:

I don't know if this is of much help to you, but you might be able to make use of this spelling corrector in some way.

inspectorG4dget 2010-10-04 00:41:25

Answer 4

+1 A:

This is just some quick code I wrote out that I think would work fairly well to extract words from a snippet like the one you gave. It's not fully thought out, but I think something along these lines would work if you can't find a pre-packaged solution.

textstring = '''"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."'''

# Placeholder word list; swap in a real English dictionary (a word file or an API lookup)
english_words = {"like", "we", "said", "do", "what", "can"}

def in_english_dict(candidate):
    # returns True if the candidate exists as a dictionary entry
    return candidate.lower() in english_words

teststring = ''
sequential_indiv_word_list = []

for cur_char in textstring:  # iterating over a string yields its individual characters
    teststring = teststring + cur_char
    # test teststring against the English dictionary
    if in_english_dict(teststring):
        sequential_indiv_word_list.append(teststring)
        teststring = ''

# at the end, assemble a sentence from the pieces by putting a space between each word
sentence = ' '.join(sequential_indiv_word_list)

There are some more issues to work out. If the lookup never returns a match, this obviously will not work: it just keeps adding characters and never matches anything. However, since your demo string has some spaces, you could have it recognize those too and automatically start over at each one.

You also need to account for punctuation, with conditionals like:

if cur_char == ',' or cur_char == '.':
    teststring = ''  # start a new "word" automatically
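
Putting those two fixes together, here is one way it might look. This is a sketch under assumptions, not a tested solution: `WORDS` is a toy dictionary, `longest_match_split` and `tokenize` are made-up names, and greedy longest-match is swapped in for the shortest-match loop above so that "we'll" is not cut off at "we".

```python
import re

# Toy dictionary for illustration only
WORDS = {"like", "we", "said", "do", "what", "can", "we'll"}

def longest_match_split(chunk, words=WORDS):
    """Greedy longest-match; an unmatched character is emitted on its own
    so the loop always advances."""
    out, i = [], 0
    while i < len(chunk):
        for j in range(len(chunk), i, -1):
            if chunk[i:j].lower() in words:
                out.append(chunk[i:j])
                i = j
                break
        else:
            out.append(chunk[i])  # no dictionary word starts here
            i += 1
    return out

def tokenize(text):
    tokens = []
    # Runs of letters/apostrophes are word chunks; any other non-space character is punctuation.
    for piece in re.findall(r"[A-Za-z']+|[^\sA-Za-z']", text):
        if piece[0].isalpha() or piece[0] == "'":
            tokens.extend(longest_match_split(piece))
        else:
            tokens.append(piece)
    return tokens

print(' '.join(tokenize("likewesaid, we'lldowhatwecan.")))
```

Existing spaces and punctuation act as hard boundaries, so a failed match can only affect the chunk it occurs in.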

Rick 2010-10-04 00:50:08
