views: 278
answers: 6

I have many strings, all of which contain only letters. The words are not separated from each other by spaces. Some of the characters form English words and the rest is just gibberish. A string may not contain a whole sentence.

I need to find out which of them are valid English text, meaning that the string could be built by concatenating correctly spelled English words. I know I could do something with a word list, but since the words are not separated from each other, testing every possible word combination could be very time-consuming.

I am looking for a high-performance algorithm or method that checks whether a string is built of English words. Even something that only gives me the probability that a string contains English text would help.

Do you know a method or algorithm that could help? Would something like Sphinx be useful here?

+2  A: 

This is called the segmentation problem.

There is no trivial way to solve this. What I can suggest, based on my guess at your knowledge level, is to build a trie out of your dictionary and, as soon as you detect a possible word, tentatively assume that it is a word.

If you later find that the remainder of the string is gibberish, backtrack to the last point where you decided a sequence of letters was a word, and discard that choice.
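
Here is a minimal sketch of that backtracking idea in Python, using memoized recursion and a plain set standing in for the trie; the tiny hard-coded dictionary is only a placeholder for a real word list:

    from functools import lru_cache

    WORDS = {"well", "written", "english", "words"}  # stand-in for a real word list

    def segmentable(s, words=WORDS):
        """Return True if s can be split entirely into dictionary words."""
        @lru_cache(maxsize=None)
        def from_pos(i):
            if i == len(s):
                return True
            # Try every word starting at position i; backtrack if the rest fails.
            return any(s[i:j] in words and from_pos(j)
                       for j in range(i + 1, len(s) + 1))
        return from_pos(0)

    print(segmentable("wellwrittenenglishwords"))  # True
    print(segmentable("wxllqritten"))              # False

The memoization keeps the backtracking from re-examining the same suffix twice, which is what makes "try every combination" tolerable in practice.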

Unknown
A: 

Sphinx probably won't help you. Try the Rabin-Karp algorithm. It is awful for standard single-pattern search but should work well for this particular problem. Basically, you'll want a dictionary of English words and search the text against it. An overly large dictionary will still be pretty slow, but if you use a small dictionary of common words and fall back to the big one only when the common words don't match, you probably still won't get too many false negatives.
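
A rough sketch of how Rabin-Karp could be applied here, assuming a tiny hypothetical dictionary in which every word happens to have the same length (a real version would repeat the scan per word length, or per dictionary as suggested above):

    def rabin_karp_hits(text, words, base=256, mod=10**9 + 7):
        """Report (position, word) pairs where a dictionary word occurs in text."""
        k = len(next(iter(words)))            # assumes all words share one length

        def h(s):
            v = 0
            for ch in s:
                v = (v * base + ord(ch)) % mod
            return v

        targets = {}
        for w in words:
            targets.setdefault(h(w), set()).add(w)
        if len(text) < k:
            return []
        high = pow(base, k - 1, mod)          # weight of the outgoing character
        hits, cur = [], h(text[:k])
        for i in range(len(text) - k + 1):
            if cur in targets and text[i:i + k] in targets[cur]:
                hits.append((i, text[i:i + k]))
            if i + k < len(text):             # roll the hash to the next window
                cur = ((cur - ord(text[i]) * high) * base + ord(text[i + k])) % mod
        return hits

    print(rabin_karp_hits("thecatsatonthemat", {"the", "cat", "mat"}))
    # [(0, 'the'), (3, 'cat'), (11, 'the'), (14, 'mat')]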

Brian
A: 

Why not store your word list in a trie? Then iterate through the input looking for words that match in the trie; this can be done very efficiently. If you find one, advance to the end of the word and continue.
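
A small sketch of that idea: matches_at reports every dictionary word that starts at a given position, and the "advance to the end and continue" loop would sit on top of it (with the backtracking from the earlier answer for when a greedy choice dead-ends):

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.is_word = False

    def build_trie(words):
        root = TrieNode()
        for w in words:
            node = root
            for ch in w:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True
        return root

    def matches_at(root, text, start):
        """Yield end indices of dictionary words beginning at text[start]."""
        node = root
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                return
            if node.is_word:
                yield i + 1

    trie = build_trie({"the", "cat", "cats", "at"})
    print(list(matches_at(trie, "thecatsat", 0)))  # [3]    -> "the"
    print(list(matches_at(trie, "thecatsat", 3)))  # [6, 7] -> "cat", "cats"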

1800 INFORMATION
+1  A: 

Check out N-gram language models.

See http://en.wikipedia.org/wiki/N-gram
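
A toy sketch of a character-level n-gram (here bigram) model: train counts on a small English sample, then score a candidate string by its average log-probability. The embedded sample text and the add-one smoothing are only placeholders; real use would train on a large corpus and probably use trigrams:

    import math
    from collections import Counter

    SAMPLE = ("the quick brown fox jumps over the lazy dog and some more plain "
              "english text to give the model a rough idea of common letter pairs")

    def train_bigrams(text):
        letters = "".join(ch for ch in text.lower() if ch.isalpha())
        pairs = Counter(letters[i:i + 2] for i in range(len(letters) - 1))
        singles = Counter(letters)
        return pairs, singles

    PAIRS, SINGLES = train_bigrams(SAMPLE)

    def avg_logprob(s, alpha=1.0):
        """Average log P(next letter | current letter) with add-alpha smoothing."""
        s = "".join(ch for ch in s.lower() if ch.isalpha())
        if len(s) < 2:
            return float("-inf")
        total = 0.0
        for i in range(len(s) - 1):
            num = PAIRS[s[i:i + 2]] + alpha
            den = SINGLES[s[i]] + alpha * 26
            total += math.log(num / den)
        return total / (len(s) - 1)

    print(avg_logprob("thequickbrownfox"))  # closer to zero: more English-like
    print(avg_logprob("zqxjwvkqzxjw"))      # much more negative: looks like noise

A threshold on that score, chosen by looking at known-good and known-bad examples, then separates English-like strings from noise.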

Igor Krivokon
+2  A: 

If your strings are long enough or your gibberish strange enough, letter frequencies (possibly also bigram frequencies, trigram frequencies, etc.) might be sufficient, instead of the more general N-gram models. For example, some browsers use this to guess the code page of a document.
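
A rough sketch of the letter-frequency variant: compare the input's letter distribution with approximate published English letter frequencies and use the summed squared difference as a score. Any decision threshold on that score is something you would have to tune yourself:

    from collections import Counter

    # Approximate English letter frequencies, in percent.
    ENGLISH = {
        'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7, 's': 6.3,
        'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8, 'u': 2.8, 'm': 2.4,
        'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0, 'p': 1.9, 'b': 1.5, 'v': 1.0,
        'k': 0.8, 'j': 0.15, 'x': 0.15, 'q': 0.1, 'z': 0.07,
    }

    def freq_distance(s):
        """Sum of squared differences between observed and expected percentages."""
        letters = [ch for ch in s.lower() if ch.isalpha()]
        if not letters:
            return float("inf")
        counts = Counter(letters)
        n = len(letters)
        return sum((100.0 * counts[ch] / n - pct) ** 2 for ch, pct in ENGLISH.items())

    print(freq_distance("thequickbrownfoxjumpsoverthelazydog"))  # smaller
    print(freq_distance("zzzzqqqqxxxxjjjj"))                     # much larger

Bigram or trigram frequency tables can be compared in exactly the same way, as the answer suggests.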

stephan
This is what I would have suggested. It may be on the naive side, but I would expect this approach to run fairly quickly, be relatively easy to implement, and still provide useful (if suboptimal) results.
John Y
A: 

It depends on what accuracy you want, how efficient you need it to be, and what kind of text you are processing.