ansaurus

Question

Counting English words in a random string

Answer 1

A:

A start:

How do I determine if a random string sounds like English?

Leniel Macaferi 2010-09-08 03:09:57

Thanks... but I don't need to build a FSM for this. I have a string, and I have a list of words. This is merely a comparison task... just wondering what is the best DS to do this (space and time).

Dervin Thunk 2010-09-08 03:15:11

Answer 2

+8 A:

I would load the dictionary words into a trie, then read the string from left to right and check whether each substring is in the trie. If it is and the node has children, keep going. Whenever the substring is a valid word (at a leaf or at an interior node), add to its occurrence count.

In pseudo code:

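Trie dict = ... // load dictionary
Dictionary occurrences = {}

for i in 0 .. length(string) - 1:
    j = i + 1
    while j <= length(string):
        partial = string.Substring(i, j)  // characters i .. j-1
        if isWord(partial):
            occurrences[partial]++        // count before checking for children, so leaf words are not missed
        if not dict.hasChildren(partial):
            break                         // no longer word can start with partial
        j++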

This way you're guaranteed not to miss a match, because every start position is tried and each scan continues for as long as a longer dictionary word is still possible.

You can limit the minimum length of the valid words by changing what j is initialized to or by rejecting short words in the isWord() method (so "a" wouldn't be a "valid" word).
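
As a rough illustration of the idea above, here is a minimal Python sketch. The names TrieNode, build_trie, count_words and the min_len parameter are made up for this example and are not from the answer; it walks the trie one character at a time instead of re-checking each substring from scratch, and it assumes overlapping matches should all be counted.

```python
from collections import Counter


class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child TrieNode
        self.is_word = False  # True if a dictionary word ends at this node


def build_trie(words):
    """Insert every dictionary word into a trie and return the root."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def count_words(text, root, min_len=1):
    """Count every dictionary word that occurs as a substring of text."""
    counts = Counter()
    for i in range(len(text)):
        node = root
        for j in range(i, len(text)):
            node = node.children.get(text[j])
            if node is None:
                break  # no dictionary word continues along this path
            if node.is_word and j - i + 1 >= min_len:
                counts[text[i:j + 1]] += 1
    return counts


trie = build_trie(["a", "at", "cat", "catalog", "log"])
result = count_words("catalogcat", trie, min_len=2)
```

With min_len=2 the one-letter word "a" is ignored, which corresponds to rejecting short words in isWord(). In the worst case the scan takes time proportional to the text length times the length of the longest dictionary word.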

NullUserException 2010-09-08 03:15:05

This should be more than enough to start with. Thanks!

Dervin Thunk 2010-09-08 03:26:04

Answer 3

+6 A:

The Aho-Corasick string matching algorithm builds the matching structure in time linear in the size of the dictionary and matches patterns in time linear in the size of the input text plus the number of matches found.
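
To make that concrete, here is a small Python sketch of an Aho-Corasick style matcher: a trie with failure links built breadth-first, then a single pass over the text. The function names build_automaton and count_matches are assumptions for this example, not part of any library, and the dictionary is assumed to contain non-empty words.

```python
from collections import Counter, deque


def build_automaton(words):
    """Build the goto, fail and out tables for a list of non-empty words."""
    goto = [{}]  # goto[state] maps a character to the next state
    fail = [0]   # fail[state] is the state to fall back to on a mismatch
    out = [[]]   # out[state] lists the words that end at this state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        if word and word not in out[state]:
            out[state].append(word)

    # Breadth-first pass: shallower states get their failure links first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit words that end at the failure state (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def count_matches(text, automaton):
    """Scan text once and count every occurrence of every word."""
    goto, fail, out = automaton
    counts = Counter()
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            counts[word] += 1
    return counts


automaton = build_automaton(["a", "at", "cat", "catalog", "log"])
matches = count_matches("catalogcat", automaton)
```

Unlike the restart-at-every-position scan, the text pointer never moves backwards here; the failure links reuse what has already been matched.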

mcdowella 2010-09-08 04:39:20

+1: A trie is good, but a trie + a good search algorithm is far better.

James McNellis 2010-09-08 04:42:24

Nice complement. Upvoted.

Dervin Thunk 2010-09-08 14:57:30
