views: 430
answers: 10

I have a lot of compound strings that are a combination of two or three English words.

    e.g. "Spicejet" is a combination of the words "spice" and "jet"

I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.

What would be the most efficient way to separate individual English words from such compound strings?

+1  A: 

It sounds to me like you want to store your dictionary in a Trie or a DAWG data structure.

A Trie already stores words as compound words. So "spicejet" would be stored as "spice*jet*" where the * denotes the end of a word. All you'd have to do is look up the compound word in the dictionary and keep track of how many end-of-word terminators you hit. From there you would then have to try each substring (in this example, we don't yet know if "jet" is a word, so we'd have to look that up).
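
A minimal sketch of that lookup in Python, assuming a plain dict-based trie (the names here are illustrative, not from any particular library):

class TrieNode(object):
    def __init__(self):
        self.children = {}
        self.is_word = False   # the end-of-word "*" marker

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def word_end_positions(root, compound):
    # Walk the compound through the trie and record every position where a
    # dictionary word ends.
    node, ends = root, []
    for i, ch in enumerate(compound):
        node = node.children.get(ch)
        if node is None:
            break
        if node.is_word:
            ends.append(i + 1)
    return ends

With "spic" and "spice" in the trie, word_end_positions(root, "spicejet") returns [4, 5]; you would then recurse on the remainders ("ejet" and "jet") to check whether they are words too.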

Niki Yoshiuchi
You should either describe Tries and DAWGs, or give links to pages that do. It's possible that not everyone in the world will immediately know what you are talking about ;-)
A. Levy
Come on DAWG, everybody knows what a Trie is :)
Alex
You're right. For some reason I decided to check Stack Overflow late last night and saw this question. I didn't want to go into detail because I wanted to go to sleep, but felt compelled to answer.
Niki Yoshiuchi
+2  A: 

And how will you decide how to divide things? Look around the web and you'll find examples of URLs that turned out to have other meanings.

Assuming you didn't have the capitals to go on, what would you do with these (ones that come to mind at present; I know there are more):

PenIsland
KidsExchange
TherapistFinder

The last one is particularly problematic because the troublesome part is two words run together rather than a compound word, and the meaning completely changes when you break it.

Loren Pechtel
Also - you'd probably want to leave GameCocks together if you're from South Carolina, but perhaps not if you're from PenIsland.
Anon
When there is any such ambiguity I will have to decide whether to use the longest or the shortest words. But, in any case, first I will have to find out all possible ways to split the input into meaningful words.
Manas
+2  A: 

So, given a word, is it a compound word, composed of two other English words? You could have some sort of lookup table for all such compound words, but if you just examine the candidates and try to match against English words, you will get false positives.

Edit: looks as if I am going to have to provide some examples. Words I was thinking of include:

accustomednesses != accustomed + nesses
adulthoods != adult + hoods
agreeabilities != agree + abilities
willingest != will + ingest
windlasses != wind + lasses
withstanding != with + standing
yourselves != yours + elves
zoomorphic != zoom + orphic
ambassadorships != ambassador + ships
allotropes != allot + ropes

Here is some Python code to try out, to make the point. Get yourself a dictionary on disk and have a go:

from __future__ import with_statement

def opendict(dictionary=r"g:\words\words(3).txt"):
    # Load the dictionary file into a set, one word per line
    with open(dictionary, "r") as f:
        return set(line.strip() for line in f)

if __name__ == '__main__':
    s = opendict()
    for word in sorted(s):
        if len(word) >= 10:
            # Try split points that leave at least four characters on either side
            for i in range(4, len(word) - 4):
                left, right = word[:i], word[i:]
                if (left in s) and (right in s):
                    if right not in ('nesses', ):
                        print word, left, right
hughdbrown
+7  A: 

I'm not sure how much time you have, or how frequently you need to do this (is it a one-time operation? daily? weekly?), but you're obviously going to want a quick, weighted dictionary lookup.

You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.

I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.

You'll have to build the Tries yourself from a good dictionary source, and weight the nodes on full words to provide yourself a good quality mechanism for reference.

Just brainstorming here, but if you know your dataset consists primarily of duplets or triplets, you could probably get away with multiple Trie lookups, for example looking up 'Spic' and then 'ejet', finding that both results have a low score, and abandoning that split in favour of 'Spice' and 'Jet', where both lookups yield a good combined result between the two.

Also I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the' or 'un' or 'in' and weighting those accordingly.
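
As a rough sketch of that weighting idea (the frequency table here is an assumption, not part of any particular library):

# Score candidate splits by the average frequency of their words, so common
# words beat rare-but-valid fragments like 'spic' + 'ejet'.
def best_split(candidate_splits, frequency):
    def score(split):
        return sum(frequency.get(w, 0) for w in split) / float(len(split))
    return max(candidate_splits, key=score)

For example, best_split([["spice", "jet"], ["spic", "ejet"]], {"spice": 90, "jet": 80, "spic": 2}) returns ["spice", "jet"].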

Sounds like a fun problem, good luck!

jscharf
+1  A: 

It occurs to me that there are a relatively small number of substrings (minimum length 2) from any reasonable compound word. For example for "spicejet" I get:

'sp', 'pi', 'ic', 'ce', 'ej', 'je', 'et',
'spi', 'pic', 'ice', 'cej', 'eje', 'jet',
'spic', 'pice', 'icej', 'ceje', 'ejet',
'spice', 'picej', 'iceje', 'cejet',
'spicej', 'piceje', 'icejet',
'spiceje', 'picejet'

... 27 substrings.

So, find a function to generate all those (slide across your string using strides of 2, 3, 4 ... (len(yourstring) - 1)) and then simply check each of those in a set or hash table.
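
A quick sketch of that in Python, assuming the dictionary has already been loaded into a set:

# Generate every substring of length 2 .. len(s) - 1 and keep those found in
# the word set.
def dictionary_substrings(s, words):
    found = []
    for length in range(2, len(s)):
        for start in range(len(s) - length + 1):
            piece = s[start:start + length]
            if piece in words:
                found.append(piece)
    return found

For example, dictionary_substrings("spicejet", set(["spice", "ice", "jet"])) returns ['ice', 'jet', 'spice'].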

Jim Dennis
+1  A: 

Word existence could be done with a trie, or more simply with a set (i.e. a hash table). Given a suitable function, you could do:

# assumes is_word() is the dictionary-lookup function mentioned above
def splitword(word):
    n = len(word)

    for i in range(1, n + 1):
        head = word[:i]  # first i characters
        tail = word[i:]  # everything else

        if is_word(head):
            if tail == "":
                return [head]  # the whole remaining string is a word; return it as a 1-element list
            rest = splitword(tail)
            if rest:  # the tail also split successfully into words
                return [head] + rest

    return []  # No successful split found, and 'word' is not a word.

Basically, just try the different break points to see if we can make words. The recursion means it will backtrack until a successful split is found.

Of course, this may not find the splits you want. You could modify this to return all possible splits (instead of merely the first found), then do some kind of weighted sum, perhaps, to prefer common words over uncommon words.
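
A sketch of that "all possible splits" variant, taking the dictionary as a plain set instead of an is_word() function:

# Return every way to split `word` into members of the set `words`.
def all_splits(word, words):
    results = []
    for i in range(1, len(word) + 1):
        head, tail = word[:i], word[i:]
        if head in words:
            if tail == "":
                results.append([head])
            else:
                results.extend([head] + rest for rest in all_splits(tail, words))
    return results

all_splits("spicejet", set(["spic", "spice", "ejet", "jet"])) returns [['spic', 'ejet'], ['spice', 'jet']], which you could then rank by word frequency.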

John Fouhy
+1  A: 

A similar question was asked recently: Word-separating algorithm. If you wanted to limit the number of splits, you would keep track of the number of splits in each of the tuples (so instead of a pair, a triple).

Martin Hock
A: 

I would use the following algorithm.

  1. Start with the sorted list of words to split, and a sorted list of declined words (dictionary).

  2. Create a result list of objects which should store: remaining word and list of matched words.

  3. Fill the result list with the words to split as remaining words.

  4. Walk through the result array and the dictionary concurrently -- always increasing the least of the two, in a manner similar to the merge algorithm. In this way you can compare all the possible matching pairs in one pass.

  5. Any time you find a match, i.e. a word to split that starts with a dictionary word, replace it in the result list with the matched dictionary word and the remaining part. You have to take possible multiple matches into account.

  6. Any time the remaining part is empty, you found a final result.

  7. Any time you don't find a match on the "left side", in other words, every time you increment the result pointer because of no match, delete the corresponding result item. This word has no matches and can't be split.

  8. Once you get to the bottom of the lists, you will have a list of partial results. Repeat the loop (go back to point 4) until this list is empty; a rough sketch of the whole procedure is shown below.
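
A rough sketch of steps 2-8 in Python; for brevity it scans the dictionary for prefixes on each pass instead of doing a true sorted merge walk, but the expand-and-replace structure is the same:

def split_all(words_to_split, dictionary):
    results = [(w, []) for w in sorted(words_to_split)]   # steps 2-3
    finished = []
    while results:                                        # step 8: repeat until empty
        next_round = []
        for remaining, matched in results:
            prefixes = [d for d in dictionary if remaining.startswith(d)]
            if not prefixes:
                continue                                  # step 7: no match, drop the item
            for d in prefixes:                            # step 5: handle multiple matches
                rest = remaining[len(d):]
                if rest == "":
                    finished.append(matched + [d])        # step 6: a final result
                else:
                    next_round.append((rest, matched + [d]))
        results = next_round
    return finished

For example, split_all(["spicejet"], set(["spice", "spic", "jet"])) returns [["spice", "jet"]].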

Sklivvz
+4  A: 

If the aim is to find "the largest possible break up for the input" as you replied, then the algorithm could be fairly straightforward if you use some graph theory. You take the compound word and make a graph with a vertex before and after every letter. You'll have a vertex for each index in the string and one past the end. Next you find all legal words in your dictionary that are substrings of the compound word. Then, for each legal substring, add an edge with weight 1 to the graph connecting the vertex before the first letter in the substring with the vertex after the last letter in the substring. Finally, use a shortest path algorithm to find the path with the fewest edges between the first and the last vertex.

The pseudo code is something like this:

parseWords(compoundWord)
    # Make the graph
    graph = makeGraph()
    N = compoundWord.length
    for index = 0 to N
        graph.addVertex(index)

    # Add the edges for each word
    for index = 0 to N - 1
        for length = 1 to min(N - index, MAX_WORD_LENGTH)
            potentialWord = compoundWord.substr(index, length)
            if dictionary.isElement(potentialWord)
                graph.addEdge(index, index + length, 1)

    # Now find a list of edges which define the shortest path
    edges = graph.shortestPath(0, N)

    # Change these edges back into words.
    result = makeList()
    for e in edges
        result.add(compoundWord.substr(e.start, e.stop - e.start))
    return result

I, obviously, haven't tested this pseudo-code, and there may be some off-by-one indexing errors, and there isn't any bug-checking, but the basic idea is there. I did something similar to this in school and it worked pretty well. The edge creation loops are O(M * N), where N is the length of the compound word, and M is the maximum word length in your dictionary or N (whichever is smaller). The shortest path algorithm's runtime will depend on which algorithm you pick. Dijkstra's comes most readily to mind. I think its runtime is O(N^2 * log(N)), since the max edges possible is N^2.

You can use any shortest path algorithm. There are several shortest path algorithms which have their various strengths and weaknesses, but I'm guessing that for your case the difference will not be too significant. If, instead of trying to find the fewest possible words to break up the compound, you wanted to find the most possible, then you give the edges negative weights and try to find the shortest path with an algorithm that allows negative weights.
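
A minimal runnable sketch of this approach in Python, assuming the dictionary is a plain set. Since every edge has weight 1, a breadth-first search finds the fewest-edges path without needing a full Dijkstra implementation:

from collections import deque

def parse_words(compound, dictionary, max_word_length=30):
    n = len(compound)
    parent = {0: None}                    # vertex -> (previous vertex, word on that edge)
    queue = deque([0])
    while queue:
        i = queue.popleft()
        if i == n:
            break                         # reached the vertex past the last letter
        for j in range(i + 1, min(n, i + max_word_length) + 1):
            piece = compound[i:j]
            if piece in dictionary and j not in parent:
                parent[j] = (i, piece)    # edge i -> j labelled with the word it spans
                queue.append(j)
    if n not in parent:
        return []                         # the compound cannot be split into words
    words, v = [], n
    while v != 0:
        i, piece = parent[v]
        words.append(piece)
        v = i
    words.reverse()
    return words

For example, parse_words("spicejet", set(["spice", "spic", "jet"])) returns ['spice', 'jet'].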

A. Levy
A: 

This can be a very difficult problem and there is no simple general solution (there may be heuristics that work for small subsets).

We face exactly this problem in chemistry where names are composed by concatenation of morphemes. An example is:

ethylmethylketone

where the morphemes are:

ethyl, methyl, and ketone

We tackle this through automata and maximum entropy, and the code is available on SourceForge:

http://www.sf.net/projects/oscar3-chem

but be warned that it will take some work.

We sometimes encounter ambiguity and are still finding a good way of reporting it.

To distinguish between penIsland and penisLand would require domain-specific heuristics. The likely interpretation will depend on the corpus being used - no linguistic problem is independent of the domain or domains being analysed.

As another example the string

weeknight

can be parsed as

wee knight

or

week night

Both are "right" in that they obey the form "adj-noun" or "noun-noun". Both make "sense" and which is chosen will depend on the domain of usage. In a fantasy game the first is more probable and in commerce the latter. If you have problems of this sort then it will be useful to have a corpus of agreed usage which has been annotated by experts (technically a "Gold Standard" in Natural Language Processing).

peter.murray.rust