I'm on OS X, and in Objective-C I'm trying to convert text that has lost its spaces,

for example, "Bobateagreenapple"

into "Bob ate a green apple"

Is there any way to do this efficiently? Would something involving a spell checker work?

EDIT: Some extra information: I'm attempting to build something that takes misformatted text (for example, text copy-pasted from old PDFs that ends up without spaces, especially from internet archives like JSTOR). Since the misformatted text is probably going to be long, I'm trying to figure out whether this is feasible before I attempt to write a system, only to find out it takes 2 hours to fix a paragraph of text.

A: 

I don't think there is any way to do that automatically. Even with a dictionary as a reference, how would it know that "Bob" is a word, so that it can split the text there? It would also have to analyze grammar, and even that is not simple.

You might be doing something wrong earlier on if you have reached the point where this is mandatory...

Rama
+1  A: 

Solving this problem is much harder than anything you'll find in a framework. Notice that even in your example, there are other "solutions": "Bob a tea green apple," for one.

A very naive (and not very functional) approach might be to use a spell-checker to try to isolate one "real word" at a time in the string; of course, in this example, that would only work because "Bob" happens to be an English word.
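To make the weakness of that naive approach concrete, here is a minimal sketch. It assumes a tiny hand-written word set (`WORDS`, hypothetical) in place of a real spell checker, and a hypothetical helper `greedy_split` that always peels off the longest known word and never backtracks.

```python
# Toy stand-in for a spell checker's word list (hypothetical, not a real dictionary).
WORDS = {"bob", "boba", "ate", "a", "green", "apple"}

def greedy_split(text, words=WORDS):
    """Peel off the longest known word at each step; return None if stuck."""
    text = text.lower()
    result = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shorter ones.
        for end in range(len(text), pos, -1):
            if text[pos:end] in words:
                result.append(text[pos:end])
                pos = end
                break
        else:
            # No known word starts here, and the greedy pass never backtracks.
            return None
    return result

print(greedy_split("greenapple"))         # splits cleanly
print(greedy_split("Bobateagreenapple"))  # grabs "boba" first, then gets stuck
```

With this word set the second call fails: the greedy pass commits to "boba", finds no word starting at "tea...", and has no way to go back and try "bob" instead.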

This is not to say that there is no way to accomplish what you want, but the way you phrase this question indicates to me that it might be a lot more complicated than what you're expecting. Maybe someone can give you an acceptable solution, but I bet they'll need to know a lot more about what exactly you're trying to do.

Edit: in response to your edit, it would probably take less effort to run some kind of OCR tool on the PDF and correct its output than to correct what this system might give you, let alone to program it.

zem
It could also be "Boba tea…", which is a fairly popular drink that does indeed come in apple flavors, so even a sophisticated analysis of how often phrases occur and in what contexts could get thrown off.
Chuck
+1  A: 

One possibility, which I will describe in a non-OS-specific manner, is to search through all the possible words that could make up the collection of letters.

Basically, you chop off the first letter of your letter collection and add it to the current word you are forming. If that makes a word (e.g. by dictionary lookup), you add it to the current sentence and start a new word, but you also try continuing to extend the current word, since a longer word may match too. If you manage to use up all the letters in your collection and form words out of all of them, then you have a full sentence. But you don't have to stop there. Instead, you keep running, and eventually you will produce all possible sentences.

Pseudo-code would look something like this:

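FindWords(vector<Sentence> &sentences, Sentence s, Word w, Letters l)
{
    // Every letter has been used and no partial word is left over
    if (l.empty() and w.empty())
    {
        add s to sentences;
        return;
    }
    // Out of letters with an unfinished word: dead end
    if (l.empty())
        return;
    move first letter from l onto the end of w;
    if w in dictionary
    {
        // Option 1: end the word here and start a new one
        add w to s;
        FindWords(sentences, s, empty word, l);
        remove w from s;
    }
    // Option 2: keep extending the current word
    FindWords(sentences, s, w, l);
    put last letter from w back onto the front of l;
}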
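
For anyone who wants to try the idea, here is a minimal runnable sketch of the same exhaustive search in Python. It is an illustration under assumptions rather than a finished implementation: `DICTIONARY` is a tiny hypothetical word list, and `find_sentences` is a hypothetical name. It tracks positions in the string instead of moving letters back and forth, but the branching is the same.

```python
# Toy dictionary for illustration only (hypothetical word list).
DICTIONARY = {"bob", "boba", "ate", "a", "tea", "green", "apple"}

def find_sentences(letters, dictionary=DICTIONARY):
    """Return every way to split `letters` into dictionary words."""
    letters = letters.lower()
    sentences = []

    def search(start, words_so_far):
        # All letters consumed with no partial word left: a full sentence.
        if start == len(letters):
            sentences.append(" ".join(words_so_far))
            return
        # Grow the current word one letter at a time.
        for end in range(start + 1, len(letters) + 1):
            word = letters[start:end]
            if word in dictionary:
                words_so_far.append(word)   # end the word here...
                search(end, words_so_far)   # ...and split the rest
                words_so_far.pop()          # undo, then try a longer word

    search(0, [])
    return sentences

for sentence in find_sentences("Bobateagreenapple"):
    print(sentence)
```

With this toy dictionary the search finds three readings, including "bob a tea green apple" and "boba tea green apple" alongside the intended one, so choosing among the candidates is still a separate problem.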

There are, of course, a number of optimizations you could perform to make it faster. For instance, you can check whether the current word is the stem (prefix) of any word in the dictionary, and abandon that branch as soon as it is not. But this is the basic approach that will give you all possible sentences.
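
The stem check could be sketched like this. As before, `DICTIONARY` is a hypothetical toy word list, and `build_prefixes` and `find_sentences_pruned` are hypothetical names. The prefixes are precomputed into a set here for simplicity; a trie would be the usual choice for a large dictionary.

```python
# Toy dictionary for illustration only (hypothetical word list).
DICTIONARY = {"bob", "boba", "ate", "a", "tea", "green", "apple"}

def build_prefixes(dictionary):
    """Collect every non-empty prefix of every dictionary word."""
    return {word[:i] for word in dictionary for i in range(1, len(word) + 1)}

def find_sentences_pruned(letters, dictionary=DICTIONARY):
    """Split `letters` into dictionary words, pruning hopeless branches."""
    letters = letters.lower()
    prefixes = build_prefixes(dictionary)
    sentences = []

    def search(start, words_so_far):
        if start == len(letters):
            sentences.append(" ".join(words_so_far))
            return
        for end in range(start + 1, len(letters) + 1):
            word = letters[start:end]
            if word not in prefixes:
                # No dictionary word begins this way, so longer
                # extensions cannot match either: stop growing.
                break
            if word in dictionary:
                words_so_far.append(word)
                search(end, words_so_far)
                words_so_far.pop()

    search(0, [])
    return sentences

print(find_sentences_pruned("Bobateagreenapple"))
```

The pruning only cuts off candidate words that cannot match, so the set of sentences found is unchanged. The number of valid sentences can still grow quickly on long inputs.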

Nathan S.