I am looking for an algorithm that will efficiently split a search string into an array of known search phrases. For instance, if I type "Los Angeles pizza", it needs to know I am looking for "Los Angeles" and "pizza", not "Los" and "Angeles pizza".

This is for a specialized search application; assume I have a dictionary of all the phrases people will use.

A: 

The Google N-gram Corpus could be used to determine the most likely phrase divisions.

For reasonably short phrases, you could generate all the possible sets of n-grams that the phrase can be divided into (e.g. ["Los", "Angeles", "pizza"], ["Los Angeles", "pizza"], ["Los", "Angeles pizza"] and ["Los Angeles pizza"] for your example phrase), look them up in the corpus, and see which one(s) come out with the highest number of occurrences. (Considering the size of the corpus, you'll probably need to load it into a database rather than an in-memory hashtable.)
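
Here is a minimal Python sketch of that generate-and-score approach, assuming the counts are already loaded into memory. NGRAM_COUNTS and CORPUS_SIZE are made-up stand-ins for the real corpus lookup (which, as noted, would more realistically be a database query):

    from itertools import combinations

    # Hypothetical counts standing in for real corpus lookups.
    NGRAM_COUNTS = {
        "los angeles": 500_000,
        "pizza": 900_000,
        "los": 200_000,
        "angeles": 150_000,
        "los angeles pizza": 1_200,
        "angeles pizza": 300,
    }
    CORPUS_SIZE = 1_000_000_000  # assumed total token count, for normalizing

    def segmentations(tokens):
        """Yield every way to split the token list into contiguous phrases.

        n tokens give 2**(n-1) segmentations, so this only suits short queries.
        """
        n = len(tokens)
        for cut_count in range(n):
            for cuts in combinations(range(1, n), cut_count):
                bounds = [0, *cuts, n]
                yield [" ".join(tokens[i:j]) for i, j in zip(bounds, bounds[1:])]

    def best_segmentation(query):
        """Pick the segmentation whose phrases are jointly most probable.

        Each count is normalized to a rough probability; their product is a
        unigram-style score, and an unknown phrase zeroes out its segmentation.
        """
        best, best_score = None, 0.0
        for seg in segmentations(query.lower().split()):
            score = 1.0
            for phrase in seg:
                score *= NGRAM_COUNTS.get(phrase, 0) / CORPUS_SIZE
            if score > best_score:
                best, best_score = seg, score
        return best

    print(best_segmentation("Los Angeles pizza"))  # -> ['los angeles', 'pizza']

Normalizing the counts to probabilities matters: scoring with raw counts would always favour the segmentation with the most pieces, since every extra segment multiplies in another large number.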

EDIT: By the looks of things, the Google corpus isn't freely available. There may be similar resources you could use, though. If not, there are certainly corpora of web text that you can download and use to build your own n-gram lists.

David