views: 492
answers: 7

The title of this is kind of awkward; I wasn't really sure how to sum this up. I know how I can do this, I'm just not sure how to do it efficiently. Here's my problem:

I have a string as input. Let's say:

foo bar

And I have a very large set of strings (tens of thousands). Let's say:

foo, baz, bar, blah, foo bar, foo baz

I need to match the input to strings in the set. In this case, "foo", "bar", and "foo bar" are considered matches.

So, I need to either somehow search all permutations of the input (it could be longer than 2 words), or somehow detect if the user meant to put it (or part of it) in quotes. Or maybe do something I haven't thought of.

Is there some kind of data structure or algorithm I can use for this? How should I go about doing it, or should I not handle this use case?

EDIT: There's a typo above that distorted the problem; in the above example, "foo baz" is also a match. Sorry about that. I essentially want to match any permutation of the input words to the dictionary. So an input of "abc xyz" would match to "123 abc" or "abc xyz" or "xyz 123", but not "abcxyz".

+1  A: 

What you need is Lucene

Dennis Cheung
@Dennis: The page - www.google.com/?q=Lucene - does not exist. Perhaps you meant: http://lucene.apache.org/java/docs/
CPerkins
+2  A: 

I'd suggest using a dictionary. Use strings as keys and a list of strings as the value. Tokenize the strings that will be searched, and add the entire string to your dictionary once for each token. (You can use the split method to tokenize your strings, with space as the delimiter.) Thereafter, whenever you need to do a lookup, tokenize the search string and perform a lookup for each token in your dictionary.

Thus if you have added the following strings: foo, baz, bar, blah, foo bar, foo baz

Your dictionary has entries:

foo:  foo, foo bar, foo baz
baz:  baz, foo baz
bar:  bar, foo bar
blah: blah

Should you then search for "foo bar", the search string tokenizes into foo and bar, and your output is the union of the entries stored under those two keys:

foo: foo, foo bar, foo baz
bar: bar, foo bar

giving: foo, foo bar, foo baz, bar

EDIT: I've just noticed that you only want complete or partial matches, i.e. foo baz is not acceptable. The easy solution is to post-process the results: restrict the longer of the search string and target string to the length of the shorter one, then compare the truncated string with the unmodified string. Accept only those that are equivalent.

EDIT: So it turns out foo baz is indeed a match. Disregard the paragraph above (first edit). See code as follows:

class DictionarySearch
{
    private Dictionary<string, List<string>> dict;

    public DictionarySearch()
    {
        dict = new Dictionary<string, List<string>>();
    }

    /// <summary>
    /// Add a string e.g. foo bar to the dictionary
    /// </summary>
    /// <param name="s">string to be added</param>
    public void addString(string s)
    {
        //tokenize string
        string[] words = s.Split(new char[] { ' ' });

        //add each token to the dictionary as a key with the matching value being s
        foreach (string w in words)
        {
            if (dict.ContainsKey(w))
            {
                dict[w].Add(s);
            }
            else
            {
                dict.Add(w, new List<string>());
                dict[w].Add(s);
            }
        }
    }
    /// <summary>
    /// Find all strings which match at least one token
    /// </summary>
    /// <param name="s">string of tokens (words) to be matched</param>
    /// <returns>List of strings matching at least one word</returns>
    public IList<string> getMatches(string s)
    {
        //split search string into words
        string[] words = s.Split(new char[] { ' ' });
        List<string> output = new List<string>();

        //retrieve from dictionary the list of strings matching each word
        //(duplicates are not removed, so a string matching several words appears more than once)
        foreach (string w in words)
        {
            if (dict.ContainsKey(w))
            {
                output.AddRange(dict[w]);
            }
            else
            {
                continue;
            }
        }

        return output;
    }
}

Given a dictionary with m strings, q words per string, and n unique words, and a search string with l words, the time complexities are as follows:

Populate the data structure: O(q*m*T[dictionary-insert]). An insert is performed for each word.

Find a string: O(l*T[dictionary-find]). A dictionary lookup is performed per word in the search string.

The actual cost depends on your dictionary implementation. A hash table based dictionary incurs O(1) cost for both insert and find. A binary tree based dictionary incurs O(lg n) cost for both insert and find.

ejspencer
A: 

This code works. Don't know if that's efficient enough for you:

    String[] dict = "foo bar".split(" ");

    String[] array = new String[] { "foo", "baz", "bar", "blah", "foo bar",
            "foo baz" };

    // Print every candidate string that shares at least one word with the input.
    loop: for (String s : array) {
        String[] a = s.split(" ");

        for (String sample : dict)
            for (String s1 : a)
                if (sample.equals(s1)) {
                    System.out.println(s);
                    continue loop;
                }
    }
tulskiy
+1  A: 

(When you say "efficient", you probably need to be more explicit in terms of space and time. Let's assume you mean time efficiency, given that you mentioned permutations.)

The task of computing the answer to

String[] findStringsContaining(List<String> strings, String[] words)

can be partitioned and handed off to parallel threads of execution, given that the intermediate stage is purely functional and side-effect free, with the results joined in a final step. That is, you can partition across the words and/or the list of strings.

This is how map-reduce works (and in your case, it's irrelevant that it's all happening on the same machine).

Your mapper (assigned to a thread for each of the words) is:

boolean [] stringContainsWord (List<String> strings, String word);

That method would execute in parallel.

The boolean array would then hold true at each index of the list whose string matched the given word.

and your reducer (running after all mappers have finished) is:

List<String> getMatchingList(List<String> strings, List<boolean[]> mapperResults);

Putting aside thread overhead, and assuming the cost of spawning a mapper thread per word is negligible for a reasonable number of input words, this gives you an O(n) (mapper) + O(m) (reducer) process, where n is the number of items in your list of strings and m is the number of words in your input.

You can further parallelize the task by partitioning your list of Strings and running p threads for each of the words, having each thread search a subset of the list, so that the input List to each mapper is 1/p of the overall list.
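A minimal sketch of this idea in Java, using an ExecutorService instead of hand-rolled threads (the ParallelWordSearch class and the use of Callable/Future here are illustrative assumptions, not part of the answer above; the method names match the signatures given earlier):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelWordSearch {

        // Mapper: for one word, mark which strings in the list contain it as a whole word.
        static boolean[] stringContainsWord(List<String> strings, String word) {
            boolean[] hits = new boolean[strings.size()];
            for (int i = 0; i < strings.size(); i++) {
                hits[i] = Arrays.asList(strings.get(i).split(" ")).contains(word);
            }
            return hits;
        }

        // Reducer: union of all mapper results, preserving list order.
        static List<String> getMatchingList(List<String> strings, List<boolean[]> mapperResults) {
            List<String> matches = new ArrayList<>();
            for (int i = 0; i < strings.size(); i++) {
                for (boolean[] hits : mapperResults) {
                    if (hits[i]) {
                        matches.add(strings.get(i));
                        break; // don't add the same string twice
                    }
                }
            }
            return matches;
        }

        public static List<String> findStringsContaining(List<String> strings, String[] words)
                throws InterruptedException, ExecutionException {
            if (words.length == 0) return new ArrayList<>();
            ExecutorService pool = Executors.newFixedThreadPool(words.length);
            List<Future<boolean[]>> futures = new ArrayList<>();
            for (String word : words) {
                Callable<boolean[]> mapper = () -> stringContainsWord(strings, word); // one mapper per word
                futures.add(pool.submit(mapper));
            }
            List<boolean[]> mapperResults = new ArrayList<>();
            for (Future<boolean[]> f : futures) {
                mapperResults.add(f.get()); // wait for all mappers before reducing
            }
            pool.shutdown();
            return getMatchingList(strings, mapperResults);
        }
    }

For the example in the question, findStringsContaining(Arrays.asList("foo", "baz", "bar", "blah", "foo bar", "foo baz"), "foo bar".split(" ")) returns foo, bar, foo bar and foo baz.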

--

An alternative approach which you may want to consider, especially if the list of strings is huge and the content is natural language (such as English), is to optimize given the fact that most languages have a fairly small set of words that constitute the bulk of sentences in that language. For example, if your list has 2 million English sentences, chances are that the list of unique words is many orders of magnitude smaller (say a few hundred).

In that case, you can have a Map of word -> sentences, and testing for matching sentences for any given word is reduced to a lookup in the map.

(Note that you can still combine the initial approach with this.)

A: 

From ejspencer's idea I put this together:

// Build the dictionary/data structure
// (uses java.util.Map, HashMap, List and ArrayList)
// O( [average split length]*n )
public static Map<String, List<Integer>> BuildDictionary(String[] data)
{
    Map<String, List<Integer>> dict = new HashMap<String, List<Integer>>();
    for (int i = 0; i < data.length; i++)
    {
        String[] temp = data[i].split(" ");
        for (int j = 0; j < temp.length; j++)
        {
            if (dict.get(temp[j]) == null)
                dict.put(temp[j], new ArrayList<Integer>());

            dict.get(temp[j]).add(i);
        }
    }

    return dict;
}

// Find all the matches
// O( [average number of matches per key]*[input split length] )
public static List<Integer> FindMatches(String input, Map<String, List<Integer>> dict)
{
    String[] temp = input.split(" ");
    List<Integer> ret = new ArrayList<Integer>();

    for (int i = 0; i < temp.length; i++)
    {
        if (dict.get(temp[i]) == null)
            continue; // no match for this word

        // read the matches into the return list, ignoring duplicates
        List<Integer> match = dict.get(temp[i]);
        for (int j = 0; j < match.size(); j++)
            if (!ret.contains(match.get(j)))
                ret.add(match.get(j));
    }

    return ret;
}

it probably won't compile right off, but I figure you're going to have to futz with it anyway and this gives you a pretty good idea for fast access and simple code (no offense alphazero).

This search is case-sensitive, but you can just as easily use toUpperCase or toLowerCase to change that.

CodePartizan
+2  A: 

How big is your dictionary? You could convert your dictionary to a trie. There have been posts by people on how to convert a dictionary to a trie. Once you do that, lookup is simple and fast.

Also, a simple solution could be to break up the search string into separate words, and search each of them in your trie, making sure duplicates are not considered twice.
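A minimal word-trie sketch in Java (the WordTrie class is an illustrative assumption, not from an existing post): insert every word of every dictionary entry, then test each word of the search string for exact membership.

    import java.util.HashMap;
    import java.util.Map;

    public class WordTrie {
        private static class Node {
            Map<Character, Node> children = new HashMap<>();
            boolean isWord; // true if the path from the root spells a stored word
        }

        private final Node root = new Node();

        // Insert every word of a dictionary entry, e.g. "foo bar" inserts "foo" and "bar".
        public void addEntry(String entry) {
            for (String word : entry.split(" ")) {
                Node node = root;
                for (char c : word.toCharArray()) {
                    node = node.children.computeIfAbsent(c, k -> new Node());
                }
                node.isWord = true;
            }
        }

        // Exact lookup of a single word.
        public boolean containsWord(String word) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.get(c);
                if (node == null) {
                    return false;
                }
            }
            return node.isWord;
        }
    }

To get the matching dictionary entries back rather than a yes/no answer, you could store a list of entry indices on each terminal node, much as the dictionary-based answers above do.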

+1  A: 

For large input strings and dictionaries with multi-word phrases, consider either the Rabin-Karp or Aho-Corasick algorithms.

(Link to Rabin-Karp - http://en.wikipedia.org/wiki/Rabin–Karp_string_search_algorithm - for some reason I could not get it to hyperlink the reference above)
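For reference, a minimal single-pattern Rabin-Karp sketch in Java (my own illustration of the linked algorithm, not code from this answer; the multi-pattern variant used for dictionary matching hashes every phrase of a given length and compares the rolling hash against that set of hashes):

    import java.util.ArrayList;
    import java.util.List;

    public class RabinKarp {
        // Return the start indices of every occurrence of pattern in text.
        public static List<Integer> search(String text, String pattern) {
            List<Integer> hits = new ArrayList<>();
            int n = text.length(), m = pattern.length();
            if (m == 0 || m > n) return hits;

            final long BASE = 256, MOD = 1_000_000_007L;
            long patHash = 0, winHash = 0, pow = 1; // pow = BASE^(m-1) mod MOD

            for (int i = 0; i < m; i++) {
                patHash = (patHash * BASE + pattern.charAt(i)) % MOD;
                winHash = (winHash * BASE + text.charAt(i)) % MOD;
                if (i < m - 1) pow = (pow * BASE) % MOD;
            }

            for (int i = 0; ; i++) {
                // On a hash match, confirm with a direct comparison to rule out collisions.
                if (patHash == winHash && text.regionMatches(i, pattern, 0, m)) {
                    hits.add(i);
                }
                if (i + m >= n) break;
                // Roll the window: drop text[i], shift, add text[i + m].
                winHash = (winHash - text.charAt(i) * pow % MOD + MOD) % MOD;
                winHash = (winHash * BASE + text.charAt(i + m)) % MOD;
            }
            return hits;
        }
    }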

Joel