I have a database full of phrases (80-100 characters each) and some longish documents (50-100 KB), and I would like a ranked list of phrases for a given document, rather than the usual search-engine output of a list of documents for a given phrase.

I've used MySQL full-text indexing before and looked into Lucene, but never used it. They both seem geared to comparing the short (the search term) with the long (the document).

How would you get the inverse of this?

A: 

Would it be too slow to turn each phrase into a regex and run each one on the document, counting the number of occurrences?

If that doesn't work, maybe you can combine all the phrases into one huge regex (using |), and compile it. Then, run that huge regex starting from every character in the document. Count the number of matches as you go through the characters.
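
For example, here's a minimal sketch of the combined-regex idea (PhraseCounter and countPhrases are just illustrative names, not anything from an existing library). Quoting each phrase keeps any regex metacharacters literal, and restarting the search one character past each match start counts overlapping occurrences:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class PhraseCounter {
        static Map<String, Integer> countPhrases(List<String> phrases, String document) {
            // Combine all phrases into one alternation, quoting each so
            // regex metacharacters inside a phrase are treated literally.
            StringBuilder sb = new StringBuilder();
            for (String p : phrases) {
                if (sb.length() > 0) sb.append('|');
                sb.append(Pattern.quote(p));
            }
            Pattern all = Pattern.compile(sb.toString());

            Map<String, Integer> counts = new HashMap<String, Integer>();
            Matcher m = all.matcher(document);
            int from = 0;
            // find(from) resets the matcher and searches from that index,
            // so overlapping occurrences are counted as well.
            while (m.find(from)) {
                String hit = m.group();
                Integer c = counts.get(hit);
                counts.put(hit, c == null ? 1 : c + 1);
                from = m.start() + 1;
            }
            return counts;
        }
    }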

Claudiu
I can trade time to build an index, so that looking up a list of phrases (for a given document) is as fast as possible.
Tourch
A: 

How large is the database of phrases? I am assuming that it is very large.

I would do the following:

  1. Index the phrases by one of the words in them; you might choose the least common word in each phrase. You could make the search better by assuming that the word is at least, say, 5 characters long, and padding shorter words to 5 characters. The padding can be the space after the word followed by the subsequent word (to reduce false matches), or some default characters (e.g. "XX") if the word occurs at the end of the phrase. (A sketch of this keying scheme follows the list.)

  2. Go through your document, converting each word (common ones can be discarded) to a key, padding if necessary.

  3. Retrieve the relevant phrases by looking up these keys.

  4. Use an in-memory text search to find the number of occurrences of each of the retrieved phrases.

  5. I am assuming that phrases cannot cross a sentence boundary. In this case, you can read each sentence of the document into a substring in an array, and use the substring function to search through each sentence for each of the phrases and count occurrences, keeping a running sum for each phrase.
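
A rough sketch of the keying scheme from steps 1 and 2, assuming 5-character keys; keyFor, buildIndex, and leastCommonIndex are hypothetical names, not anything from an existing library:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class PhraseIndex {
        static final int KEY_LEN = 5;

        // Key for the word at position i of a tokenized phrase: the word
        // itself if it is long enough, otherwise padded with the space and
        // the following word, or with a default character at phrase end.
        static String keyFor(List<String> phrase, int i) {
            String word = phrase.get(i);
            if (word.length() >= KEY_LEN) return word;
            StringBuilder sb = new StringBuilder(word);
            if (i + 1 < phrase.size()) sb.append(' ').append(phrase.get(i + 1));
            while (sb.length() < KEY_LEN) sb.append('X'); // default padding
            return sb.substring(0, KEY_LEN);
        }

        // Step 1: index each tokenized phrase under the key of its least
        // common word.
        static Map<String, List<List<String>>> buildIndex(List<List<String>> phrases) {
            Map<String, List<List<String>>> index = new HashMap<String, List<List<String>>>();
            for (List<String> phrase : phrases) {
                String key = keyFor(phrase, leastCommonIndex(phrase));
                List<List<String>> bucket = index.get(key);
                if (bucket == null) {
                    bucket = new ArrayList<List<String>>();
                    index.put(key, bucket);
                }
                bucket.add(phrase);
            }
            return index;
        }

        // Stub: would pick the position of the word with the lowest
        // corpus frequency.
        static int leastCommonIndex(List<String> phrase) {
            return 0;
        }
    }

At query time, step 2 computes keyFor for each word position in the document and looks the key up in the index.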

Larry Watanabe
+2  A: 

I did something similar with a database of Wikipedia titles and managed to get down to a few hundred milliseconds for each ~50KB document. That was still not fast enough for my needs, but maybe it can work for you.

Basically the idea was to work with hashes as much as possible and only do string comparisons on possible matches, which are pretty rare.

First, you take your database and convert it into an array of hashes. If you have billions of phrases, this may not be for you. When you calculate the hash, be sure to pass the phrases through a tokenizer that will remove punctuation and whitespace. This part needs to be done only once.

Then, you go through the document with the same tokenizer, keeping a running list of the last 1, 2, ..., n tokens, hashed. At every iteration, you do a binary search of the hashes you have against the hash database.

When a hash matches, you do the actual string comparison to confirm that it's a real match and not a collision.

Here's some code to give you a taste of what I mean, though this example doesn't actually do the string comparison:

    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.Map;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // distinct phrase hashes seen in this document
    HashSet<Long> foundHashes = new HashSet<Long>();
    Map<Long, Integer> count = new HashMap<Long, Integer>();

    // sliding window of the last maxPhrase tokens, padded with empty strings
    LinkedList<String> words = new LinkedList<String>();
    for (int i = 0; i < params.maxPhrase; i++) words.addLast("");

    StandardTokenizer st = new StandardTokenizer(new StringReader(docText));
    Token t = new Token();
    while (st.next(t) != null) {
        String token = new String(t.termBuffer(), 0, t.termLength());
        words.addLast(token);
        words.removeFirst();

        // hash every candidate phrase length ending at the current token
        for (int len = params.minPhrase; len < params.maxPhrase; len++) {
            String term = Utils.join(
                    new ArrayList<String>(words.subList(params.maxPhrase - len, params.maxPhrase)), " ");

            long hash = Utils.longHash(term);

            // cheap hash lookup first; the actual string comparison against
            // the lexicon would go here, to rule out hash collisions
            if (params.lexicon.isTermHash(hash)) {
                foundHashes.add(hash);
            }
        }
    }

    // tally the phrases found (a set is used above, so each phrase is
    // counted at most once per document; move the counting into the loop
    // if you want per-occurrence counts)
    for (long hash : foundHashes) {
        Integer c = count.get(hash);
        count.put(hash, c == null ? 1 : c + 1);
    }
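
To turn those counts into the ranked list the question asks for, you could sort the map entries by value; a follow-up fragment in the same scope (assumes the count map above, plus java.util.Collections and java.util.Comparator imports):

    // sort phrase hashes by count, highest first; mapping the hashes
    // back to phrase text would go through the lexicon
    List<Map.Entry<Long, Integer>> ranked =
            new ArrayList<Map.Entry<Long, Integer>>(count.entrySet());
    Collections.sort(ranked, new Comparator<Map.Entry<Long, Integer>>() {
        public int compare(Map.Entry<Long, Integer> a, Map.Entry<Long, Integer> b) {
            return b.getValue().compareTo(a.getValue());
        }
    });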
itsadok
A few hundred milliseconds is acceptable. I'll give this approach a go.
Tourch
A: 

Maybe reading Peter Turney on keyphrase extraction will give you some ideas. Overall, his approach has some similarity to what itsadok has suggested.

Yuval F